I’m at Digital Humanities 2011: Big Tent Digital Humanities at Stanford University. I was involved in two workshops before the conference on Visualization for Literary History and Text Analysis with Voyeur. (You can see the script to the Voyeur one at DH2011 Voyeur Tools.) I’m also involved in a paper on “Computing in Canada: A History of the Incunabular Years” presented by Victoria Smith and a panel on The Interface to the Collection organized by the INKE Interface Design team. One gratifying thing to see is the visibility of the University of Alberta in the DH 2011 Visualizations set up by the conference organizers. If you zoom in to the different visualizations you will see the number of participants from U of A.
Category: Text Analysis
Biblos.com: Community of Tools
Brent Nelson, in an SDH-SEMI talk, showed Biblos.com: Bible Tools. Biblos has developed a knowledge environment with all sorts of tools for studying the Bible. It is a neat example of how a community can develop the tools it needs and share them.
Infomous Clouds
I was on The Atlantic site and noticed a neat visualization badge by Infomous. It is a variant on the usual word cloud that draws lines between related words and puts simple cloud circles around related words. As you can see it doesn’t always get the clouds right. On the left you have Japan connected to protesters and protesters connected to Syria. There is not, however, any connection between Japan and Syria except that protests are happening in both.
If you get an account Infomous lets you make your own clouds.
Update: Pablo Funes from Icosystem Corp sends this email comment on the post:
We use Mark Newman’s algorithm for network communities to identify clusters of news. In your example, Japan and Syria are both connected to “protesters” and therefore share the same cluster even though there are no news articles that bear on both Japan and Syria (so there is no direct connection between both terms). One could argue, with this example at least, that there is a worldwide series of events that have been unfolding over the last few months, with public protests as the visible common feature (Tunisia, Egypt, Libya, and so on) which makes the connection “countries where protests are happening” a relevant one. And yet, it is true that sometimes the connection is not relevant at all, as it happens when generic words, such as “video” or “said” for example, are shared across news stories.
Our Appinions-based clouds rely on sophisticated semantic analysis provided by Appinions.com (see http://www.infomous.com/site/events/JapanNuclear/). Here, topics are connected because they are discussed by the same web user in the same posting. We use the same algorithm to identify clusters in this network. You can turn off clustering by unchecking “groups” on the bottom toolbar.
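To make the point in the email concrete, here is a minimal Python sketch of how two terms can end up in the same cluster without ever appearing in the same story. The stories and term sets are made up for illustration, and grouping by connected components is only a crude stand-in that I am assuming for simplicity. It is not Mark Newman's community algorithm and not what Infomous actually runs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical stories reduced to key terms (illustrative data, not from Infomous).
stories = [
    {"japan", "protesters", "nuclear"},
    {"syria", "protesters", "government"},
    {"election", "budget"},
]


def build_graph(stories):
    """Link two terms whenever they appear in the same story."""
    graph = defaultdict(set)
    for terms in stories:
        for a, b in combinations(sorted(terms), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def components(graph):
    """Group terms into connected components.

    This is a simplification assumed for the sketch; a real community
    detection method would also split large, loosely connected groups.
    """
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


graph = build_graph(stories)
groups = components(graph)

# No story mentions both countries, so there is no direct edge between them.
directly_linked = "syria" in graph["japan"]
# They still land in the same group because both link to a shared term.
same_group = any({"japan", "syria"} <= g for g in groups)
shared = graph["japan"] & graph["syria"]

print(directly_linked, same_group, shared)
```

The shared neighbour is what pulls the two countries together. A generic word like "said" or "video" would do exactly the same thing, which is the weakness the email admits to.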
Topicmarks – summarize your text documents in minutes
Thanks to Shawn Day’s Day of DH I learned about Topicmarks – summarize your text documents in minutes. It is a commercial version of a basic text analysis tool for summarizing readings. They emphasize how much time you save by not reading the whole document analyzed. It reminds me of a playful name we had for a prototype recommendation engine, “Write My Paper”. Look at the screen shot. Some of their features are ones we had in TAPoR:
- Ability to paste text, use a URL, or upload a text
- Summarizer that combines different tool results
- Cooking metaphor (we have recipes)
To be honest, Topicmarks deserves points for a simple and clear interface and clear results. They don’t try to do everything. They are also clear on why you would use this (to save time reading).
Digging Into Data: Second Round Announced
The second round of the Digging Into Data challenge has just been announced and they now have one more country (the Netherlands) and eight international funders. (You can see the SSHRC Announcement here.)
The Digging Into Data challenge is an international grant program that funds groups that have teams in at least two countries so it is good that they are expanding the countries participating. What is even more extraordinary is that they have one adjudication process across all the funders (rather than an adjudication process where each national team has to apply to their own country’s program – which never works.)
I was part of one of the groups that got funding in the first round with the Criminal Intent project. I’ve found the collaboration very fruitful so I’m glad they are supporting this for another round.
Lancashire: Literary Alzheimer’s
In the category of things I meant to blog some time ago is Ian Lancashire and Graeme Hirst’s research into Agatha Christie’s Alzheimer’s-related dementia, which was written up by the New York Times in their list of notable ideas for 2009. The write-up is by Amanda Fortini, see Literary Alzheimer’s – The Ninth Annual Year in Ideas – Magazine. There is a longer article about this research by Judy Stoffman in the Insight section of the Toronto Star, An Agatha Christie mystery: Is Alzheimer’s on the page? (Jan. 23, 2010).
Lancashire’s specialty is the esoteric field of neuro-cognitive literary theory – in his words “what science says about the creative process versus what authors report about how they create their books.” He started to apply computer analysis to literary texts in 1982.
Ian Lancashire has links to the poster that first got attention and to a paper on his home page. He has also just published a book, Forgetful Muses: Reading the Author in the Text, that develops his neuro-cognitive literary theory.
NYT: Armies of Expensive Lawyers, Replaced by Cheaper Software
The New York Times has an article about commercial text analysis systems by John Markoff, Armies of Expensive Lawyers, Replaced by Cheaper Software (March 5, 2011, A1 in New York Edition; March 4 online). He describes how companies are building systems that can analyze the immense numbers of documents shared in lawsuits. Traditionally an army of people would comb through the documents. “Now, thanks to advances in artificial intelligence, ‘e-discovery’ software can analyze documents in a fraction of the time for a fraction of the cost.”
Some programs go beyond just finding documents with relevant terms at computer speeds. They can extract relevant concepts — like documents relevant to social protest in the Middle East — even in the absence of specific terms, and deduce patterns of behavior that would have eluded lawyers examining millions of documents.
There is a nice graphic to accompany the article here. Markoff mentions companies like Blackstone Discovery and Cataphora. He also argues that the availability of a large email archive from Enron has made it possible for teams to experiment on a real dataset.
Index Thomisticus Glossary
Mihaela scanned for me a page from the manual for the CD-ROM version of the Index Thomisticus. We were able to get the CD-ROM version through inter library loan and capture some screen shots. This page is an English to Latin Microglossary for users providing computer terms in Latin. Click the image to expand.
IBM’s “Watson” Computing System to Challenge All Time Greatest Jeopardy! Champions
Richard drew my attention to the upcoming competition between IBM’s Watson deep question and answer system and top Jeopardy! champions, IBM’s “Watson” Computing System to Challenge All Time Greatest Jeopardy! Champions. I’d blogged on Watson before – it’s a custom system designed to mine large collections of data for answers to questions. Here is what IBM says its applications are:
Beyond Jeopardy!, the technology behind Watson can be adapted to solve problems and drive progress in various fields. The computer has the ability to sift through vast amounts of data and return precise answers, ranking its confidence in its answers. The technology could be applied in areas such as healthcare, to help accurately diagnose patients, to improve online self-service help desks, to provide tourists and citizens with specific information regarding cities, prompt customer support via phone, and much more.
TagCrowd
TagCrowd is another web-based word cloud generator that seems clean and works on URLs, uploaded files, and pasted text. They also offer a commercial version for a small license fee.