Text Analysis – Page 16

IBM’s “Watson” Computing System to Challenge All Time Greatest Jeopardy! Champions

Richard drew my attention to the upcoming competition between IBM’s Watson deep question-answering system and top Jeopardy! champions, announced in IBM’s “Watson” Computing System to Challenge All Time Greatest Jeopardy! Champions. I’ve blogged on Watson before; it’s a custom system designed to mine large collections of data for answers to questions. Here is what IBM says about its applications:

Beyond Jeopardy!, the technology behind Watson can be adapted to solve problems and drive progress in various fields. The computer has the ability to sift through vast amounts of data and return precise answers, ranking its confidence in its answers. The technology could be applied in areas such as healthcare, to help accurately diagnose patients, to improve online self-service help desks, to provide tourists and citizens with specific information regarding cities, prompt customer support via phone, and much more.

TagCrowd

TagCrowd is another web-based word cloud generator that seems clean and works on URLs, uploaded files, and pasted text. They also offer a commercial version for a small license fee.

Springer Realtime Visualizations

From Judith I discovered Springer Realtime visualizations. The image above is a visualization that shows you each download of journal content as it happens. I wonder if one could play a Tetris-like game with this. Go in and see it here. The others are fairly common.

Christopher Collins: Research

Ian pointed me to a cool visualization project by Collins and colleagues at UOIT called Docuburst. Docuburst uses WordNet and visualizes the distribution of words in a WordNet tree starting from a node you select. Their code is made available.
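
To make the idea concrete, here is a minimal sketch of rolling word counts up a hierarchy from a chosen node. It is not DocuBurst's code: the tiny tree is hand-made and only stands in for a WordNet subtree, and the names TOY_TREE and subtree_counts are made up for this illustration.

```python
from collections import Counter

# Assumption: a hand-made toy hierarchy standing in for a WordNet subtree.
TOY_TREE = {
    "animal": ["dog", "cat", "bird"],
    "bird": ["sparrow", "hawk"],
    "dog": [],
    "cat": [],
    "sparrow": [],
    "hawk": [],
}

def subtree_counts(root, tree, word_counts):
    """Return {node: occurrences of the node plus all of its descendants}."""
    totals = {}

    def visit(node):
        # Start with the node's own count, then add each child's subtree total.
        total = word_counts.get(node, 0)
        for child in tree.get(node, []):
            total += visit(child)
        totals[node] = total
        return total

    visit(root)
    return totals

# Hypothetical sample text, split on whitespace for simplicity.
text = "the dog chased the cat while a hawk watched the dog and a sparrow"
word_counts = Counter(text.split())
totals = subtree_counts("animal", TOY_TREE, word_counts)
print(totals)
```

A visualization could then size each node of the tree by its total, so that a branch with many occurrences in the text stands out even when the branch word itself never appears.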

TAPoR portal has moved

The TAPoR Portal has moved to a new server at the University of Alberta. The new location will allow us here to start redesigning it and developing version 2.0. (Or is it now version 3.0?) I underestimated how much work it is to move something so complex. We had to work on bugs, we had to warn users, we had to set up hardware here. Kamal Ranaweera worked very hard to do this – Bravo!

Some links related to the move:

If you have trouble with the portal go to http://tada.mcmaster.ca/Main/TAPoRPortalMove for information
If you are interested in the redesign go to http://tada.mcmaster.ca/Main/TAPoRRedesign

NY Times: Humanities Scholars Embrace Digital Technology

The next big idea is data, according to a New York Times article, Humanities Scholars Embrace Digital Technology, by Patricia Cohen (November 16, 2010). The article reports on some of the big data interpretation projects, such as those funded by the Digging Into Data program, including the Mining with Criminal Intent project I am on.

Members of a new generation of digitally savvy humanists argue it is time to stop looking for inspiration in the next political or philosophical “ism” and start exploring how technology is changing our understanding of the liberal arts. This latest frontier is about method, they say, using powerful technologies and vast stores of digitized materials that previous humanities scholars did not have.

I’m not sure this is a new generation as we have been at this for a while, but perhaps the point is that the new generation is now looking away from theory towards the large-scale data issues.

What stands out about the projects mentioned and others is that the digital humanities and design fields are developing new and subtler forms of large-scale data mining and interpretation that use methods from other disciplines along with a sensitivity to the nature of the data and the questions we want to ask. The image above comes from Stanford’s Visualization of Republic of Letters project. There is nothing new about visualization or network analysis, but digital humanists are trying to adapt methods to messy human data – in other words interpreting the really interesting stuff so that it makes sense of something to someone.

Perhaps we may be able to show that the following theses are true and important to the broader community:

Interesting data has to be interpreted to be interesting. Someone has to pose the questions that make data useful.
There is too much data and it is messy; therefore it can’t be interpreted automatically. Real world analysis always involves questions, choices, data curation, mixing techniques, and iterative interpretation of results to generate knowledge.
Interesting data always has to be explained to someone in some context. Results are only useful knowledge if they are published in some fashion that makes them accessible to an intended audience.
Humanists have been the curators and interpreters of information which is why the subtle skills of questioning, curating, editing, analyzing, interpreting and representing are all the more needed now. Without humanists (and I include librarians and archivists in this category) who are comfortable with digital data and methods we will have only too much data and too many unused tools.

Thanks to Judith for pointing me to this NYT article.

2nd Edition of Icon Programming for Humanists

I just got a notice that the 2nd Edition of Icon Programming for Humanists (PDF) by Alan D. Corré has been put up (and it’s free). This has been made available by Jeffery Books, who will also sell you a paperback copy. Donations go to promoting the Icon and Unicon programming languages and systems.

I read Icon Programming for Humanists ages ago. It was one of the few how-to-program books that were aimed at humanists with text manipulation examples. I thought the book excellent and was only held back because I couldn’t find an Icon interpreter for the Mac when I looked.

This edition has two new chapters that deal with Unicode (so you can analyze texts in different languages) and Markup (so you can work with TEI-encoded texts).

There is a recurring issue that crops up as to whether we should be teaching humanities students to program or just to use tools. Corré’s book would make a good textbook for teaching programming.

Text Analysis in the Wild: Steve Jobs’s Android Obsession Analyzed

I came across this example of text analysis in the wild using a wordle, Steve Jobs’s Android Obsession Analyzed. The short article is by David Zax in Fast Company (October 19, 2010). Based on “Android” coming up as the highest-frequency content word, Zax reads obsession.

So yes, the Android weighs heavily upon Jobs’s mind; and his dreams are more than likely populated with ravenous green robots consuming everything in their path.
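
For anyone who wants to try this kind of reading themselves, here is a minimal sketch of finding the highest-frequency content word in a text. The stop list is a tiny made-up one and the sample text is hypothetical; it is not the text Zax analyzed, and a wordle applies its own filtering.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop list; real tools ship much longer ones.
STOP_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "it",
    "that", "we", "on", "for", "our", "are", "with",
}

def content_word_counts(text, stop_words=STOP_WORDS):
    """Count lowercase word tokens, skipping anything in the stop list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# Hypothetical sample text, written for this example only.
sample = (
    "Android is fragmented. We think Android is a mess for developers, "
    "and the Android market is split across many app stores."
)

counts = content_word_counts(sample)
top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count)
```

The count only says which word is most frequent. Reading "obsession" into that is the interpretive step, and it depends on what was left out by the stop list.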

IMS Open Corpus Workbench

John pointed me to an interesting open source project, the IMS Open Corpus Workbench. This project has developed tools for “managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations.” Obviously it has a linguistics bent, but the tools seem to be well documented and usable.

You can see an example of an interesting interface to the Corpus Workbench at BwanaNet – a wizard-like interface where you go through 5 steps to get results on an English, Catalan, and Spanish corpus.
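
To give a sense of what querying a corpus with linguistic annotations means, here is a small sketch in which every token carries a part-of-speech tag and a lemma, and a query is a sequence of attribute constraints. This is not the Corpus Workbench's query language or data format; the Token type, the tag names, the mini corpus and find_sequences are all made up for illustration.

```python
from typing import NamedTuple

class Token(NamedTuple):
    word: str
    pos: str    # part-of-speech tag (tag names here are illustrative)
    lemma: str

# Hypothetical hand-annotated mini corpus.
CORPUS = [
    Token("The", "DET", "the"),
    Token("old", "ADJ", "old"),
    Token("texts", "NOUN", "text"),
    Token("were", "VERB", "be"),
    Token("digitized", "VERB", "digitize"),
    Token("and", "CONJ", "and"),
    Token("a", "DET", "a"),
    Token("new", "ADJ", "new"),
    Token("text", "NOUN", "text"),
    Token("appeared", "VERB", "appear"),
]

def find_sequences(corpus, pattern):
    """Return word sequences whose tokens satisfy a list of constraints.

    Each pattern element is a dict such as {"pos": "ADJ"}; a token matches
    when every listed attribute equals the given value.
    """
    hits = []
    n = len(pattern)
    for i in range(len(corpus) - n + 1):
        window = corpus[i:i + n]
        if all(getattr(tok, attr) == val
               for tok, cons in zip(window, pattern)
               for attr, val in cons.items()):
            hits.append([tok.word for tok in window])
    return hits

# Query: any adjective followed by any form of the lemma "text".
hits = find_sequences(CORPUS, [{"pos": "ADJ"}, {"lemma": "text"}])
print(hits)
```

Matching on the lemma rather than the word form is what lets one query catch both "texts" and "text", which is the kind of thing plain string search does not give you.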

AlchemyAPI – Transforming Text Into Knowledge

Stéfan pointed me to the AlchemyAPI service. AlchemyAPI provides an API for extracting “information about people, places, companies, topics, languages” and concepts. They have a nice demo on the front page where they take a top news story, extract the entities and then create a spring-loaded graph of the named entities.

You can see that for this story the system found organizations, a city, countries and persons.

A free API key is available for up to 30,000 calls a day.