Well, my vacation is over and I’m facilitating a retreat on text methods across disciplines. (See Towards a Methods Commons.) With support from the ITST program at SSHRC we brought together 15 linguists, philosophers, historians, and literary scholars to discuss methods in a structured way. The goal is to sketch a commons that gathers “recipes” showing people how to do research tasks with electronic texts. Stay tuned for a draft web site in about 6 months.
Google: Our commitment to the digital humanities
Google has announced the first projects it is funding to use Google Books, along with a commitment of nearly a million dollars to the digital humanities. See Official Google Blog: Our commitment to the digital humanities.
we’d like to see the field blossom and take advantage of resources such as Google Books that are becoming increasingly available. We’re pleased to announce that Google has committed nearly a million dollars to support digital humanities research over the next two years.
Society for Digital Humanities Papers
With my graduate students and colleagues I was involved in a number of papers at the SDH-SEMI (The Society for Digital Humanities / La Société pour l’Étude des Médias Interactifs) conference at Congress 2010 in Montreal. They included:
- “Exclusionary Practices: A Historical Look at Public Representations of Computers in the 1950s and Early 1960s” presented by Sophia Hoosien
- “Before the Moments of Beginning” presented by Victoria Smith
- I presented on “Cyberinfrastructure for Research in the Humanities: Expectations and Capacity”
- “Text Analysis for me Too: An embeddable text analysis widget” presented by Peter Organisciak
- Daniel Sondheim talked about how the interface of the citation has changed from print to the web, as part of a panel on INKE Interface Design.
- “Theorizing Analytics” was presented by Stéfan Sinclair
- “Academic Capacity in Canada’s Digital Humanities Community: Opportunities and Challenges” was presented by Lynne Siemens
- “What do we say about ourselves? An analysis of the Day of DH 2009 data” was presented by Peter Organisciak
- I presented on “The Unreality of the Timeline” as part of a panel on temporal modeling at the CHA
As the papers get posted, I’ll blog them.
U of A text mining project could help businesses
Well, I made it into the computer press in Canada. An article on the Digging Into Data project I am working on has been published; see U of A text mining project could help businesses (Rafael Ruffolo, March 25, 2010, ComputerWorld Canada).
It is always interesting to see what the media pick up on in a story. They usually have a better idea of what their audience wants to read about, so they adapt the story for that audience.
Who’s your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling
On the 18th of March we ran the second Day of Digital Humanities, which seems to have been a success. We had more participants and some interesting analysis. Matt Jockers, for example, tried Latent Dirichlet Allocation on the blogs and wrote up the results on his blog in a post, Who’s your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling. Neat!
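For readers curious about the general idea, here is a minimal sketch of topic-model matching (not Jockers’s actual code): fit an LDA model over a small set of blog texts and pair each blogger with the most similar topic mixture. It uses scikit-learn; the sample texts, number of topics, and cosine-similarity matching are my own placeholder assumptions.

```python
# Sketch: fit LDA over blog texts, then match each blogger to the blogger
# whose topic mixture is most similar. Sample texts are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

blogs = {
    "blogger_a": "text encoding TEI markup manuscripts editions transcription",
    "blogger_b": "topic modeling corpus text mining visualization tools",
    "blogger_c": "teaching students classroom syllabus assignments markup TEI",
}

names = list(blogs)
counts = CountVectorizer(stop_words="english").fit_transform(blogs.values())

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # one topic mixture per blogger

sims = cosine_similarity(doc_topics)
np.fill_diagonal(sims, -1)               # ignore self-matches
for i, name in enumerate(names):
    print(name, "->", names[sims[i].argmax()])
```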
The General Inquirer
Reading John B. Smith’s “Computer Criticism” (Style, Vol. XII, No. 4), I came across a reference to a content analysis program from the 1960s called The General Inquirer. This program still has a following and has been rewritten in Java. See the Inquirer Home Page. There is a web version where you can try it here (DO NOT USE A LARGE TEXT).
The General Inquirer “maps” a text to a thesaurus of categories, disambiguating on the way. The web page about How the General Inquirer is used describes what it does thus:
The General Inquirer is basically a mapping tool. It maps each text file with counts on dictionary-supplied categories. The currently distributed version combines the “Harvard IV-4” dictionary content-analysis categories, the “Lasswell” dictionary content-analysis categories, and five categories based on the social cognition work of Semin and Fiedler, making for 182 categories in all. Each category is a list of words and word senses. A category such as “self references” may contain only a dozen entries, mostly pronouns. Currently, the category “negative” is our largest with 2291 entries. Users can also add additional categories of any size.
As they say later on, their categories were developed for “social-science content-analysis research applications” and not for other uses like literary study. The original developer published a book on the tool in 1966:
Philip J. Stone, The General Inquirer: A Computer Approach to Content Analysis. (Cambridge: M. I. T. Press, 1966).
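To give a sense of the mechanics, here is a toy sketch of dictionary-based category counting in the spirit of the General Inquirer. The categories and word lists below are invented placeholders, not the Harvard IV-4 or Lasswell dictionaries, and no word-sense disambiguation is attempted.

```python
# Toy sketch of General Inquirer-style mapping: count how many tokens of a
# text fall into each dictionary-supplied category. Categories are made up.
import re
from collections import Counter

CATEGORIES = {
    "self_references": {"i", "me", "my", "mine", "myself"},
    "negative": {"bad", "wrong", "fail", "poor", "never"},
    "positive": {"good", "right", "succeed", "fine", "always"},
}

def categorize(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for category, words in CATEGORIES.items():
        counts[category] = sum(1 for t in tokens if t in words)
    return counts

print(categorize("I never said my results were bad; I think they are good."))
# Counter({'self_references': 3, 'negative': 2, 'positive': 1})
```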
Ritsumeikan: Possibilities in Digital Humanities
For the last week and a bit I have been in Kyoto to give a talk at a conference on the “Possibilities in Digital Humanities,” which was organized by Professor Kozaburo Hachimura and sponsored by the Information Processing Society of Japan and the Ritsumeikan University Digital Humanities Center for Japanese Arts and Culture.
While the talks were in Japanese, I was able to follow most of the sessions with the help of Mitsuyuki Inaba and Keiko Suzuki. I was impressed by the quality of the research and the involvement of new scholars. There seemed to be a much higher participation of postdoctoral fellows and graduate students than at similar conferences in Canada, which bodes well for digital humanities in Japan.
Teaching Literature and Language Online
A paper that Stéfan Sinclair and I wrote on “Between Language and Literature: Digital Text Exploration” has just been published by the MLA in a volume edited by Ian Lancashire, Teaching Literature and Language Online.
Information Visualization for Text Analysis
Googling around, I came across a nice, succinct chapter on Information Visualization for Text Analysis from a book called Search User Interfaces by Marti Hearst (Cambridge University Press, 2009).
The chapter goes from visualizations for text mining to concordances and then to citation relationships. It shows some of the usual suspects like TextArc and Wordle.
Text Analysis in the Wild
The Globe and Mail on November 13th had an interesting example of text analysis in the wild. Crossing pages A10 and A11, they had a box with the high-frequency words in the old citizenship guide and the new one, with a word cloud in the middle. Here is what the description says:
Discover Canada, a different look at the country
The new citizenship guide, Discover Canada, is a much more comprehensive look at Canada’s history and system of government than its predecessor, A Look at Canada, which was produced under the Liberals in 1995. It’s longer (17,536 words to 10,433), with 10 pages devoted to Canadian history, compared to two in the previous version. Its emphasis also differs, with more attention paid to the military, the Crown and Quebec, and less to the environment.
>> Below is a graphic representation of the most frequently used words in the new citizenship guide. The bigger the word, the more often it appears.
I had to fold the page to scan it as it is longer than my scanner, but you get the idea. The PDF is here. I would have preferred the two lists at either edge of the box to be closer together so we could compare them. Note the small print: they used Many Eyes and WriteWords, which has a word frequency counting tool.
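For anyone who wants to reproduce this kind of comparison, here is a hedged sketch of counting and listing the high-frequency words of two texts side by side. The file names and the tiny stop word list are placeholders, not the actual guides or the tools the Globe used.

```python
# Sketch: compare the top words of two documents, in the spirit of the
# newspaper's side-by-side frequency lists. File names are placeholders.
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "for", "are", "with"}

def top_words(text, n=10):
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS).most_common(n)

old_guide = open("a_look_at_canada.txt").read()   # placeholder file name
new_guide = open("discover_canada.txt").read()    # placeholder file name

for (w_old, c_old), (w_new, c_new) in zip(top_words(old_guide), top_words(new_guide)):
    print(f"{w_old:<15}{c_old:<8}{w_new:<15}{c_new}")
```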