Text Analysis in the Wild

[Scan of the Globe and Mail word frequency box]

The Globe and Mail on November 13th had an interesting example of text analysis in the wild. Spanning pages A10 and A11, they ran a box with the high-frequency words of the old citizenship guide and the new one, with a word cloud in the middle. Here is what the description says:

Discover Canada, a different look at the country

The new citizenship guide, Discover Canada, is a much more comprehensive look at Canada’s history and system of government than its predecessor, A Look at Canada, which was produced under the Liberals in 1995. It’s longer (17,536 words to 10,433), with 10 pages devoted to Canadian history, compared to two in the previous version. Its emphasis also differs, with more attention paid to the military, the Crown and Quebec, and less to the environment.

>> Below is a graphic representation of the most frequently used words in the new citizenship guide. The bigger the word, the more often it appears.

I had to fold the page to scan it, as it is longer than my scanner, but you get the idea. The PDF is here. I would have preferred the two lists at either edge of the box to be closer together so we could compare them. Note the small print: they used Many Eyes and WriteWords, which has a word frequency counting tool.
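The Globe and Mail doesn’t say how the counts were made, but this sort of comparison is easy to reproduce. Here is a minimal Python sketch, assuming you have plain-text copies of both guides (the file names below are placeholders):

```python
import re
from collections import Counter

def word_frequencies(path, stopwords=frozenset()):
    """Count word frequencies in a plain-text file, ignoring case and stopwords."""
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    return Counter(w for w in words if w not in stopwords)

# The file names are placeholders; supply your own plain-text copies.
old = word_frequencies("a_look_at_canada.txt")
new = word_frequencies("discover_canada.txt")

print("Top words in the old guide:", old.most_common(10))
print("Top words in the new guide:", new.most_common(10))

# Words whose counts grew the most between the two versions.
shifts = Counter({w: new[w] - old[w] for w in new})
print("Biggest gains in the new guide:", shifts.most_common(10))
```

A word cloud is then just a rendering of these counts, with font size scaled to frequency.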

Digital Humanities And Computer Science colloquium

I’m at the Digital Humanities And Computer Science (link to my report) colloquium at IIT in Chicago. Garry Wong and I gave a talk on the Big See project and designing visualizations for large-scale information displays. One of the things that struck me is that we may be seeing the beginning of the end of digital humanities as a distinct field. Here is what I wrote in the conference report:

The End of Digital Humanities: I can’t help thinking (with just a little evidence) that the age of funding for digital humanities is coming to an end. Let me clarify this. My hunch is that the period when any reasonable digital humanities project seemed neat and innovative is coming to an end and that the funders are getting tired of more tool projects. I’m guessing that we will see a shift to funding content-driven projects that use digital methodologies. Thus digital humanities programs may disappear and the projects be shunted into content areas like philosophy, English literature and so on. Accompanying this is a shift to thinking of digital humanities as infrastructure that therefore isn’t for research funding, but instead should be run as a service by professionals. This is the “stop reinventing the wheel” argument, and in some cases it is accompanied by coercive rhetoric to the effect that if you don’t get on the infrastructure bandwagon and use standards then you will be left out (or not funded). I guess I am suggesting that we could be seeing a shift in what is considered legitimate research and what is considered closed and therefore ready for infrastructure. The tool project could be on the way out as research as it is moved as a problem into the domain of support (of infrastructure). Is this a bad thing? It certainly will be a good thing if it leads to robust and widely usable technology. But could it be a cyclical trend where today’s research becomes tomorrow’s infrastructure, only to be rediscovered later as a research problem all over again?

JSTOR: Data for Research Visualization

"Dialogue" in Philosophy Journals
"Dialogue" in Philosophy Journals

Thanks to Judith I have been playing with JSTOR’s Data for Research (DfR). It provides a faceted way of visualizing and searching the entire JSTOR database. Features include:

  • Full-text and fielded searching of the entire JSTOR archive using a powerful faceted search interface. Using this interface one can quickly and easily define content of interest through an iterative process of searching and results filtering.
  • Online viewing of document-level data including word frequencies, citations, key terms, and ngrams.
  • Request and download datasets containing word frequencies, citations, key terms, or ngrams associated with the content selected.
  • API for content selection and retrieval. (from the About page)

I’m impressed by how much they expose. They even have a Submit Data Request feature and an API. This is important: we are seeing a large-scale repository exposing its information to new types of queries, not just search.
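To give a sense of what querying such an exposed repository could look like, here is a hedged Python sketch. The endpoint URL and parameter names below are invented for illustration and are not DfR’s actual interface; consult JSTOR’s own documentation for the real API:

```python
# Hypothetical sketch of querying a DfR-style API for ngram data.
# The endpoint URL and parameter names are invented for illustration.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

BASE = "https://dfr.example.org/api/ngrams"  # placeholder, not the real endpoint

def fetch_ngrams(query, discipline=None, limit=100):
    """Request ngram counts for documents matching a faceted query."""
    params = {"q": query, "limit": limit}
    if discipline:
        params["facet.discipline"] = discipline
    with urlopen(BASE + "?" + urlencode(params)) as response:
        return json.load(response)

# e.g. ngrams for articles mentioning "dialogue" in philosophy journals,
# like the chart above
data = fetch_ngrams("dialogue", discipline="philosophy")
for item in data.get("ngrams", [])[:10]:
    print(item)
```

The point is less the specific calls than the pattern: select content through facets, then pull derived data (frequencies, citations, ngrams) for analysis in your own tools.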

Pontypool Changes Everything

[Poster for Honey, the missing cat]

Back to Pontypool, the semiotic zombie movie that has infected me. The image above is of the poster for the missing cat Honey that seems to have something to do with the start of the semiotic infection. The movie starts with Grant Mazzy’s voice over the radio talking about,

Mrs French’s cat is missing. The signs are posted all over town. Have you seen Honey? Well, we have all seen the posters, but nobody has seen Honey the cat. Nobody, until last Thursday morning when Miss Collettepiscine … (drove off the bridge to avoid the cat)

He goes on to pun on “Pontypool” (the name of the town the movie takes place in), Miss Collettepiscine’s name (French for “panty-pool”), and the local name of the bridge she drove off. He keeps repeating variations of Pontypool, a hint at the language virus to come.

As for the language virus, I replayed parts of the movie where they talk about it. At about 58 minutes in they hear the character Ken clearly get infected and begin to repeat himself as they talk on the cellphone. Dr. Mendez concludes, “That’s it, he is gone. He is just a crude radio signal, seeking.” A little later Mendez gets it and proposes,

Mendez: No … it can’t be, it can’t be. It’s viral, that much is clear. But not of the blood, not of the air, not on or even in our bodies. It is here.

Grant: Where?

Mendez: It is in words. Not all words, not all speaking, but in some. Some words are infected. And it spreads out when the contaminated word is spoken. Ohhhh. We are witnessing the emergence of a new arrangement for life … and our language is its host. It could have sprung spontaneously out of a perception. If it found its way into language it could leap into reality itself, changing everything. It may be boundless. It may be a God bug.

Grant: OK, Dr. Mendez. Look, I don’t even believe in UFOs, so I … I’ve got to stop you there with that God bug thing.

Mendez: Well that is very sensible because UFOs don’t exist. But I assure you, there is a monster loose and it is bouncing through our language, frantically trying to keep its host alive.

Grant: Is this transmission itself … um …

Mendez: No, no, no, no. If the bug enters us, it does not enter by making contact with our eardrum. It enters us when we hear the word and we understand it. Understand?

It is when the word is understood that the virus takes hold. And it copies itself in our understanding.

Grant: Should we be … talking about this?

Sydney: What are we talking about?

Grant: Should we be talking at all?

Mendez: Well, to be safe, no, probably not. Talking is risky, and well, talk radio is high risk. And so … we should stop.

Grant: But, we need to tell people about this. People need to know. We have to get this out.

Mendez: Well it’s your call Mr Mazzy. But let’s just hope that your getting out there doesn’t destroy your world.

As one thoughtful review essay points out, Pontypool is not the first to play with the meme of information viruses that can infect us. Snow Crash, the Neal Stephenson novel that features a language virus, even appears in the movie.

Pontypool itself is infectious, morphing from form to form. Sequels are threatened. The book, Pontypool Changes Everything, which starts with a character who keeps Ovid’s Metamorphoses with him, led to the movie, which led to the radio play, which was created by re-editing the movie audio (and which apparently has a different ending involving “paper”).

Google Book Search Settlement

The Google Book Search Settlement, if approved by Judge Chin, may be a turning point in textual research. In principle, if the settlement goes through, then Google will release the full 7-10 million books for research (“non-consumptive”) use. Should we get even just the 500,000 public domain books for research, we will have a historic corpus far larger than anything else. To quote the Greg Crane D-Lib article, “What can you do with a million books?” and “What effect will millions of books have on the textual disciplines?”

There are understandably a lot of concerns about the settlement, especially about the ownership of orphan works. The American Library Association has a web site on the settlement, as do others. I think we also need to start talking about how to develop a research infrastructure that allows the millions of books to be used effectively. What would it look like? What could we do? Some ideas:

  • To be usable only by researchers, there would have to be some sort of reasonable firewall.
  • It would be nice if it were truly multilingual/multicultural from the start. The books are, after all.
  • It would be nice if there were a mechanism for researchers to correct the OCRed text where they see typos. Why couldn’t we clean up the plain text together? (A minimal sketch of such a mechanism follows this list.)
  • It would be nice if there were an open-architecture search engine scaled to handle the collection and usable by research tools.
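As a sketch of the correction idea above: readers could submit replacements for spans of a book’s OCRed text, which are then applied to produce a cleaned version. Everything here (the class, its fields, and the sample text) is hypothetical, just one way such a mechanism might record corrections:

```python
# A minimal, hypothetical sketch of crowd-sourced OCR correction:
# readers submit replacements for spans of a book's plain text, and
# corrections are applied in offset order to produce a cleaned version.
from dataclasses import dataclass

@dataclass
class Correction:
    book_id: str      # identifier of the scanned book
    start: int        # character offset where the OCR error begins
    end: int          # character offset where it ends (exclusive)
    replacement: str  # what the reader says the text should be
    reader: str       # who submitted it, for review and reputation

def apply_corrections(text, corrections):
    """Apply non-overlapping corrections, working right to left so
    earlier offsets stay valid as the text changes length."""
    for c in sorted(corrections, key=lambda c: c.start, reverse=True):
        text = text[:c.start] + c.replacement + text[c.end:]
    return text

page = "Tlie quick brown fox jumped over tbe lazy dog."
fixes = [
    Correction("book-001", 0, 4, "The", "reader-42"),
    Correction("book-001", 33, 36, "the", "reader-7"),
]
print(apply_corrections(page, fixes))
```

A real service would also need review queues and versioning, but the core record (who changed what span, and to what) stays this simple.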

Update: Matt pointed me to an article in the Wall Street Journal on Tech’s Bigs Put Google’s Books Deal In Crosshairs.

Hacking as a Way of Knowing: Our Project on Flickr

[Photo of the projection]

I put a photo set up on Flickr for our Hacking as a Way of Knowing project. The set documents the evolution of the project which I’ve tentatively named the “ReReader for the Writing on the Wall”. Thanks to all those who made the project and the workshop a success. Now I have to think a bit deeper about making as knowing and things as theories.

IBM Watson: Question Answering for Jeopardy

[Image: Jeopardy board]

From the IBM Labs YouTube presence, news of an IBM “Watson” System to Challenge Humans at Jeopardy! IBM’s Watson is a Question Answering system that IBM scientists hope “will be able to understand complex questions and answer with enough precision and speed to compete on Jeopardy!” (From the IBM press release.)

Watson will be designed to deftly handle semantics (the meanings behind words), which will enable it to answer questions that require the identification of relevant and irrelevant content, the interpretation of ambiguous expressions and puns, the decomposition of questions into sub-questions, and the logical synthesis of final answers. In addition, Watson will compute a statistical confidence in the responses it provides. Watson will be designed to do all of this in a matter of seconds, which will enable it to compete against humans, who have the ability to know what they know in less than a second. (From Addendum: About IBM’s Watson System)
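The release doesn’t explain how that confidence would be computed. As a toy illustration of the general idea, combining scores from several evidence sources into a single confidence, here is a Python sketch in which the candidate answers, scores, and weights are all invented:

```python
# Toy sketch of statistical confidence over candidate answers, loosely
# inspired by the press release's description; the candidates, evidence
# scores, and weights are invented for illustration.
import math

def confidence(scores, weights):
    """Combine per-evidence scores into a 0-1 confidence via a
    weighted sum passed through a logistic function."""
    z = sum(w * s for w, s in zip(weights, scores))
    return 1 / (1 + math.exp(-z))

# Each candidate answer has scores from several evidence sources,
# e.g. (keyword match, answer-type match, passage support).
candidates = {
    "Toronto": (0.2, 0.1, 0.3),
    "Chicago": (0.9, 0.8, 0.7),
}
weights = (1.5, 2.0, 1.0)

ranked = sorted(candidates.items(),
                key=lambda kv: confidence(kv[1], weights),
                reverse=True)
for answer, scores in ranked:
    print(f"{answer}: confidence {confidence(scores, weights):.2f}")
```

The interesting part for Jeopardy is the threshold: the system must decide whether its best candidate is confident enough to be worth buzzing in on.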

The language IBM uses around this project is that of a “Grand Challenge.” It is smart how they have taken an analytical problem and used Jeopardy to give the research a target, both in terms of speed and the types of questions handled. Jeopardy also gives them a dramatic venue to demonstrate their progress, just as Deep Blue playing Kasparov did.

The research is based on an Open Architecture for Question Answering (OAQA) that was jointly developed with Carnegie Mellon.