Text Analysis in the Wild

[Scan of the Globe and Mail word frequency box]

The Globe and Mail on November 13th had an interesting example of text analysis in the wild. Spanning pages A10 and A11, they ran a box with the high-frequency words of the old citizenship guide and the new one, with a word cloud in the middle. Here is what the description says:

Discover Canada, a different look at the country

The new citizenship guide, Discover Canada, is a much more comprehensive look at Canada’s history and system of government than its predecessor, A Look at Canada, which was produced under the Liberals in 1995. It’s longer (17,536 words to 10,433), with 10 pages devoted to Canadian history, compared to two in the previous version. Its emphasis also differs, with more attention paid to the military, the Crown and Quebec, and less to the environment.

>> Below is a graphic representation of the most frequently used words in the new citizenship guide. The bigger the word, the more often it appears.

I had to fold the page to scan it, as it is longer than my scanner, but you get the idea. The PDF is here. I would have preferred the two lists at either edge of the box to be closer together so we could compare them. Note the small print: they used Many Eyes and WriteWords, which has a word frequency counting tool.
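The Globe and Mail doesn’t say how the counts were made, but this sort of comparison is easy to reproduce. Here is a minimal Python sketch, assuming you have plain-text copies of both guides (the file names below are placeholders):

```python
import re
from collections import Counter

def word_frequencies(path, stopwords=frozenset()):
    """Count word frequencies in a plain-text file, ignoring case and stopwords."""
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    return Counter(w for w in words if w not in stopwords)

# The file names are placeholders; supply your own plain-text copies.
old = word_frequencies("a_look_at_canada.txt")
new = word_frequencies("discover_canada.txt")

print("Top words in the old guide:", old.most_common(10))
print("Top words in the new guide:", new.most_common(10))

# Words whose counts grew the most between the two versions.
shifts = Counter({w: new[w] - old[w] for w in new})
print("Biggest gains in the new guide:", shifts.most_common(10))
```

A word cloud is then just a rendering of these counts, with font size scaled to frequency.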

Digital Humanities And Computer Science colloquium

I’m at the Digital Humanities And Computer Science (link to my report) colloquium at IIT in Chicago. Garry Wong and I gave a talk on the Big See project and designing visualizations for large-scale information displays. One of the things that struck me is that we may be seeing the beginning of the end of digital humanities as a distinct field. Here is what I wrote in the conference report:

The End of Digital Humanities: I can’t help thinking (with just a little evidence) that the age of funding for digital humanities is coming to an end. Let me clarify this. My hunch is that the period when any reasonable digital humanities project seemed neat and innovative is coming to an end and that the funders are getting tired of more tool projects. I’m guessing that we will see a shift to funding content-driven projects that use digital methodologies. Thus digital humanities programs may disappear and the projects be shunted into content areas like philosophy, English literature and so on. Accompanying this is a shift to thinking of digital humanities as infrastructure that therefore isn’t for research funding, but instead should be run as a service by professionals. This is the “stop reinventing the wheel” argument, and in some cases it is accompanied by coercive rhetoric to the effect that if you don’t get on the infrastructure bandwagon and use standards then you will be left out (or not funded). I guess I am suggesting that we could be seeing a shift in what is considered legitimate research and what is considered closed and therefore ready for infrastructure. The tool project could be on the way out as research as it is moved as a problem into the domain of support (of infrastructure). Is this a bad thing? It certainly will be a good thing if it leads to robust and widely usable technology. But could it be a cyclical trend where today’s research becomes tomorrow’s infrastructure, only to be rediscovered later as a research problem all over again?

JSTOR: Data for Research Visualization

"Dialogue" in Philosophy Journals
"Dialogue" in Philosophy Journals

Thanks to Judith I have been playing with JSTOR’s Data for Research (DfR). It provides a faceted way of visualizing and searching the entire JSTOR database. Features include:

  • Full-text and fielded searching of the entire JSTOR archive using a powerful faceted search interface. Using this interface one can quickly and easily define content of interest through an iterative process of searching and results filtering.
  • Online viewing of document-level data including word frequencies, citations, key terms, and ngrams.
  • Request and download datasets containing word frequencies, citations, key terms, or ngrams associated with the content selected.
  • API for content selection and retrieval. (from the About page)

I’m impressed by how much they expose. They even have a Submit Data Request feature and an API. This is important: we are seeing a large-scale repository exposing its information to new types of queries, not just search.
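To give a sense of what querying such an exposed repository could look like, here is a hedged Python sketch. The endpoint URL and parameter names below are invented for illustration and are not DfR’s actual interface; consult JSTOR’s own documentation for the real API:

```python
# Hypothetical sketch of querying a DfR-style API for ngram data.
# The endpoint URL and parameter names are invented for illustration.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

BASE = "https://dfr.example.org/api/ngrams"  # placeholder, not the real endpoint

def fetch_ngrams(query, discipline=None, limit=100):
    """Request ngram counts for documents matching a faceted query."""
    params = {"q": query, "limit": limit}
    if discipline:
        params["facet.discipline"] = discipline
    with urlopen(BASE + "?" + urlencode(params)) as response:
        return json.load(response)

# e.g. ngrams for articles mentioning "dialogue" in philosophy journals,
# like the chart above
data = fetch_ngrams("dialogue", discipline="philosophy")
for item in data.get("ngrams", [])[:10]:
    print(item)
```

The point is less the specific calls than the pattern: select content through facets, then pull derived data (frequencies, citations, ngrams) for analysis in your own tools.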

Pontypool Changes Everything

[Poster for Honey, the missing cat]

Back to Pontypool, the semiotic zombie movie that has infected me. The image above is of the poster for the missing cat Honey that seems to have something to do with the start of the semiotic infection. The movie starts with Grant Mazzy’s voice over the radio talking about,

Mrs French’s cat is missing. The signs are posted all over town. Have you seen Honey? Well, we have all seen the posters, but nobody has seen Honey the cat. Nobody, until last Thursday morning when Miss Collettepiscine … (drove off the bridge to avoid the cat)

He goes on to pun on “Pontypool” (the name of the town the movie takes place in), Miss Collettepiscine’s name (French for “panty-pool”), and the local name of the bridge she drove off. He keeps repeating variations of Pontypool, a hint at the language virus to come.

As for the language virus, I replayed parts of the movie where they talk about it. At about 58 minutes in they hear the character Ken clearly get infected and begin to repeat himself as they talk on the cellphone. Dr. Mendez concludes, “That’s it, he is gone. He is just a crude radio signal, seeking.” A little later Mendez gets it and proposes,

Mendez: No … it can’t be, it can’t be. It’s viral, that much is clear. But not of the blood, not of the air, not on or even in our bodies. It is here.

Grant: Where?

Mendez: It is in words. Not all words, not all speaking, but in some. Some words are infected. And it spreads out when the contaminated word is spoken. Ohhhh. We are witnessing the emergence of a new arrangement for life … and our language is its host. It could have sprung spontaneously out of a perception. If it found its way into language it could leap into reality itself, changing everything. It may be boundless. It may be a God bug.

Grant: OK, Dr. Mendez. Look, I don’t even believe in UFOs, so I … I’ve got to stop you there with that God bug thing.

Mendez: Well that is very sensible because UFOs don’t exist. But I assure you, there is a monster loose and it is bouncing through our language, frantically trying to keep its host alive.

Grant: Is this transmission itself … um …

Mendez: No, no, no, no. If the bug enters us, it does not enter by making contact with our eardrum. It enters us when we hear the word and we understand it. Understand?

It is when the word is understood that the virus takes hold. And it copies itself in our understanding.

Grant: Should we be … talking about this?

Sydney: What are we talking about?

Grant: Should we be talking at all?

Mendez: Well, to be safe, no, probably not. Talking is risky, and well, talk radio is high risk. And so … we should stop.

Grant: But, we need to tell people about this. People need to know. We have to get this out.

Mendez: Well it’s your call Mr Mazzy. But let’s just hope that your getting out there doesn’t destroy your world.

As one thoughtful review essay points out, Pontypool is not the first to play with the meme of information viruses that can infect us. Snow Crash, the Neal Stephenson novel that features a language virus, even appears in the movie.

Pontypool itself is infectious, morphing from form to form. Sequels are threatened. The book, Pontypool Changes Everything, which starts with a character who keeps Ovid’s Metamorphoses with him, led to the movie, which led to the radio play, which was created by re-editing the movie audio (and which apparently has a different ending involving “paper”).

Google Book Search Settlement

The Google Book Search Settlement, if approved by Judge Chin, may be a turning point in textual research. In principle, if the settlement goes through, then Google will release the full 7-10 million books for research (“non-consumptive”) use. Should we get even just the 500,000 public domain books for research, we will have a historic corpus far larger than anything else. To quote the Greg Crane D-Lib article, “What can you do with a million books?” and “What effect will millions of books have on the textual disciplines?”

There are understandably a lot of concerns about the settlement, especially about the ownership of orphan works. The American Library Association has a web site on the settlement, as do others. I think we also need to start talking about how to develop a research infrastructure that allows the millions of books to be used effectively. What would it look like? What could we do? Some ideas:

  • To be usable only by researchers, there would have to be some sort of reasonable firewall.
  • It would be nice if it were truly multilingual/multicultural from the start. The books are, after all.
  • It would be nice if there were a mechanism for researchers to correct the OCRed text where they see typos. Why couldn’t we clean up the plain text together? (A minimal sketch of such a mechanism follows this list.)
  • It would be nice if there were an open-architecture search engine scaled to handle the collection and usable by research tools.
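As a sketch of the correction idea above: readers could submit replacements for spans of a book’s OCRed text, which are then applied to produce a cleaned version. Everything here (the class, its fields, and the sample text) is hypothetical, just one way such a mechanism might record corrections:

```python
# A minimal, hypothetical sketch of crowd-sourced OCR correction:
# readers submit replacements for spans of a book's plain text, and
# corrections are applied in offset order to produce a cleaned version.
from dataclasses import dataclass

@dataclass
class Correction:
    book_id: str      # identifier of the scanned book
    start: int        # character offset where the OCR error begins
    end: int          # character offset where it ends (exclusive)
    replacement: str  # what the reader says the text should be
    reader: str       # who submitted it, for review and reputation

def apply_corrections(text, corrections):
    """Apply non-overlapping corrections, working right to left so
    earlier offsets stay valid as the text changes length."""
    for c in sorted(corrections, key=lambda c: c.start, reverse=True):
        text = text[:c.start] + c.replacement + text[c.end:]
    return text

page = "Tlie quick brown fox jumped over tbe lazy dog."
fixes = [
    Correction("book-001", 0, 4, "The", "reader-42"),
    Correction("book-001", 33, 36, "the", "reader-7"),
]
print(apply_corrections(page, fixes))
```

A real service would also need review queues and versioning, but the core record (who changed what span, and to what) stays this simple.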

Update: Matt pointed me to an article in the Wall Street Journal on Tech’s Bigs Put Google’s Books Deal In Crosshairs.

Hacking as a Way of Knowing: Our Project on Flickr

[Photo of the projection]

I put a photo set up on Flickr for our Hacking as a Way of Knowing project. The set documents the evolution of the project which I’ve tentatively named the “ReReader for the Writing on the Wall”. Thanks to all those who made the project and the workshop a success. Now I have to think a bit deeper about making as knowing and things as theories.

IBM Watson: Question Answering for Jeopardy

[Image: Jeopardy board]

From the IBM Labs YouTube presence, news of an IBM “Watson” System to Challenge Humans at Jeopardy! IBM’s Watson is a Question Answering system that IBM scientists hope “will be able to understand complex questions and answer with enough precision and speed to compete on Jeopardy!” (From the IBM press release.)

Watson will be designed to deftly handle semantics (the meanings behind words), which will enable it to answer questions that require the identification of relevant and irrelevant content, the interpretation of ambiguous expressions and puns, the decomposition of questions into sub-questions, and the logical synthesis of final answers. In addition, Watson will compute a statistical confidence in the responses it provides. Watson will be designed to do all of this in a matter of seconds, which will enable it to compete against humans, who have the ability to know what they know in less than a second. (From Addendum: About IBM’s Watson System)
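The release doesn’t explain how that confidence would be computed. As a toy illustration of the general idea, combining scores from several evidence sources into a single confidence, here is a Python sketch in which the candidate answers, scores, and weights are all invented:

```python
# Toy sketch of statistical confidence over candidate answers, loosely
# inspired by the press release's description; the candidates, evidence
# scores, and weights are invented for illustration.
import math

def confidence(scores, weights):
    """Combine per-evidence scores into a 0-1 confidence via a
    weighted sum passed through a logistic function."""
    z = sum(w * s for w, s in zip(weights, scores))
    return 1 / (1 + math.exp(-z))

# Each candidate answer has scores from several evidence sources,
# e.g. (keyword match, answer-type match, passage support).
candidates = {
    "Toronto": (0.2, 0.1, 0.3),
    "Chicago": (0.9, 0.8, 0.7),
}
weights = (1.5, 2.0, 1.0)

ranked = sorted(candidates.items(),
                key=lambda kv: confidence(kv[1], weights),
                reverse=True)
for answer, scores in ranked:
    print(f"{answer}: confidence {confidence(scores, weights):.2f}")
```

The interesting part for Jeopardy is the threshold: the system must decide whether its best candidate is confident enough to be worth buzzing in on.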

The language IBM uses around this project is that of a “Grand Challenge.” It is smart how they have taken an analytical problem and used Jeopardy to give the research a target, both in terms of speed and the types of questions handled. Jeopardy also gives them a dramatic venue to demonstrate their progress, just as Deep Blue playing Kasparov did.

The research is based on an Open Architecture for Question Answering (OAQA) that was jointly developed with Carnegie Mellon.