Tasman: Literary Data Processing

I came across a 1957 article by an IBM scientist, P. Tasman, on the methods used in Roberto Busa’s Index Thomisticus project, titled Literary Data Processing (IBM Journal of Research and Development, 1(3): 249-256). The article, which appeared in only the third issue of the journal, has an illustration of how they used punch cards for the project.

[Image: punch card from the article]

While the reproduction is poor, you can make out the fields encoded on the card for each word (see the sketch after the list):

  • Location in text
  • Special reference mark
  • Word
  • Number of word in text
  • First letter of preceding word
  • First letter of following word
  • Form card number
  • Entry card number
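
To make the encoding concrete, here is a minimal sketch of one card as a record. This is my reconstruction for illustration only; the field names and example values are invented, not Tasman’s:

```python
from dataclasses import dataclass

@dataclass
class WordCard:
    """One punch card per word, carrying the fields listed above."""
    location: str            # location in text (work, section, line)
    reference_mark: str      # special reference mark
    word: str                # the word itself
    word_number: int         # number of the word in the text
    preceding_initial: str   # first letter of the preceding word
    following_initial: str   # first letter of the following word
    form_card_number: int    # form card number
    entry_card_number: int   # entry card number

# A hypothetical card for one word:
card = WordCard("I.1.1", "", "veritas", 42, "e", "d", 1, 1)
```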

At the end Tasman speculates on how the methods developed for the project could be used in other areas:

Apart from literary analysis, it appears that other areas of documentation such as legal, chemical, medical, scientific, and engineering information are now susceptible to the methods evolved. It is evident, of course, that the transcription of the documents in these other fields necessitates special sets of ground rules and codes in order to provide for information retrieval, and the results will depend entirely upon the degree and refinement of coding and the variety of cross referencing desired.

The indexing and coding techniques developed by this method offer a comparatively fast method of literature searching, and it appears that the machine-searching application may initiate a new era of language engineering. It should certainly lead to improved and more sophisticated techniques for use in libraries, chemical documentation, and abstract preparation, as well as in literary analysis.

Busa’s project may have been more than just the first humanities computing project. It seems to have been one of the first projects to use computers to handle textual information, and one that showed the possibilities for searching any sort of literature. I should note that in the issue after the one in which Tasman’s article appeared there is an article by H. P. Luhn (developer of the KWIC index), A Statistical Approach to Mechanized Encoding and Searching of Literary Information (IBM Journal of Research and Development, 1(4): 309-317). Luhn specifically mentions Tasman’s article and the concording methods developed as being useful to the larger statistical text mining he proposes. For IBM researchers, Busa’s project was an important first experiment in handling unstructured text.
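
To give a sense of what a machine-generated concordance involves, here is a toy keyword-in-context (KWIC) generator. It is only a sketch of the general idea, not Luhn’s or Tasman’s actual method:

```python
def kwic(text, keyword, width=3):
    """List each occurrence of keyword with `width` words of context per side."""
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        if w.lower().strip(".,;:!?") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left:>30} | {w} | {right}")
    return lines

for line in kwic("In the beginning was the Word, and the Word was with God.", "word"):
    print(line)
```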

I learned about the Tasman article from a paper deposited by Thomas Nelson Winter, Roberto Busa, S.J., and the Invention of the Machine-Generated Concordance. The paper gives an excellent account of Busa’s project and its significance for concording. Well worth the read!

Juxta Commons

[Image: visualization icons]

From Humanist I just learned about Juxta Commons, a web version of the earlier downloadable Java tool. The new version still has the lovely interface that shows the differences between variants. The commons, however, builds on the personal computer tool by being a place where collations can be kept: others can find and explore your collations, and you can search the commons for collation projects.
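
At its core, collation means aligning two witnesses and reporting where they diverge. A toy illustration using Python’s difflib; Juxta’s actual collation algorithm is considerably more sophisticated:

```python
import difflib

# Two hypothetical witnesses with spelling variants:
base    = "the quality of mercy is not strained".split()
witness = "the quality of mercie is not straind".split()

# SequenceMatcher aligns the two token streams and reports edit operations.
for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=base, b=witness).get_opcodes():
    if op != "equal":
        print(op, base[i1:i2], "->", witness[j1:j2])
```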

Another interesting feature is that Google ads appear when you search the commons. The search is “powered by Google,” so perhaps the ads come with the service.

Pundit: A novel semantic web annotation tool

Susan pointed me to Pundit: A novel semantic web annotation tool. Pundit (which has a great domain name, “thepund.it”) is an annotation tool that lets people create and share annotations on web materials. The annotations are triples that can be saved and linked to DBpedia and other linked data resources. I’m not sure I entirely understand how it works, but the demo is impressive. It could be the killer app of semantic web technologies for the digital humanities.
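
As I understand it, each annotation boils down to a subject-predicate-object triple. A rough sketch of what one might look like, using Python’s rdflib; the URIs and the CiTO property are my illustrative choices, not necessarily Pundit’s vocabulary:

```python
from rdflib import Graph, URIRef

g = Graph()

# Hypothetical annotation: a passage on a web page cites Thomas Aquinas.
passage = URIRef("http://example.org/essay.html#passage-12")    # illustrative URI
cites = URIRef("http://purl.org/spar/cito/cites")               # CiTO "cites" property
aquinas = URIRef("http://dbpedia.org/resource/Thomas_Aquinas")  # DBpedia resource

g.add((passage, cites, aquinas))
print(g.serialize(format="turtle"))
```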

War and Peace gets Nookd

From Slashdot I found this blog entry, Ocracoke Island Journal: Nookd, about how a Nook version of War and Peace had the word “kindle” replaced by “nook,” as in “It was as if a light has been Nooked (kindled) in a carved and painted lantern…” It seems that the company that ported the Kindle version over to the Nook ran a search and replace on the word “Kindle” and replaced it with “Nook.”
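
The bug is easy to reproduce: a case-insensitive replace that ignores word boundaries will happily rewrite “kindled” too. A quick illustration in Python (we don’t know what tooling the porting company actually used; this just shows the failure mode):

```python
import re

text = "It was as if a light had been kindled in a carved and painted lantern"

# Replacing every occurrence, even inside longer words, mangles "kindled":
naive = re.sub("kindle", "Nook", text, flags=re.IGNORECASE)
print(naive)  # ... a light had been Nookd in a carved and painted lantern

# A word-boundary-aware replace would have left "kindled" untouched:
safe = re.sub(r"\bkindle\b", "Nook", text, flags=re.IGNORECASE)
print(safe)   # unchanged, since "kindle" never appears as a whole word here
```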

I think this should be turned into a game. We could create an e-reader that plays with the text in various ways, adapting some of Steve Ramsay’s algorithmic ideas (reversing the lines of poetry, for example). Readers could score points by clicking on the words they think were replaced and guessing the correct ones.

Globalization Compendium Archive

I have been working for a while on archiving the Globalization Compendium, a project I worked on. Yesterday I got it archived in two institutional repositories:

In both cases the deposit is a Zip of a BagIt bag with the XML files, code, and other documentation from the site. My first major deposit.
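
For those unfamiliar with it, a BagIt bag is just a directory with a checksum manifest and a few metadata tag files, which can then be zipped for deposit. A minimal sketch using the Library of Congress bagit library for Python; the path and metadata values are invented for illustration:

```python
import bagit

# Turn an existing directory of site files into a bag in place.
# This writes bagit.txt, manifest-sha256.txt and bag-info.txt, and moves
# the payload into a data/ subdirectory.
bag = bagit.make_bag(
    "globalization-compendium/",  # hypothetical path to the exported site files
    {
        "Contact-Name": "Depositor Name",
        "External-Description": "XML files, code and documentation from the site",
    },
    checksums=["sha256"],
)

bag.validate()  # verify every payload file against the manifest
```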

Canadian Writing Research Collaboratory Launch


I am at the Canadian Writing Research Collaboratory (CWRC) launch. CWRC is building a collaborative editing environment that will allow editorial projects to manage the editing of electronic scholarly editions. Among other things, CWRC is developing an online XML editor, editorial workflow management tools, and an integrated repository.

The keynote speakers for the event include Shawna Lemay and Aritha Van Herk.

Ruecker on Visualizing Time

Stan Ruecker gave a great talk today on Visualizations in Time for the Humanities Computing Research Colloquium. He is leading an SSHRC-funded project that builds on Drucker and Nowviskie’s work on Temporal Modelling. (I should mention that I am on the project.) Stan started by talking about all the challenges to the linear visualization of time that you see in tools like Simile. They include:

  • Uncertainty: in some cases we don’t know when an event took place.
  • Relative time: how do we visualize all the ways we talk about time as relative (i.e., events being before or after one another)?
  • Phenomenological time: how do we represent the experience of time?
  • Reception: there is not only the time when something happens but also the time when it is read or received.

Stan then showed a number of visual designs for these different ways of thinking about time. Some looked like rubber sheets, some like frameworks of cubes with things in them, and some like water droplets. Many of these avoided the “line” in the visualization of time.
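
To see why a single line struggles, consider how much a minimal event record has to carry beyond a date. A hypothetical sketch of such a record (my own illustration, not the project’s data model):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Event:
    label: str
    earliest: Optional[int] = None  # uncertainty: sometimes only a range is known
    latest: Optional[int] = None
    after: list[str] = field(default_factory=list)  # relative time: ordering only
    experienced: Optional[str] = None  # phenomenological time, e.g. "an instant"
    received: list[int] = field(default_factory=list)  # reception: when it was read

# An event we can only place relative to another, with two later receptions:
letter = Event("letter written", after=["the battle"], received=[1832, 1995])
```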

How to communicate the dangers of nuclear waste to future civilizations

Reading Umberto Eco’s The Search for the Perfect Language, I came across a discussion of Thomas Sebeok’s work for the U.S. Office of Nuclear Waste Management, “Communication Measures To Bridge Ten Millennia.” Sebeok was commissioned to figure out how to warn people about nuclear waste 10,000 years from now. How do you design a warning system that can survive for ten millennia? He proposed an artificial folklore with a priestly caste to maintain superstitions about the site. He ended up recommending

that information be launched and artificially passed on into the short-term and long-term future with the supplementary aid of folkloristic devices, in particular a combination of an artificially created and nurtured ritual-and-legend. …

The legend-and-ritual, as now envisaged, would be tantamount to laying a “false trail”, meaning that the uninitiated will be steered away from the hazardous site for reasons other than the scientific knowledge of the possibility of radiation and its implications; essentially, the reason would be accumulated superstition to shun a certain area permanently. (p. 24)

Slate Magazine has a great story on the issue, “Atomic Priesthoods, Thorn Landscapes, and Munchian Pictograms: How to communicate the dangers of nuclear waste to future civilizations,” by Juliet Lapidos (Nov. 16, 2009). She surveys some of the interesting ideas, like “Menacing Earthworks” that would warn people off, and discusses a 1993 Sandia report titled “Expert Judgment on Markers To Deter Inadvertent Human Intrusion Into the Waste Isolation Pilot Plant.”