H. P. Luhn, KWIC and the Concordance

We all know that the Google display comes indirectly from the Concordance, but I have found in Luhn’s 1966 “Keyword-in-Context Index for Technical Literature (Kwic Index)” the explicit recognition of the link and the reason for drawing on the concordance.

the significance of such single keywords could, in most instances, be determined only by referring to the statement from which the keyword had been chosen. This somewhat tedious procedure may be alleviated to a significant degree by listing selected keywords together with surrounding words that act as modifiers pointing up the more specific sense in which a keyword has been applied. This method of indexing words is well established in the process of compiling concordances of important works of literature of the past. The added degree of information conveyed by such keyword-in-context indexes, or “KWIC Indexes” for short, can readily be provided by automatic processing. (p. 161)

The problem for Luhn is that simply retrieving words doesn’t give you a sense of their use. His solution, first shown in the late 1950s, was to provide some context (hence “keyword-in-context”) so that readers can disambiguate for themselves and decide which index items to follow. It is from the KWIC that we ultimately get the concordance features of the Google display, though it should be noted that Luhn was proposing KWIC as a way of printing automatically generated literature indexes where the keywords were in the titles. In this quote Luhn explicitly acknowledges that the method was well established in concordances.
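The basic idea is simple enough to sketch in a few lines of Python. This is a hypothetical illustration of a keyword-in-context listing, not Luhn’s actual procedure, which worked on titles with punched-card equipment:

```python
def kwic(text, keyword, width=3):
    """List each occurrence of keyword with `width` words of context on either side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring trailing punctuation.
        if word.lower().strip('.,;:"()') == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            # Right-align the left context so keywords line up in a column.
            lines.append(f"{left:>25}  {word}  {right}")
    return lines
```

Run over a set of titles, each keyword appears centred in a column with its modifiers visible on either side, which is exactly the disambiguation Luhn was after.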

There is also a link between Luhn and Father Busa. According to Black, quoted in Marguerite Fischer, “The Kwic Index Concept: A Retrospective View”,

the Pontifical Faculty of Philosophy in Milan decided that they would make an analytical index and concordance to the Summa Theologica of St. Thomas Aquinas, and approached IBM about the possibility of having the operations performed on Data Processing. Experience gained in this project contributed towards the development of the KWIC Index. (This is a quote on page 123 from Black, J. D., 1962, “The Keyword: Its Use in Abstracting, Indexing, and Retrieving Information”.)

From the concordance to KWIC through to Google?

For some historical notes on Luhn see, H. P. Luhn and Automatic Indexing.

Project Bamboo

I attended Workshop 3 of Project Bamboo in Tucson, Arizona this week. I think I’m beginning to understand it, though understanding what Bamboo is was one of the favorite subjects of conversation at the meeting (so I’m conscious that my understanding is provisional). There is a deliberate ambiguity to the project since they are trying to listen to the community in order to become what we want rather than what they suspect we want. Some of my takeaway thoughts:

  • It is being structured as a consortium. Thus the long term sustainability model is that universities (and possibly associations and individuals) will contribute resources into the consortium and get back services for their faculty. This seems the right way to get to a level of broad support.
  • One thing Bamboo will do is develop shared services that participating universities can use to deliver research support.
  • One of the challenges is figuring out how to listen to the community. The stories are the mechanism being used for this. Scholars are contributing stories of what they do and what they want to do. In some cases the stories are being contributed by people who talk to faculty.
  • Recipes (like those we developed for TAPoR) will be a key way to connect stories to the shared services. A recipe is a way of abstracting from a lot of stories something that can be used to identify the tools and content needed by researchers to do useful work.
  • Bamboo probably won’t build tools, but they will build and run services with which others can build tools. Bamboo may be the project that runs SEASR as a service for the rest of us, for example. We can then build tools with SEASR for our research projects.
  • Bamboo is talking about running the shared services in a cloud. I’m not sure what that means yet.

Cornell Web Lab: Large scale web research

Diagram from Web Lab Paper

The Cornell Web Lab is an interesting example of a high performance computing project in the humanities and social sciences. As they say,

The Web Laboratory is a joint project of Cornell University and the Internet Archive to provide data and computing tools for research about the Web and the information on the Web.

In a paper on the project, A Research Library Based on the Historical Collections of the Internet Archive, William Arms and colleagues point out that the data challenge of the social sciences (and humanities) is that the data is poorly structured and there is a lot of it. The Internet Archive is a case in point; as of 2006 they had 5 to 6 petabytes of data of web pages. While it is amazing that we have such archives in computer (and human) readable form, it is hard to do anything with that much. The Web Lab approach is to provide HPC basic services for extracting subsets of the whole that can then be used by other tools.

Pliny: Welcome

Pliny, the annotation and note management tool by John Bradley at King’s College London, just got a Mellon Award for Technology Collaboration.

The Mellon Awards honour not-for-profit organisations for leadership in the collaborative development of open source software tools with application to scholarship in the arts and humanities, as well as cultural-heritage not-for-profit activities.

Pliny is free and you can try it out on the Mac or PC. John has thought a lot about how tools fit in the research process of humanists.

NiCHE: The Programming Historian

NiCHE (Network in Canadian History & Environment) has a useful wiki called The Programming Historian by William Turkel and Alan MacEachern. The wiki is a “tutorial-style introduction to programming for practicing historians,” but it could also be used by textual scholars who want to be able to program their own tools. It takes you through learning and using Python for text processing tasks like word frequencies and KWICs. It reminds me of Susan Hockey’s book Snobol Programming for the Humanities (Oxford: Oxford University Press, 1985), which I loved at the time, even if I couldn’t find a Snobol interpreter for the Mac.
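To give a flavour of the kind of exercise the wiki walks through, here is a minimal word-frequency count in Python. This is my own sketch of the technique, not code taken from The Programming Historian:

```python
from collections import Counter
import re

def word_frequencies(text, n=10):
    """Return the n most common words, lowercased and stripped of punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)
```

A dozen lines like these are often all a textual scholar needs to start asking quantitative questions of a text.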

We need more of such books/wikis.

Conference Report: Tools For Data-Driven Scholarship

I just got back from the Tools For Data-Driven Scholarship meeting organized by MITH and the Center for History and New Media. This meeting was funded by the NEH, NSF, and the IMLS and brought together tool developers, content providers (like museums and public libraries), and funders (NEH, JISC, Mellon, NSF, and IMLS). The goal was to imagine initiative(s) that could advance humanities tool development and connect tools better with audiences. I have written a Conference Report with my notes on the meeting. One of the interesting questions asked by a funder was “What do the developers really want?” It was unclear that developers really wanted some of the proposed solutions like a directory of tools or a code repository. Three things the breakout group I was in came up with were:

  • Recognition, credit and rewards for tool development – mechanisms to get academic credit for tool development. This could take the form of tool review, competitions, prizes or just citation when our tool is used. In other words we want attention.
  • Long-term Funding so that tool development can be maintained. A lot of tool development takes place in grants that run out before the tool can really be tested and promoted to the community. In other words we want funding to continue tool development without constantly writing grants.
  • Methods, Recipes, and Training that are documented that bring together tools in the context of humanities research practices. We want others with the outreach and writing skills to weave stories about their use to help introduce tools to others. In other words we want others to do the marketing of our tools.

A bunch of us sitting around after the meeting waiting for a plane had the usual debriefing about such meetings. What do they achieve even if they don’t lead to initiatives? From my perspective these meetings are useful in unexpected ways:

  • You meet unexpected people and hear about tools that you didn’t know about. The social dimension is important to meetings organized by others that bring people together from different walks of life. I, for example, finally met William Turkel of Digital History Hacks.
  • Reports are generated that can be used to argue for support without quoting yourself. There should be a report from this meeting.
  • Ideas for initiatives are generated that can get started in unexpected ways. Questions emerge that you hadn’t thought of. For example, the question of audience (both for tools and for initiatives) came up over and over.

University Libraries in Google Project to Offer Backup Digital Library – Chronicle.com

From Bethany I discovered this story in the Chronicle of Higher Education about HathiTrust, titled University Libraries in Google Project to Offer Backup Digital Library (Jeffrey R. Young, Oct. 13, 2008). “Hathi” is the Hindi word for elephant, suggesting memory and size. Here is a quote from the HathiTrust site:

As a digital repository for the nation’s great research libraries, HathiTrust (pronounced hah-TEE) brings together the immense collections of partner institutions.

HathiTrust was conceived as a collaboration of the thirteen universities of the Committee on Institutional Cooperation and the University of California system to establish a repository for these universities to archive and share their digitized collections. Partnership is open to all who share this grand vision.

The repository, among other things, will pool the volumes digitized by Google in collaboration with the universities so there is a backup should Google lose interest. Large-scale search is being studied now, and they expect to have a preview version available in November.

A Companion to Digital Literary Studies

A Companion to Digital Literary Studies, edited by Ray Siemens and Susan Schreibman, is available online in full text. This is a tremendous resource with too many excellent contributions to list individually. Chapters range from “Reading on the Screen” by Christian Vandendorpe to “Algorithmic Criticism” by Stephen Ramsay.

There is a good Annotated Overview of Selected Electronic Resources by Tanya Clement and Gretchen Gueguen with links to projects like TAPoR.

Newsknitter: Knitted Visualization

Image of knitted news
Newsknitter is a project that gathers news from RSS feeds and then generates a visualization that can be knitted into a sweater. Check out the images of the knitted sweaters. This project has been exhibited at Ars Electronica and is the work of two PhD candidates at Kunstuniversität Linz. At first the idea of machine-knitted sweaters of text visualizations sounds like a conceptual art work with no future, but as I think about it, the idea of just-in-time information being visualized and used to generate stable material objects like a sweater seems timely. All sorts of objects could have their designs generated on the spot and on demand from information off the net. Why should data be only visualized and not materialized?