John Bradley’s abstract for his talk at this year’s Digital Humanities conference, Thinking Differently About Thinking: Pliny and Scholarship in the Humanities, pointed me to one of the better discussions of what we know about how humanities scholars do research. Scholarly Work in the Humanities and the Evolving Information Environment is a CLIR report that is available in HTML and PDF. The thing that stands out for me reading it is that humanists are readers (and writers). Reading is research and writing is research. As John puts it when he talks about Pliny, the hard thing to pin down is when we shift from Reading/Interpreting to Interpreting/Writing. It is that turn, when you think you can respond to what you have read, that Pliny (and other note-taking software like Tinderbox) is supposed to help with. If you have invested the time in taking notes while reading, then those notes become useful when writing.
Category: Text Analysis
John Bradley: Pliny
John Bradley gave a talk today at McMaster about Pliny. Pliny is annotation and note-taking software designed to support humanities research. Pliny is built on Eclipse. John argued that we should be thinking of developing compatible Eclipse plugins – using Eclipse as a research environment.
Davidson: Data Mining, Collaboration, and Institutional Infrastructure for Transforming Research and Teaching in the Human Sciences and Beyond
Cathy Davidson has a summary article in CTWatch Quarterly titled, Data Mining, Collaboration, and Institutional Infrastructure for Transforming Research and Teaching in the Human Sciences and Beyond. The article makes some good points about how we have to rethink research in the humanities in the face of digital evidence.
Bibliographic work, translation, and indexical scholarship should also have a place in the reward system of the humanities, as they did in the nineteenth century. The split between “interpretation” or “theoretical” or “analytical” work on the one hand and, on the other, “archival work” or “editing” falls apart when we consider the theoretical, interpretive choices that go into decisions about what will be digitized and how. Do we go with taxonomy (formal categorizing systems as evolved by trained archivists)? Or folksonomy (categories arrived at by users, many of which offer less precise organization than professional indexes but often more interesting ones that point out ambiguities and variabilities of usage and application)?
We also need to rethink paper as the gold standard of the humanities. If scholarship is better presented in an interactive 3-D data base, why does the scholar need to translate that work to a printed page in order for it to “count” towards tenure and promotion? It makes no sense at all if our academic infrastructures are so rigid that they require a “dumbing down” of our research in order for it to be visible enough for tenure and promotion committees.
Davidson talks about a first generation of digital humanities and then makes a Web 2.0 argument about the overwhelming amount of data being gathered and the new paradigms it calls for. I’m not convinced she really understands the achievements of the first generation, if there is such a clear generational division; there is no mention of the TEI or of the work on literary text analysis and publishing.
Blacklight: Faceted searching at UVA
Blacklight is a neat project that Bethany Nowviskie pointed me to at the University of Virginia. They have indexed some 3.7 million records from their library online catalogue and set up a faceted search and browse tool.
What is faceted searching and browsing? Traditional search environments, like those for finding items in a library catalogue, have you fill in fields. In Blacklight you can search with words, but you can also add constraints by clicking on categories within the metadata. So, if I search for “gone with the wind” in Blacklight it shows that there are 158 results. On the right it shows how those results are distributed over different categories: it shows me that 41 of these are “BOOK” in the category “format”. If I click on “BOOK” it adds that constraint and updates the categories I can use to narrow further. Blacklight makes good use of inline graphics (pie charts) so you can see at a glance what percentage of the remaining results fall into each category.
This faceted browsing is a nice example of a rich-prospect view on data where you can see and navigate by a “prospect” of the whole.
Blacklight came out of work on Collex. It is built on Flare, which harnesses Solr through Ruby on Rails. As I understand it, Blacklight is also interesting as an open-source, experimental alternative to very expensive faceted browsing tools. It is a “love letter to the Library” from a humanities computing project and its programmer.
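To give a sense of how the Solr layer underneath can be queried, here is a minimal sketch using Solr’s standard facet parameters. The endpoint URL and the “format” field name are my assumptions for illustration; Blacklight’s actual schema and request handlers will differ.

```python
# A sketch of a faceted Solr query, assuming a local Solr core and a
# hypothetical "format" field; this is not Blacklight's actual setup.
import json
import urllib.parse
import urllib.request

SOLR_SELECT = "http://localhost:8983/solr/catalog/select"  # assumed URL

def faceted_search(query, constraints=None, facet_field="format"):
    """Run a keyword search, apply any facet constraints, and return
    both the hit count and the facet counts for facet_field."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("rows", "0"),            # we only want counts here
        ("facet", "true"),
        ("facet.field", facet_field),
    ]
    # Each filter query (e.g. format:"Book") narrows the result set,
    # which is what clicking a category in Blacklight effectively does.
    for field, value in (constraints or {}).items():
        params.append(("fq", f'{field}:"{value}"'))

    url = SOLR_SELECT + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        data = json.load(response)

    hits = data["response"]["numFound"]
    # Solr returns facet counts as a flat [value, count, value, count, ...] list.
    flat = data["facet_counts"]["facet_fields"][facet_field]
    facets = dict(zip(flat[::2], flat[1::2]))
    return hits, facets

if __name__ == "__main__":
    print(faceted_search("gone with the wind"))
    # Adding a constraint is like clicking "BOOK" in the format facet.
    print(faceted_search("gone with the wind", {"format": "Book"}))
```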
YEP: PDF Browser
Yep is the best new software I’ve come across in a while. Yep is to PDFs on your Mac as iPhoto is to images and iTunes is to music – a well-designed tool for managing large collections of PDFs. Yep can automatically load PDFs from your hard drive, search across them, and let you assign tags with which to organize them. It also lets you move them around (something I wish iPhoto did) and export them to other viewers, e-mail, and print.
Thanks to Shawn for pointing me to this.
TAPoRware Word Cloud
We’ve been playing with ways to make text analysis tools that don’t need parameters, like word clouds, run automatically when a page loads. See the TAPoRware Word Cloud documentation. Here is an example.
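For anyone curious about the mechanics, the core computation behind a word cloud is just frequency counting mapped to display size. The sketch below is an illustration only, not the TAPoRware implementation, which handles stop words, HTML, and layout in its own way.

```python
# A toy sketch of the word-cloud computation: count word frequencies
# and scale them to font sizes. Illustrative only.
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "to", "a", "in", "is", "that", "it", "for"}

def word_cloud(text, max_words=50, min_size=10, max_size=48):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    top = counts.most_common(max_words)
    if not top:
        return []
    biggest = top[0][1]
    # Scale each word's font size linearly between min_size and max_size.
    return [
        (word, min_size + (max_size - min_size) * count / biggest)
        for word, count in top
    ]

if __name__ == "__main__":
    sample = "Reading is research and writing is research. Reading feeds writing."
    for word, size in word_cloud(sample):
        print(f"{word}: {size:.0f}px")
```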
An alternate beginning to humanities computing
Reading Andrew Booth’s Mechanical Resolution of Linguistic Problems (1958) I came across some interesting passages about the beginnings of text computing that suggest an alternative to the canonical Roberto Busa story of origin. Booth (the primary author) starts the book with a “Historical Introduction” in which he alludes to Busa’s project as part of a list of linguistic problems that run parallel to the problems of machine translation:
In parallel with these (machine translation) problems are various others, sometimes of a higher, sometimes of a lower degree of sophistry. There is, for example the problem of the analysis of the frequency of occurrence of words in a given text. … Another problem of the same generic type is that of constructing concordances for given texts, that is, lists, usually in alphabetic order, of the words in these texts, each word being accompanied by a set of page and line references to the place of its occurrence. … The interest at Birkbeck College in this field was chiefly engendered by some earlier research work on the Dialogues of Plato … Parallel work in this field has been carried out by the I.B.M. Corporation, and it appears that some of this work is now being put to practical use in the preparation of a concordance for the works of Thomas Aquinas.
A more involved application of the same sort is to the stylistic analysis of a work by purely mechanical means. (p. 5-6)
In Mechanical Resolution he continues with a discussion of how to use computers to count words and to generate concordances. He has a chapter on the problem of Plato’s dialogues, which seems to have been a set problem at that time, and, of course, there are chapters on dictionaries and machine translation. He describes some experiments he did starting in the late 1940s that suggest that Searle’s Chinese Room Argument of 1980 might have been based on real human simulations.
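To make concrete the kind of processing Booth was describing, here is a minimal keyword-in-context concordance sketched in modern Python. It is purely illustrative, under my own assumptions: Booth’s machines and methods were quite different, and a real concordance records page and line references rather than word offsets.

```python
# A minimal keyword-in-context (KWIC) concordance, for illustration only;
# a real concordance would track page and line numbers for each hit.
import re

def kwic(text, keyword, width=4):
    """Return each occurrence of keyword with `width` words of context
    on either side."""
    words = re.findall(r"\w+", text)
    keyword = keyword.lower()
    lines = []
    for i, w in enumerate(words):
        if w.lower() == keyword:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left:>40}  {w}  {right}")
    return lines

if __name__ == "__main__":
    sample = ("The interest at Birkbeck College in this field was chiefly "
              "engendered by some earlier research work on the Dialogues of Plato")
    print("\n".join(kwic(sample, "the")))
```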
Although no machine was available at this time (1948), the ideas of Booth and Richens were extensively tested by the construction of limited dictionaries of the type envisaged. These were used by a human untutored in the languages concerned, who applied only those rules which could eventually be performed by a machine. The results of these early ‘translations’ were extremely odd, … (p. 2)
Did others run such simulations of computing with “untutored” humans in the early years when they didn’t have access to real systems? See also the PDF of Richens and Booth, Some Methods of Mechanized Translation.
As for Andrew D. Booth, he ended up in Canada working on French/English translation for the Hansard, the bilingual transcript of parliamentary debates. (Note that Bill Winder has also been working on these, but using them as source texts for bilingual collocations.) Andrew and Kathleen Booth wrote a contribution on The Origins of MT (PDF) that describes his early encounters with pioneers of computing around the possibilities of machine translation starting in 1946.
We date realistic possibilities starting with two meetings held in 1946. The first was between Warren Weaver, Director of the Natural Sciences Division of the Rockefeller Foundation, and Norbert Wiener. The second was between Weaver and A.D. Booth in that same year. The Weaver-Wiener discussion centered on the extensive code-breaking activities carried out during the World War II. The argument ran as follows: decryption is simply the conversion of one set of “words”–the code–into a second set, the message. The discussion between Weaver and A.D. Booth on June 20, 1946, in New York identified the fact that the code-breaking process in no way resembled language translation because it was known a priori that the decrypting process must result in a unique output. (p. 25)
Booth seems to have successfully raised funds from the Nuffield Foundation for a computer at Birkbeck College at the University of London that was used by L. Brandwood, among others, for work on Plato. In 1962 he and his wife migrated to Saskatchewan to work on bilingual translation, and in 1972 they moved to Lakehead in Ontario, where they “continued with emphasis on the construction of a large dictionary and the use of statistical techniques in linguistic analysis”. They retired to British Columbia in 1978, as most sensible Canadians do.
In short, Andrew Booth seems to have been involved in the design of early computers in order to get systems that could do machine translation and that led him to support a variety of text processing projects including stylistic analysis and concording. His work has been picked up as important to the history of machine translation, but not for the history of humanities computing. Why is that?
In a 1960 paper on The future of automatic digital computers he concludes,
My feeling on all questions of input-output is, however, the less the better. The ideal use of a machine is not to produce masses of paper with which to encourage Parkinsonian administrators and to stifle human inventiveness, but to make all decisions on the basis of its own internal operations. Thus computers of the future will communicate directly with each other and human beings will only be called on to make those judgements in which aesthetic considerations are involved. (p. 360)
Long Bets Now
Have you ever wanted to go on record with a prediction? Would you like to put money (which goes to charity) on your prediction? The Long Bets Foundation lets you do just that. It is a (partial) spin-off of The Long Now Foundation where you can register and make long-term predictions (up to thousands of years, I believe). The money bet and challenged goes to charity; all you get if you are right is credit and the choice of charity. An example prediction in the text analysis arena is:
Gregory W. Webster predicts: “That by 2020 a wearable device will be available that will use voice recognition capability and high-volume storage to monitor and index conversations you have or conversations which occur in your vicinity for later searching as supplemental memory.” (Prediction 16)
Some of the other predictions of interest to humanists are: 177 about print on demand, 179 about reading on digital devices, and 295 about a second renaissance.
Long Bets has some interesting people making predictions and bets (a prediction becomes a bet when formally challenged), including Mitch Kapor’s prediction, challenged by Ray Kurzweil, that “By 2029 no computer – or “machine intelligence” – will have passed the Turing Test.” (Bet 1)
Just to make life interesting, there is prediction 137 that “The Long Bets Foundation will no longer exist in 2104.” 63% of the voters seem to agree!
Towards a pattern language for text analysis and visualization
One outcome of the iMatter meeting in Montreal is a white paper I have started on TADA that tries to think towards a Pattern Language for Text Analysis and Visualization. This white paper is not the language or a catalogue of patterns, but an attempt to orient myself towards what such a pattern language would be and what the dangers of such a move would be.
TextAnalyst – Text Mining or Text Analysis Software
TextAnalyst is text mining and analysis software from Megaputer. It is hard, without buying it, to tell exactly what it does. They do have what sounds like a neat plug-in for IE that analyzes the web page you are looking at (see the screenshot with this post). The plug-in, TextAnalyst for Microsoft Internet Explorer, summarizes web pages, provides a semantic network, and allows natural language querying.
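Megaputer does not say much about how its summarizer works. As a point of comparison only, here is a naive frequency-based extractive summarizer; this is my own toy sketch and almost certainly not what TextAnalyst, with its semantic network, does under the hood.

```python
# A naive frequency-based extractive summarizer, included only as a point
# of comparison; TextAnalyst's own method is proprietary and not documented here.
import re
from collections import Counter

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    # Score each sentence by the average corpus frequency of its words.
    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Keep the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in ranked)

if __name__ == "__main__":
    print(summarize("Pliny is note-taking software. Pliny supports humanities "
                    "research. Reading is research and writing is research."))
```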