Archive for the ‘Methods’ Category

The Rise and Fall Tool-Related Topics in CHum

Tuesday, May 3rd, 2016
Tool Network Image

Tool network with COCOA selected

I just found out that a paper we gave in 2014 was just published. See The Rise and Fall Tool-Related Topics in CHum. Here is the abstract:

What can we learn from the discourse around text tools? More than might be expected. The development of text analysis tools has been a feature of computing in the humanities since IBM supported Father Busa’s production of the Index Thomisticus (Tasman 1957). Despite the importance of tools in the digital humanities (DH), few have looked at the discourse around tool development to understand how the research agenda changed over the years. Recognizing the need for such an investigation a corpus of articles from the entire run of Computers and the Humanities (CHum) was analyzed using both distant and close reading techniques. By analyzing this corpus using traditional category assignments alongside topic modelling and statistical analysis we are able to gain insight into how the digital humanities shaped itself and grew as a discipline in what can be considered its “middle years,” from when the field professionalized (through the development of journals like CHum) to when it changed its name to “digital humanities.” The initial results (Simpson et al. 2013a; Simpson et al. 2013b), are at once informative and surprising, showing evidence of the maturation of the discipline and hinting at moments of change in editorial policy and the rise of the Internet as a new forum for delivering tools and information about them.

The Index Thomisticus as Project

Monday, March 14th, 2016

This is a story from early in the technological revolution, when the application was out searching for the hardware, from a time before the Internet, a time before the PC, before the chip, before the mainframe. From a time even before programming itself. (Winter 1999, 3)


Father Busa is rightly honoured as one of the first humanists to use computing for a humanities research task. He is considered the founder of humanities computing for his innovative application of information technology and for the considerable influence of his project and methods, not to mention his generosity to others. He did not only work out how use the information technology of the late 1940s and 1950s, but he pioneered a relationship with IBM around language engineering and with their support generously shared his knowledge widely. Ironically, while we have all heard his name and the origin story of his research into presence in Aquinas, we know relatively little about what actually occupied his time – the planning and implementation of what was for its time one of the major research computing projects, the Index Thomsticus.

This blog essay is an attempt to outline some of the features of the Index Thomisticus as a large-scale information technology project as a way of opening a discussion on the historiography of computing in the humanities. This essay follows from a two-day visit to the Busa Archives at the Università Cattolica del Sacro Cuore. This visit was made possible by Marco Carlo Passarotti who directs the “Index Thomisticus” Treebank project in CIRCSE (Centro Interdisciplinare di Ricerche per la Computerizzazione dei Segni dell’Espressione – Interdisciplinary Centre for Research into the Computerization of Expressive Signs) which evolved out of GIRCSE (Gruppo not Centro – or Group not Centre), the group that Father Busa helped form in the 1980s. Passarotti not only introduced me to the archives, he also helped correct this blog as he is himself an archive of stories and details. Growing up in Gallarate, his family knew Busa, he studied under Busa, he took over the project, and he is one of the few who can read Busa’s handwriting.


Original GIRCSE Plaque kept by Passarotti


Canadian Spies Collect Domestic Emails in Secret Security Sweep

Thursday, February 26th, 2015

The Intercept and CBC have been collaborating on stories based on documents leaked by Edward Snowden. One recent story is about how Canadian Spies Collect Domestic Emails in Secret Security Sweep. CSE is collecting email going to the government and flagging suspect emails for analysts.

An earlier story titled CSE’s Levitation project: Expert says spy agencies ‘drowning in data’ and unable to follow leads, tells about the LEVITATION project that monitors file uploads to free file hosting sites. The idea is to identify questionable uploads and then to figure out who is uploading the materials.

Glenn Greenwald (see the embedded video) questions the value of this sort of mass surveillance. He suggests that mass surveillance impedes the ability to find terrorists attacks. The problem is not getting more information, but connecting the dots of what one has. In fact the slides that you can get to from these stories both show that CSE is struggling with too much information and analytical challenges.

My Very Own Voyant Workshop

Tuesday, July 8th, 2014

Stéfan Sinclair and I just finished a workshop on My Very Own Voyant. The workshop focused on how to run VoyantServer on your local machine. This allows you to run Voyant locally. There are all sorts of reasons to run locally:

  • It runs faster
  • You can upload large texts faster
  • It can process larger text corpora
  • You can control the server
  • You can keep your corpora confidential

You can download VoyantServer and read instructions here.

The Isolator, A Bizarre Helmet For Encouraging Concentration (1925)

Wednesday, May 21st, 2014

From Geoff I learned about The Isolator, A Bizarre Helmet For Encouraging Concentration (1925). The Isolator was developed in 1925 by Hugo Gernsback a science fiction pioneer (and editor of Science and Invention magazine.) The idea is to force you to focus on your writing (with lots of oxygen.)

One wonders if it works? Could it be even more useful now?

Front Row to Fashion Week –

Saturday, April 12th, 2014

The New York Times has an interesting way of visualizing fashion that you can see in their article Front Row to Fashion Week – Interactive Feature. They have abstracted the colour hues to create small swatches of different designers who showed at the New York Fashion Week. These “sparklines” or sparkboxes are an interesting way to compare the shows by designers.

Social Digital Scholarly Editing

Tuesday, July 16th, 2013

On July 11th and 12th I was at a conference in Saskatoon on Social Digital Scholarly Editing. This conference was organized by Peter Robinson and colleagues at the University of Saskatchewan. I kept conference notes here.

I gave a paper on “Social Texts and Social Tools.” My paper argued for text analysis tools as a “reader” of editions. I took the extreme case of big data text mining and what scraping/mining tools want in a text and don’t want in a text. I took this extreme view to challenge the scholarly editing view that the more interpretation you put into an edition the better. Big data wants to automate the process of gathering and mining texts – big data wants “clean” texts that don’t have markup, annotations, metadata and other interventions that can’t be easily removed. The variety of markup in digital humanities projects makes it very hard to clean them.

The response was appreciative of the provocation, but (thankfully) not convinced that big data was the audience of scholarly editors.

Tool Discourse

Thursday, February 21st, 2013

Character Density by Year in Tool DiscourseWe are finally getting results in a long slow process of trying to study tool discourse in the digital humanities. Amy Dyrbe and Ryan Chartier are building a corpus of discourse around tools that includes tool reviews, articles about what people are doing with tools, web pages about tools and so on. We took the first coherent chunk and Ryan has been analyzing it with R. The graph above shows which years have the most characters. My hypothesis was that tool reviews and discourse dropped off in the 1990s as the web became more important. This seems to be wrong.

Here are the high-frequency words (with stop words removed). Note the modal verbs “can”, “will”, and “may.” They indicate the potentiality of tools.

“can” 2305
“one” 1996
“text” 1940
“word” 1931
“words” 1859
“program” 1606
“ii” 1514 (Not sure why)
“will” 1361
“language” 1307
“data” 1285
“two” 1188
“system” 1183
“computer” 1116
“used” 1115
“use” 942
“user” 939
“file” 890
“first” 870
“may” 853
“also” 837

Globalization Compendium Archive

Thursday, April 19th, 2012

I have been working for a while on archiving the Globalization Compendium which I worked on. Yesterday I got it archived in two Institutional Repositories:

In both cases there is a Zip of a BagIt bag with the XML files, code and other documentation from the site. My first major deposit.

Old Bailey Trials Are Tabulated for Scholars Online

Thursday, August 18th, 2011

The New York Times now has an article on the Criminal Intent project I was part of. See, Old Bailey Trials Are Tabulated for Scholars Online. They quote a historian who is sceptical of the results of mining, though he appreciates the resource.

“The Old Bailey Online project has done a great service in making those sources widely (and costlessly) available,” Mr. Langbein wrote in an e-mail. But he complained that the claims about data mining have “a breathless quality: ‘you can expect big things from us,’ but as yet it’s all method and no results.” He said that the new findings belittle the work of a generation of scholars who focused on the 18th century as the turning point in the evolution of the criminal justice system.

Alas, he seems didn’t read our report, but the summary in the Chronicle. It is easy to use cute phrases like “breathless quality”, but is he right? Time will tell, but I think the historians on our team have backed up the results found with mining and they never belittled the work of previous scholars – we saw ourselves building on it.

What can mining do? I think mining can give you a big picture so that you see the forest rather than trees in a way that no one could before. Conclusions about the shape of the forest have to be checked against other evidence, but the results of mining is evidence that is not breathless even if it takes your breath away. As Bill Turkel put it,

Mr. Turkel, who developed some of the digital tools, said that data mining reveals unexpected trends and connections that no one would have thought to look for before. Previous scholars “tended to cherry-pick anecdotes without having a sense that it was possible to measure all of that text and treat the whole archive as a single unit,” he said.

Of course, if you then leverage traditional evidence to buttress your argument then the mining is forgotten or trivialized.