AI Weirdness

I just came across a neat site called AI Weirdness. The site describes all sorts of “weird” experiments in training neural networks.

The site has a nice FAQ that describes the author’s tools and how to learn to run such experiments yourself.

Franken-algorithms: the deadly consequences of unpredictable code

The death of a woman hit by a self-driving car highlights an unfolding technological crisis, as code piled on code creates ‘a universe no one fully understands’

The Guardian has a good essay by Andrew Smith about Franken-algorithms: the deadly consequences of unpredictable code. The essay starts with the obvious problems of biased algorithms like those documented by Cathy O’Neil in Weapons of Math Destruction. It then goes further to talk about cases where algorithms learn on the fly or are so complex that their behaviour becomes unpredictable. An example is the high-frequency trading algorithms that trade on the stock market. These algorithmic traders try to outwit each other and learn as they go, which leads to unpredictable “flash crashes” when they go rogue.

The problem, he (George Dyson) tells me, is that we’re building systems that are beyond our intellectual means to control. We believe that if a system is deterministic (acting according to fixed rules, this being the definition of an algorithm) it is predictable – and that what is predictable can be controlled. Both assumptions turn out to be wrong.

The good news is that, according to one of the experts consulted, this could lead to “a golden age for philosophy” as we try to sort out the ethics of these autonomous systems.

EaaSI | The Software Preservation Network

I just learned about a new project called EaaSI | The Software Preservation Network. Stanford will be one of the nodes. The project is looking at how to provide emulation as a service, using technology from Freiburg called bwFLA Emulation as a Service.

Emulation as a strategy for digital preservation is about to become an accepted technology for memory institutions as a method for coping a large variety of complex digital objects. Hence, the demand for ready-made and especially easy-to-use emulation services will grow. In order to provide user-friendly emulation services a scalable, distributed system model is required to be run on heterogeneous Grid or Cluster infrastructure.

The Emulation-as-a-Service architecture simplifies access to preserved digital assets allowing end users to interact with the original environments running on different emulators. Ready-made emulation components provide a flexible web service API allowing for development of individual and tailored digital preservation workflows.

Emulation is going to be important to game preservation. The Internet Archive is already making games and other software available through emulation. There is also MAME (Multiple Arcade Machine Emulator), a community project that has long allowed people to play older games directly from the bit sequences read off cartridges.

Python Programming for the Humanities by Folgert Karsdorp

Having just finished teaching a course on Big Data and Text Analysis where I taught students Python, I can appreciate a well-written tutorial on Python. Python Programming for the Humanities by Folgert Karsdorp is a great tutorial for humanists new to programming. It takes the form of a series of Jupyter notebooks that students can download. As the tutorials are notebooks, students who have set up Python on their computers can use them interactively. Karsdorp has done a nice job of weaving in cells where the student has to write code and quizzes that reinforce the material, which strikes me as an excellent use of the IPython notebook model.

I learned about this while reading a more advanced set of tutorials from Allen Riddell for DARIAH-DE, Text Analysis with Topic Models for the Humanities and Social Sciences. The title doesn’t do this collection of tutorials justice because they include a lot more than just topic models. There are advanced tutorials on all sorts of topics like machine learning and classification. See the index for the range of tutorials.

Text Analysis with Topic Models for the Humanities and Social Sciences (TAToM) consists of a series of tutorials covering basic procedures in quantitative text analysis. The tutorials cover the preparation of a text corpus for analysis and the exploration of a collection of texts using topic models and machine learning.

Stéfan Sinclair and I (mostly Stéfan) have also produced a textbook for teaching programming to humanists called The Art of Literary Text Analysis. These tutorials are also written as Jupyter notebooks so you can download them and play with them.

We are now reimplementing them with our own Voyant-based notebook environment called Spyral. See The Art of Literary Text Analysis with Spyral Notebooks. More on this in another blog entry.


DataCamp

I’ve been playing with DataCamp’s Python lessons and they are quite good. Python is taught in the context of data analysis rather than the turtle drawing of How to Think Like a Computer Scientist. They have a nice mix of video tutorials and exercises where you get a tripartite screen (see above). You have an explanation and instructions on the left, a short script to fill in on the upper right, and an interactive Python shell below where you can try things out.


Common Errors in English Usage

An article about authorship attribution led me to this nice site on Common Errors in English Usage. The site is for a book with that title, but the author Paul Brians has organized all the errors into a hypertext here. For example, here is the entry on why you shouldn’t use enjoy to.

What does this have to do with authorship attribution? In a paper on Authorship Identification on the Large Scale, the authors try using common errors as features to discriminate between potential authors.
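To make the idea of errors as features concrete, here is a minimal Python sketch. It is not the method from the paper: the handful of error patterns below is a hypothetical, hand-picked list for illustration, and the per-1,000-words normalization is my own assumption about how one might make texts of different lengths comparable.

```python
import re

# Hypothetical, hand-picked patterns for illustration only; a real study
# would need a much larger, curated list of usage errors.
ERROR_PATTERNS = {
    "could_of": r"\bcould of\b",
    "should_of": r"\bshould of\b",
    "alot": r"\balot\b",
    "irregardless": r"\birregardless\b",
}


def error_features(text):
    """Return the rate of each error pattern per 1,000 word tokens."""
    lowered = text.lower()
    # Guard against division by zero for empty input.
    n_words = len(re.findall(r"\w+", lowered)) or 1
    return {
        name: 1000 * len(re.findall(pattern, lowered)) / n_words
        for name, pattern in ERROR_PATTERNS.items()
    }
```

Each text becomes a small vector of error rates, which could then be handed to a classifier alongside other stylistic features.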

Instant History conference

This weekend I gave a talk at a lovely one-day conference on Instant History, The Postwar Digital Humanities and Their Legacies. My conference notes are here. The conference was organized by Paul Eggert, among others. Steve Jones, Ted Underwood and Laura Mandell also talked.

I gave the first talk on “Tremendous Labour: Busa’s Methods” – a paper coming from the work Stéfan Sinclair and I are doing. I talked about the reconstruction of Busa’s Index project. I claimed that Busa and Tasman made two crucial innovations. The first was figuring out how to represent data on punched cards so that it could be processed (the data structures). The second was figuring out how to use the punched card machines at hand to tokenize unstructured text. I walked through what we know about their actual methods and talked about our attempts to replicate them.

I was lucky to have two great respondents (Kyle Roberts and Shlomo Argamon) who both pointed out important contextual issues to consider, such as:
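For readers who think in code, the two problems can be loosely illustrated in modern Python. This is only an analogy and not a reconstruction of how Busa and Tasman worked with punched card machines: the `WordCard` record and its field names are hypothetical, and the regular expression stands in for whatever rules they used to split text into words.

```python
import re
from typing import NamedTuple


class WordCard(NamedTuple):
    """One record per word occurrence, loosely like one card per word.

    The field names are hypothetical and chosen only for this sketch.
    """
    line_no: int
    position: int
    form: str


def tokenize_to_cards(lines):
    """Split running text into one record per word, keeping its location."""
    cards = []
    for line_no, line in enumerate(lines, start=1):
        # [^\W\d_]+ matches runs of letters, including accented ones.
        words = re.finditer(r"[^\W\d_]+", line)
        for position, match in enumerate(words, start=1):
            cards.append(WordCard(line_no, position, match.group().lower()))
    return cards


def sorted_index(cards):
    """Order records by word form, then by location in the text."""
    return sorted(cards, key=lambda c: (c.form, c.line_no, c.position))
```

Calling `sorted_index(tokenize_to_cards(lines))` on a list of text lines gives the occurrences grouped by word form, which is the raw material for an index or concordance.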

  • We need to pay attention to the Jesuit and spiritual dimensions of Busa’s work.
  • We need to think about the dialectic of those critical of computing and those optimistic about it.

Making Algorithms Accountable

ProPublica has a great op-ed about Making Algorithms Accountable. The story starts from a decision by the Wisconsin Supreme Court on computer-generated risk (of recidivism) scores. The scores used in Wisconsin come from Northpointe, which provides them as a service based on a proprietary algorithm that seems to be biased against Black people and not very accurate. The story highlights the lack of any legislation regarding algorithms that can affect our lives.

Update: ProPublica has responded to a Northpointe critique of their findings.

The Rise and Fall of Tool-Related Topics in CHum

Tool network with COCOA selected

I just found out that a paper we gave in 2014 has now been published. See The Rise and Fall of Tool-Related Topics in CHum. Here is the abstract:

What can we learn from the discourse around text tools? More than might be expected. The development of text analysis tools has been a feature of computing in the humanities since IBM supported Father Busa’s production of the Index Thomisticus (Tasman 1957). Despite the importance of tools in the digital humanities (DH), few have looked at the discourse around tool development to understand how the research agenda changed over the years. Recognizing the need for such an investigation a corpus of articles from the entire run of Computers and the Humanities (CHum) was analyzed using both distant and close reading techniques. By analyzing this corpus using traditional category assignments alongside topic modelling and statistical analysis we are able to gain insight into how the digital humanities shaped itself and grew as a discipline in what can be considered its “middle years,” from when the field professionalized (through the development of journals like CHum) to when it changed its name to “digital humanities.” The initial results (Simpson et al. 2013a; Simpson et al. 2013b), are at once informative and surprising, showing evidence of the maturation of the discipline and hinting at moments of change in editorial policy and the rise of the Internet as a new forum for delivering tools and information about them.

The Index Thomisticus as Project

This is a story from early in the technological revolution, when the application was out searching for the hardware, from a time before the Internet, a time before the PC, before the chip, before the mainframe. From a time even before programming itself. (Winter 1999, 3)


Father Busa is rightly honoured as one of the first humanists to use computing for a humanities research task. He is considered the founder of humanities computing for his innovative application of information technology and for the considerable influence of his project and methods, not to mention his generosity to others. He not only worked out how to use the information technology of the late 1940s and 1950s, but also pioneered a relationship with IBM around language engineering and, with their support, generously shared his knowledge. Ironically, while we have all heard his name and the origin story of his research into presence in Aquinas, we know relatively little about what actually occupied his time – the planning and implementation of what was for its time one of the major research computing projects, the Index Thomisticus.

This blog essay is an attempt to outline some of the features of the Index Thomisticus as a large-scale information technology project, as a way of opening a discussion on the historiography of computing in the humanities. This essay follows from a two-day visit to the Busa Archives at the Università Cattolica del Sacro Cuore. This visit was made possible by Marco Carlo Passarotti, who directs the “Index Thomisticus” Treebank project in CIRCSE (Centro Interdisciplinare di Ricerche per la Computerizzazione dei Segni dell’Espressione – Interdisciplinary Centre for Research into the Computerization of Expressive Signs), which evolved out of GIRCSE (Gruppo not Centro – or Group not Centre), the group that Father Busa helped form in the 1980s. Passarotti not only introduced me to the archives, he also helped correct this blog, as he is himself an archive of stories and details. He grew up in Gallarate, where his family knew Busa; he studied under Busa, took over the project, and is one of the few who can read Busa’s handwriting.


Original GIRCSE Plaque kept by Passarotti
