Python Programming for the Humanities by Folgert Karsdorp

Having just finished teaching a course on Big Data and Text Analysis in which I taught students Python, I can appreciate a well-written Python tutorial. Python Programming for the Humanities by Folgert Karsdorp is a great tutorial for humanists new to programming. It takes the form of a series of Jupyter notebooks that students can download. As the tutorials are notebooks, students who have set up Python on their computers can use them interactively. Karsdorp has done a nice job of weaving in cells where the student has to write code and quizzes that reinforce the material, which strikes me as an excellent use of the IPython notebook model.

I learned about this while reading a more advanced set of tutorials from Allen Riddell for Dariah-DE, Text Analysis with Topic Models for the Humanities and Social Sciences. The title doesn’t do this collection justice because it includes a lot more than just topic models. There are advanced tutorials on all sorts of topics, like machine learning and classification. See the index for the range of tutorials.

Text Analysis with Topic Models for the Humanities and Social Sciences (TAToM) consists of a series of tutorials covering basic procedures in quantitative text analysis. The tutorials cover the preparation of a text corpus for analysis and the exploration of a collection of texts using topic models and machine learning.

Stéfan Sinclair and I (mostly Stéfan) have also produced a textbook for teaching programming to humanists called The Art of Literary Text Analysis. These tutorials are also written as Jupyter notebooks so you can download them and play with them.

We are now reimplementing them with our own Voyant-based notebook environment called Spyral. See The Art of Literary Text Analysis with Spyral Notebooks. More on this in another blog entry.

txtlab Multilingual Novels

This directory contains 450 novels that appeared between 1770 and 1930 in German, French and English. It is designed for use in teaching and research.

Andrew Piper mentioned a corpus that he put together, txtlab Multilingual Novels. This corpus is of some 450 novels from the late 18th century to the early 20th (1920s). It has a gender mix and is not limited to English novels. This corpus was supported by SSHRC through the Text Mining the Novel project.


Common Crawl

The Common Crawl is a project that has been crawling the web and making an open corpus of web data from the last 7 years available for research. Their crawl corpus is petabytes of data and is available as WARC (Web ARChive) files. For example, their 2013 dataset is 102TB and has around 2 billion web pages. Their collection is not as complete as the Internet Archive, which goes back much further, but it is available in large datasets for research.
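To give a sense of what working with WARC files involves, here is a minimal sketch of a reader for an uncompressed WARC stream, using only the Python standard library. The function name iter_warc_records and the sample records are my own inventions for illustration, not anything from Common Crawl. Real crawl files are gzip-compressed and have many edge cases, so for real work a dedicated library such as warcio is the safer choice.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration; it assumes well-formed records.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")

        # Header lines look like "Name: value" and end at the first blank line.
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Two made-up records standing in for a real crawl file.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/about\r\n"
    b"Content-Length: 7\r\n"
    b"\r\n"
    b"goodbye"
    b"\r\n\r\n"
)

records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

For a compressed file one could pass the object returned by gzip.open(path, "rb") as the stream instead of the BytesIO used here.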

Naylor Report in Voyant

Correspondence Analysis (ScatterPlot) View

The Naylor Report (PDF) about research funding in Canada is out and we put it in Voyant. Here are some different

Continue reading Naylor Report in Voyant

Busa Letter Outlining Textual Informatics

Page 1 of “Conditional Agreement” by Father Busa

Domenico Fiormonte has recently blogged about an interesting document he has by Father Busa that relates to a difficult moment in the history of the digital humanities in Italy in 2002. The two-page “Conditional Agreement”, which I translate below, was given to Domenico and explained the terms under which Busa would agree to sign a letter to the Minister (of Education and Research) Moratti in response to Moratti’s public statement about the uselessness of humanities informatics. A letter was being prepared, to be signed by a large number of Italian (and foreign) academics, explaining the value of what we now call the digital humanities. Busa had the connections to get the letter published and taken seriously, which is why Domenico visited him to ask for his help. That help ended up being conditional on certain things being made clear, as laid out in the document. Domenico kept the two pages Busa wrote. As he points out in his blog, these two pages are a mini-manifesto of Father Busa’s later views of the place and importance of what he called textual informatics. Domenico also points out how political the context of these notes, and of the letter eventually signed and published, was. Defining the digital humanities is often about positioning the field in the larger academic and public political spheres we operate in.

Continue reading Busa Letter Outlining Textual Informatics

Hermeneutica, une expérience numérique de l’interprétation

Arianne Mayer has posted a thorough review (in French) of our book Hermeneutica on Sens Public under the title Hermeneutica, une expérience numérique de l’interprétation. She notes the centrality of dialogue and, in the spirit of dialogue, ends with some good questions about silence to keep the dialogue going:

To continue the dialogue, it would be worth putting Hermeneutica in conversation with theories of reading such as Umberto Eco’s, or with the aesthetics of reception represented by Hans Robert Jauss and Wolfgang Iser. In Umberto Eco’s view (Lector in fabula), there is something to interpret only where the text falls silent. It is all the places of ambivalence, the implicit propositions and the gaps in the work, calling for the cooperation of a reader who puts something of their own into the text to fill in the blanks, that are distinctive of how literature works. Wolfgang Iser (L’Appel du texte) maintains for his part that, far from deducing the meaning of a work from its most frequently used words, “the essential part of a text is what it passes over in silence”.

How can we analyze the gaps, the silences, or that which has not been written?

Geofeedia ‘allowed police to track protesters’

The BBC has a story reporting that US start-up Geofeedia ‘allowed police to track protesters’. Geofeedia is apparently using social media data from Twitter, Facebook and Instagram to monitor activists and protesters for law enforcement. Access to these social media was changed once the ACLU reported on the surveillance product. The ACLU discovered the agreements with Geofeedia when they requested public records of California law enforcement agencies. Geofeedia was boasting to law enforcement about their access. The ACLU has released some of the documents of interest, including a PDF of a Geofeedia Product Update email discussing “sentiment” analytics (May 18, 2016).

From the Geofeedia web site I was surprised to see that they are offering solutions for education too.

Common Errors in English Usage

An article about authorship attribution led me to this nice site on Common Errors in English Usage. The site is for a book with that title, but the author, Paul Brians, has organized all the errors into a hypertext here. For example, here is the entry on why you shouldn’t use “enjoy to”.

What does this have to do with authorship attribution? In a paper on Authorship Identification on the Large Scale the authors try using common errors as features to discriminate between potential authors.
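To make the idea of errors as features concrete, here is a minimal sketch in Python. It is not the method of the paper. The short list of error patterns and the function name error_features are my own assumptions, and a real study would use a far larger list, such as the one on Brians’ site.

```python
import re

# Hypothetical mini-list of usage errors; chosen only for illustration.
ERROR_PATTERNS = {
    "could of": r"\bcould of\b",
    "alot": r"\balot\b",
    "irregardless": r"\birregardless\b",
    "enjoy to": r"\benjoy to\b",
}


def error_features(text):
    """Return the rate of each error per 1,000 words of the text."""
    lowered = text.lower()
    word_count = max(len(re.findall(r"[a-z']+", lowered)), 1)
    return {
        name: 1000 * len(re.findall(pattern, lowered)) / word_count
        for name, pattern in ERROR_PATTERNS.items()
    }


sample = "I could of gone, but I enjoy to stay home alot."
features = error_features(sample)
for name, rate in features.items():
    print(f"{name}: {rate:.1f} per 1,000 words")
```

Each text then becomes a vector of error rates that could be fed to a classifier alongside other stylistic features. Normalizing by length keeps long and short texts comparable.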

CWRC/CSEC: The Canadian Writing Research Collaboratory

The Canadian Writing Research Collaboratory (CWRC) today launched its Collaboratory. The Collaboratory is a distributed editing environment that allows projects to edit scholarly electronic texts (using CWRC Writer), manage editorial workflows, and publish collections. There are also links to other tools like CWRC Catalogue and Voyant (which I am involved in). There is an impressive set of projects already featured in CWRC, but it is open to new projects and designed to help them.

Susan Brown deserves a lot of credit for imagining this, writing the CFI (and other) proposals, leading the development and now managing the release. I hope it gets used as it is a fabulous layer of infrastructure designed by scholars for scholars.

One important component in CWRC is CWRC-Writer, an in-browser XML editor that can be hooked into content management systems like the CWRC back-end. It allows for stand-off markup and connects to entity databases for tagging entities in standardized ways.
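For readers who haven’t met stand-off markup, here is a minimal sketch of the idea in Python. The annotations live apart from the text and point into it by character offsets, instead of being embedded as inline tags. This is not CWRC-Writer’s actual data format. The Annotation class, the tag names, and the entity identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int      # offset of the first character of the span
    end: int        # offset just past the last character of the span
    tag: str        # kind of entity (hypothetical label)
    entity_id: str  # pointer into an entity database (hypothetical ID)


# The text itself stays untouched.
text = "Susan Brown leads CWRC."

# The markup is stored separately and refers to the text by offsets.
annotations = [
    Annotation(0, 11, "person", "person/susan-brown"),
    Annotation(18, 22, "organization", "org/cwrc"),
]


def annotated_spans(text, annotations):
    """Resolve each annotation to the stretch of text it covers."""
    return [(text[a.start:a.end], a.tag, a.entity_id) for a in annotations]


spans = annotated_spans(text, annotations)
for covered, tag, entity_id in spans:
    print(f"{covered!r} is a {tag} ({entity_id})")
```

One attraction of this approach is that several overlapping layers of annotation can point at the same text without interfering with each other.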


At the European Summer University in Digital Humanities 2016 I was lucky to be able to attend some sessions on Stylometry run by Maciej Eder. In his historical review he mentioned people like Valla and Mendenhall, but also mentioned a fellow Pole, Wincenty Lutoslawski, whose book The origin and growth of Plato’s logic; with an account of Plato’s style and of the chronology of his writings (1897) is the first to use the term “stylometry”. Lutoslawski develops a Theory of Stylometry and reviews “500 peculiarities of Plato’s style” as part of his work on Plato’s logic. The nice thing is that the book is available through the Internet Archive.

Eder has a nice page about the work he and others in the Computational Stylistics Group are doing. In the workshop sessions I was able to attend he showed us how to set up and run his “stylo” package (PDF) that provides a simple user interface over R for doing stylometry. He also showed us how to then use Gephi for network visualization.
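To give a flavour of the kind of computation behind stylometry, here is a minimal Python sketch of Burrows’s Delta, a classic distance measure based on the most frequent words. This is not the stylo implementation and not what we ran in the workshop. The function names, the toy texts, and the choice of the population standard deviation are my own assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def word_zscores(texts, n_words=5):
    """Z-score the relative frequencies of the corpus's most frequent words."""
    tokens = {name: text.lower().split() for name, text in texts.items()}
    corpus_counts = Counter(w for toks in tokens.values() for w in toks)
    vocab = [w for w, _ in corpus_counts.most_common(n_words)]

    # Relative frequency of each frequent word in each text.
    freqs = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        freqs[name] = [counts[w] / len(toks) for w in vocab]

    # Standardize each word's frequency across the texts.
    zscores = {name: [] for name in texts}
    for i in range(len(vocab)):
        column = [freqs[name][i] for name in texts]
        mu, sigma = mean(column), pstdev(column)
        for name in texts:
            z = (freqs[name][i] - mu) / sigma if sigma else 0.0
            zscores[name].append(z)
    return zscores


def delta(zscores, a, b):
    """Burrows's Delta: mean absolute difference of the z-scores."""
    return mean([abs(x - y) for x, y in zip(zscores[a], zscores[b])])


# Toy texts; real stylometry needs thousands of words per text.
samples = {
    "a": "the cat sat on the mat and the dog sat too",
    "b": "the cat and the dog sat on the mat and the rug",
    "c": "a bird flew over a hill and a bird sang",
}
z = word_zscores(samples)
for other in ("b", "c"):
    print("a vs", other, round(delta(z, "a", other), 3))
```

The smaller the Delta between two texts, the more alike their use of very common words, which is what makes the measure useful for attribution.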