Having just finished teaching a course on Big Data and Text Analysis where I taught students Python I can appreciate a well written tutorial on Python. Python Programming for the Humanities by Folgert Karsdorpis a great tutorial for humanists new to programming that takes the form of a series of Jupyter notebooks that students can download. As the tutorials are notebooks, if students have set up Python on their computers then they can use the tutorials interactively. Karsdorp has done a nice job of weaving in cells where the student has to code and Quizes which reinforce the materials which strikes me as an excellent use of the IPython notebook model.
Text Analysis with Topic Models for the Humanities and Social Sciences (TAToM) consists of a series of tutorials covering basic procedures in quantitative text analysis. The tutorials cover the preparation of a text corpus for analysis and the exploration of a collection of texts using topic models and machine learning.
Stéfan Sinclair and I (mostly Stéfan) have also produced a textbook for teaching programming to humanists called The Art of Literary Text Analysis. These tutorials are also written as Jupyter notebooks so you can download them and play with them.
This directory contains 450 novels that appeared between 1770 and 1930 in German, French and English. It is designed for us in teaching and research.
Andrew Piper mentioned a corpus that he put together, txtlab Multilingual Novels. This corpus is of some 450 novels from the late 18th century to the early 20th (1920s). It has a gender mix and is not only English novels. This corpus was supported by SSHRC through the Text Mining the Novel project.
The Common Crawl is a project that has been crawling the web and making an open corpus of web data from the last 7 years available for research. There crawl corpus is petabytes of data and available as WARCs (Web Archives.) For example, their 2013 dataset is 102TB and has around 2 billion web pages. Their collection is not as complete as the Internet Archive, which goes back much further, but it is available in large datasets for research.
Domenico Fiormonte has recently blogged about an interesting document he has by Father Busa that relates to a difficult moment in the history of the digital humanities in Italy in 2002. The two page “Conditional Agreement”, which I translate below, was given to Domenico and explained the terms under which Busa would agree to sign a letter to the Minister (of Education and Research) Moratti in response to Moratti’s public statement about the uselessness of humanities informatics. A letter was being prepared to be signed by a large number of Italian (and foreign) academics explaining the value of what we now call the digital humanities. Busa had the connections to get the letter published and taken seriously for which reason Domenico visited him to get his help, which ended up being conditional on certain things being made clear, as laid out in the document. Domenico kept the two pages Busa wrote and recently blogged about them. As he points out in his blog, these two pages are a mini-manifesto of Father Busa’s later views of the place and importance of what he called textual informatics. Domenico also points out how political is the context of these notes and the letter eventually signed and published. Defining the digital humanities is often about positioning the field in the larger academic and public political spheres we operate in.
Pour continuer le dialogue, on gagnerait à faire converser Hermeneutica avec des théories de la lecture comme celle d’Umberto Eco ou avec l’esthétique de la réception, représentée par Hans Robert Jauss et Wolfgang Iser. Aux yeux d’Umberto Eco (Lector in fabula), il n’y a à interpréter que là où le texte se tait. Ce sont tous les lieux d’ambivalence, les propositions implicites et les vides de l’œuvre, suscitant la coopération d’un lecteur qui met du sien dans le texte pour combler les blancs, qui font le propre du fonctionnement littéraire. Wolfgang Iser (L’Appel du texte) affirme de son côté que, loin de déduire le sens d’une œuvre de ses mots les plus utilisés, « l’essentiel d’un texte est ce qu’il passe sous silence ».
How can we analyze the gaps, the silences, or that which has not been written?
An article about authorship attribution led me to this nice site on Common Errors in English Usage. The site is for a book with that title, but the author Paul Brians has organized all the errors into a hypertext here. For example, here is the entry on why you shouldn’t use enjoy to.
The Canadian Writing Research Collaboratory (CWRC) today launched its Collaboratory. The Collaboratory is a distributed editing environment that allows projects to edit scholarly electronic texts (using CWRC Writer), manage editorial workflows, and publish collections. There are also links to other tools like CWRC Catalogue and Voyant (that I am involved in.) There is an impressive set of projects already featured in CWRC, but it is open to new projects and designed to help them.
Susan Brown deserves a lot of credit for imagining this, writing the CFI (and other) proposals, leading the development and now managing the release. I hope it gets used as it is a fabulous layer of infrastructure designed by scholars for scholars.
One important component in CWRC is CWRC-Writer, an in-browser XML editor that can be hooked into content management systems like the CWRC back-end. It allows for stand-off markup and connects to entity databases for tagging entities in standardized ways.
Eder has a nice page about the work he and ogthers in the Computational Stylistics Group are doing. In the workshop sessions I was able to attend he showed us how to set up and run his “stylo” package (PDF) that provides a simple user interface over R for doing stylometry. He also showed us how to then use Gephi for network visualization.