Common Errors in English Usage

An article about authorship attribution led me to this nice site on Common Errors in English Usage. The site is for a book with that title, but the author, Paul Brians, has organized all the errors into a hypertext here. For example, here is the entry on why you shouldn’t use “enjoy to”.

What does this have to do with authorship attribution? In a paper on Authorship Identification on the Large Scale, the authors try using common errors as features to discriminate among potential authors.
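To make the idea concrete, here is a minimal sketch of how common errors might be turned into features. It is not the method from the paper: the list of error patterns and the function name `error_features` are my own assumptions, chosen only for illustration. Each text becomes a small vector of error rates that a classifier could then compare across candidate authors.

```python
import re

# Hypothetical error patterns, picked only for illustration.
# A real study would draw on a much longer list of usage errors.
ERROR_PATTERNS = {
    "alot": r"\balot\b",
    "could_of": r"\b(?:could|should|would) of\b",
    "enjoy_to": r"\benjoy to\b",
    "irregardless": r"\birregardless\b",
}


def error_features(text):
    """Return the rate of each error pattern per 1,000 words of text."""
    lowered = text.lower()
    # Guard against division by zero for empty texts.
    n_words = max(len(re.findall(r"\w+", lowered)), 1)
    return {
        name: 1000 * len(re.findall(pattern, lowered)) / n_words
        for name, pattern in ERROR_PATTERNS.items()
    }


if __name__ == "__main__":
    sample = "I could of gone, but I enjoy to read alot."
    for name, rate in error_features(sample).items():
        print(f"{name}: {rate:.1f}")
```

Normalizing by text length matters here because otherwise longer texts would look more error-prone simply by being longer.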

Instant History conference

This weekend I gave a talk at a lovely one-day conference on Instant History, The Postwar Digital Humanities and Their Legacies. My conference notes are here. The conference was organized by Paul Eggert, among others. Steve Jones, Ted Underwood and Laura Mandell also talked.

I gave the first talk on “Tremendous Labour: Busa’s Methods” – a paper coming from the work Stéfan Sinclair and I are doing. I talked about the reconstruction of Busa’s Index project. I claimed that Busa and Tasman made two crucial innovations. The first was figuring out how to represent data on punched cards so that it could be processed (the data structures). The second was figuring out how to use the punched card machines at hand to tokenize unstructured text. I walked through what we know about their actual methods and talked about our attempts to replicate them.
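As a loose modern analogue of those two innovations, here is a minimal sketch in Python. It is not a reconstruction of Busa and Tasman's procedure: the `WordCard` record, its fields, and the function names are assumptions of mine. The record stands in for the data structure (one card per word, carrying a reference back to its place in the text), and the two functions stand in for tokenizing and then sorting the cards.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class WordCard:
    """Hypothetical stand-in for one punched card holding one word."""
    word: str      # the token, lower-cased
    line_no: int   # line of the text the word occurs in (1-based)
    position: int  # position of the word within that line (1-based)


def tokenize_to_cards(lines):
    """Break each line of running text into one WordCard per word."""
    cards = []
    for line_no, line in enumerate(lines, start=1):
        # [^\W\d_]+ matches runs of letters, including accented ones.
        words = re.findall(r"[^\W\d_]+", line)
        for position, word in enumerate(words, start=1):
            cards.append(WordCard(word.lower(), line_no, position))
    return cards


def sort_cards(cards):
    """Order the cards alphabetically by word, then by location."""
    return sorted(cards, key=lambda c: (c.word, c.line_no, c.position))


if __name__ == "__main__":
    text = ["In principio erat Verbum", "et Verbum erat apud Deum"]
    for card in sort_cards(tokenize_to_cards(text)):
        print(f"{card.word:<10} line {card.line_no}, word {card.position}")
```

Once every word carries its own location, an index or concordance is largely a matter of sorting and grouping the records.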

I was lucky to have two great respondents (Kyle Roberts and Shlomo Argamon) who both pointed out important contextual issues to consider, such as:

  • We need to pay attention to the Jesuit and spiritual dimensions of Busa’s work.
  • We need to think about the dialectic of those critical of computing and those optimistic about it.

CWRC/CSEC: The Canadian Writing Research Collaboratory

The Canadian Writing Research Collaboratory (CWRC) today launched its Collaboratory. The Collaboratory is a distributed editing environment that allows projects to edit scholarly electronic texts (using CWRC Writer), manage editorial workflows, and publish collections. There are also links to other tools like CWRC Catalogue and Voyant (which I am involved in). There is an impressive set of projects already featured in CWRC, but it is open to new projects and designed to help them.

Susan Brown deserves a lot of credit for imagining this, writing the CFI (and other) proposals, leading the development and now managing the release. I hope it gets used as it is a fabulous layer of infrastructure designed by scholars for scholars.

One important component in CWRC is CWRC-Writer, an in-browser XML editor that can be hooked into content management systems like the CWRC back-end. It allows for stand-off markup and connects to entity databases for tagging entities in standardized ways.
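For readers unfamiliar with stand-off markup, here is a minimal sketch of the general idea in Python. It does not use or describe CWRC-Writer's actual data model or API: the `Annotation` record, the helper functions, the tag name, and the entity identifier are all hypothetical. The point is only that annotations live apart from the text and point into it by character offsets, so the source text itself is never altered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: it points into the text by offsets."""
    start: int      # offset of the first character (inclusive)
    end: int        # offset just past the last character (exclusive)
    tag: str        # kind of entity, e.g. "persName"
    entity_id: str  # hypothetical identifier in an entity database


def annotate(text, surface, tag, entity_id):
    """Create an annotation for the first occurrence of `surface`."""
    start = text.find(surface)
    if start == -1:
        raise ValueError(f"{surface!r} not found in text")
    return Annotation(start, start + len(surface), tag, entity_id)


def render_inline(text, annotations):
    """Merge non-overlapping annotations back into inline tags.

    Assumes annotations do not overlap and skips XML escaping,
    to keep the sketch short.
    """
    out, cursor = [], 0
    for ann in sorted(annotations, key=lambda a: a.start):
        out.append(text[cursor:ann.start])
        span = text[ann.start:ann.end]
        out.append(f'<{ann.tag} ref="{ann.entity_id}">{span}</{ann.tag}>')
        cursor = ann.end
    out.append(text[cursor:])
    return "".join(out)


if __name__ == "__main__":
    text = "Susan Brown leads CWRC."
    notes = [annotate(text, "Susan Brown", "persName", "person:example-1")]
    print(render_inline(text, notes))
```

Because the annotations are separate records, several layers of tagging can refer to the same text without having to be nested inside one another.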

Digital Humanities 2016 in Kraków

The week of the 11th to the 16th of July was Digital Humanities 2016 in Kraków. This conference was, in my opinion, the best organized DH conference I have attended (and I have attended most of them since the first joint ACH-ALLC conference in Toronto in 1989). Jan Rybicki and Maciej Eder deserve credit for a lovely conference.

My conference notes are on philosophi.ca so I won’t go into a lot of detail here. Some of the themes worth noting include:

  • Diversity. There was a lot of discussion and many sessions dedicated to diversity of different sorts. Real differences were aired, which I think most people felt was good.
  • Pedagogy. Perhaps it is what I attended, but it seemed that there was a new energy around pedagogical discussions. I was impressed by the creative approaches and also by large-scale projects like the DARIAH-EU working group on Training and Education.
  • Web Historiography. There were a number of talks/panels that drew on the web as evidence. I was pleased to see a discussion of the need to think historiographically about the web. What is archived? What is missing?
  • Posters. There was a great set of posters. Here is a link to photos I took of a selection.

Some of the events and papers I was involved in include:

  • New Scholars Symposium, which was supported by CHCI and centerNet. I co-organized this with Rachel Hendery.
  • Innovations in Digital Humanities Pedagogy: Local, National, and International Training. I was part of a one-day mini-conference on training and gave a short presentation on Visualization at the final panel on Publication Approaches Supporting DH Pedagogy.
  • CWRC & Voyant Tools: Text Repository Meets Text Analysis. I was one of three instructors on a workshop on CWRC and Voyant.
  • Curating Just-In-Time Datasets from the Web. I gave a paper, coauthored with Todd Suomela and Ryan Chartier, on a project that scrapes Twitter.
  • The Trace of Theory: Extracting Subsets from Large Collections. I introduced and gave one of the short papers on a panel of work we did as part of the Text Mining the Novel project with the HathiTrust Research Center.
  • Web Historiography – A New Challenge for Digital Humanities? I gave a short presentation on the Ethics of Scraping Twitter.

Congress 2016 (CSDH and CGSA)

As I get ready to fly back to Germany I’m finishing my conference notes on Congress 2016 (CSDH and CGSA). Calgary was nice and not too hot for Congress, and we were welcomed by a malware attack on Congress that meant that many employees couldn’t use their machines. Nevertheless, the conference seemed very well organized and the campus lovely.

My conference notes cover mostly the Canadian Society for Digital Humanities, but also DHSI at Congress, where I presented CWRC for Susan Brown, and the last day of the Canadian Game Studies Association. Here are some general reflections.

  • I am impressed by how the CGSA is growing and how vital it is. It has as many attendees as CSDH, but they are younger and enthusiastic rather than tired. Much of the credit goes to the long-term leadership of people like Jen Jenson.
  • CSDH had some terrific keynotes this year, starting with Ian Milligan, then Tara McPherson, and finally Diane Jakacki.
  • It was great to see people coming up from the USA as CSDH/SCHN gets a reputation for being a welcoming conference in North America.
  • Stéfan Sinclair and I had a book launch for Hermeneutica: Computer-Assisted Interpretation in the Humanities at which Chad Gaffield said a few words. It was gratifying that so many friends came out for this.

At the CSDH AGM we passed a motion to adopt Guidelines on Digital Scholarship in the Humanities (Google Doc). The Guidelines discuss the value of digital work and provide guidelines for evaluation:

Programs of research, which are by nature exploratory, may require faculty members to take up modes of research that depart from methods they have previously used, therefore the form the resulting scholarship takes should not prejudice its evaluation. Original works in new media forms, whether digital or other, should be evaluated as scholarship following best practices if so presented. Likewise, researchers should be encouraged to experiment with new forms when disseminating knowledge, confident that their experiments will be fairly evaluated.

The Guidelines have a final section on Documented Deposit:

Digital media have not only expanded the forms that research can take, but research practices are also changing in the face of digital distribution and open access publishing. In particular we are being called on to preserve research data and to share new knowledge openly. Universities that have the infrastructure should encourage faculty to deposit not only digital works, but also curated datasets and preprint versions of papers/monographs with documentation in an open access form. These can be deposited with an embargo in digital archives as part of good practice around research dissemination and preservation. The deposit of work, including online published work, even if it is available elsewhere, ensures the long-term preservation by ensuring that there are copies in more than one place. Further, libraries can then ensure that the work is not only preserved, but is discoverable in the long term as publications come and go.

Replication as a way of knowing in the digital humanities

Poster for Replication Talk

At the end of April I gave a talk at the University of Würzburg on Replication as a way of knowing in the digital humanities. This was sponsored by Dr. Fotis Jannidis, who holds the position of Chair of computer philology and modern German literature there. He and others have built a digital humanities program and an interesting research agenda around text mining and German literature. The talk tried out some new ideas Stéfan Sinclair and I are working on. The abstract read:

Much new knowledge in the digital humanities comes from the practices of encoding and programming not through discourse. These practices can be considered forms of modelling in the active sense of making by modelling or, as I like to call them, practices of thinking-through. Alas, these practices and the associated ways of knowing are not captured or communicated very well through the usual academic forms of publication which come out of discursive knowledge traditions. In this talk I will argue for “replication” as a way of thinking-through the making of code. I will give examples and conclude by arguing that such thinking-through replication is critical to the digital literacy needed in the age of big data and algorithms.

The Rise and Fall of Tool-Related Topics in CHum

Tool network with COCOA selected

I just found out that a paper we gave in 2014 has now been published. See The Rise and Fall of Tool-Related Topics in CHum. Here is the abstract:

What can we learn from the discourse around text tools? More than might be expected. The development of text analysis tools has been a feature of computing in the humanities since IBM supported Father Busa’s production of the Index Thomisticus (Tasman 1957). Despite the importance of tools in the digital humanities (DH), few have looked at the discourse around tool development to understand how the research agenda changed over the years. Recognizing the need for such an investigation a corpus of articles from the entire run of Computers and the Humanities (CHum) was analyzed using both distant and close reading techniques. By analyzing this corpus using traditional category assignments alongside topic modelling and statistical analysis we are able to gain insight into how the digital humanities shaped itself and grew as a discipline in what can be considered its “middle years,” from when the field professionalized (through the development of journals like CHum) to when it changed its name to “digital humanities.” The initial results (Simpson et al. 2013a; Simpson et al. 2013b), are at once informative and surprising, showing evidence of the maturation of the discipline and hinting at moments of change in editorial policy and the rise of the Internet as a new forum for delivering tools and information about them.

Literature Measured

I finally got around to reading the latest Pamphlets of the Stanford Literary Lab. This pamphlet, 12. Literature Measured (PDF) written by Franco Moretti, is a reflection on the Lab’s research practices and why they chose to publish pamphlets. It is apparently the introduction to a French edition of the pamphlets. The pamphlet makes some important points about their work and the digital humanities in general.

Images come first, in our pamphlets, because – by visualizing empirical findings – they constitute the specific object of study of computational criticism; they are our “text”; the counterpart to what a well-defined excerpt is to close reading. (p. 3)

I take this to mean that the image shows the empirical findings or the model drawn from the data. That model is studied through the visualization. The visualization is not an illustration or supplement.

By frustrating our expectations, failed experiments “estrange” our natural habits of thought, offering us a chance to transform them. (p. 4)

The pamphlet has a good section on failure and how it is not just a rhetorical ploy, but important to research. I would add that only certain types of failure are important in this way. There are dumb failures too. He then moves on to the question of successes in the digital humanities and ends with an interesting reflection on how the digital humanities and Marxist criticism don’t seem to have much to do with each other.

But he (Bourdieu) also stands for something less obvious, and rather perplexing: the near-absence from digital humanities, and from our own work as well, of that other sociological approach that is Marxist criticism (Raymond Williams, in “A Quantitative Literary History”, being the lone exception). This disjunction – perfectly mutual, as the indifference of Marxist criticism is only shaken by its occasional salvo against digital humanities as an accessory to the corporate attack on the university – is puzzling, considering the vast social horizon which digital archives could open to historical materialism, and the critical depth which the latter could inject into the “programming imagination”. It’s a strange state of affairs; and it’s not clear what, if anything, may eventually change it. For now, let’s just acknowledge that this is how things stand; and that – for the present writer – something needs to be done. It would be nice if, one day, big data could lead us back to big questions. (p. 7)