Stylometry

At the European Summer University in Digital Humanities 2016 I was lucky to be able to attend some sessions on Stylometry run by Maciej Eder. In his historical review he mentioned people like Valla and Mendenhall, but also mentioned a fellow Pole, Wincenty Lutoslawski, whose book The origin and growth of Plato’s logic; with an account of Plato’s style and of the chronology of his writings (1897) is the first to use the term “stylometry”. Lutoslawski develops a Theory of Stylometry and reviews “500 peculiarities of Plato’s style” as part of his work on Plato’s logic. The nice thing is that the book is available through the Internet Archive.

Eder has a nice page about the work he and others in the Computational Stylistics Group are doing. In the workshop sessions I was able to attend he showed us how to set up and run his “stylo” package (PDF), which provides a simple user interface over R for doing stylometry. He also showed us how to then use Gephi for network visualization.

 

They know (on surveillance)

They know is a must-see design project by Christian Gross from the Interface Design Programme at the University of Applied Sciences in Potsdam (FHP), Germany. The idea behind the project, described in the They Know showcase for FHP, is:

I could see in my daily work how difficult it was to inform people about their privacy issues. Nobody seemed to care. My hypothesis was that the whole subject was too complex. There were no examples, no images that could help the audience to understand the process behind the mass surveillance.

The answer is to mock up a design fiction of an NSA surveillance dashboard based on what we know and then a video describing a fictional use of it to track an architecture student from Berlin. It seems to me the video and mock designs nicely bring together a number of things we can infer about the tools they have.

Congress 2016 (CSDH and CGSA)

As I get ready to fly back to Germany I’m finishing my conference notes on Congress 2016 (CSDH and CGSA). Calgary was nice and not too hot for Congress, though we were welcomed by a malware attack on Congress that meant that many employees couldn’t use their machines. Nevertheless the conference seemed very well organized and the campus was lovely.

My conference notes cover mostly the Canadian Society for Digital Humanities, but also DHSI at Congress, where I presented CWRC for Susan Brown, and the last day of the Canadian Game Studies Association. Here are some general reflections.

  • I am impressed by how the CGSA is growing and how vital it is. It has as many attendees as CSDH, but they are younger and enthusiastic rather than tired. Much of the credit goes to the long-term leadership of people like Jen Jensen.
  • CSDH had some terrific keynotes this year, starting with Ian Milligan, then Tara McPherson, and finally Diane Jakacki.
  • It was great to see people coming up from the USA as CSDH/SCHN gets a reputation for being a welcoming conference in North America.
  • Stéfan Sinclair and I had a book launch for Hermeneutica: Computer-Assisted Interpretation in the Humanities at which Chad Gaffield said a few words. It was gratifying that so many friends came out for this.

At the CSDH AGM we passed a motion to adopt Guidelines on Digital Scholarship in the Humanities (Google Doc). The Guidelines discuss the value of digital work and provide guidelines for evaluation:

Programs of research, which are by nature exploratory, may require faculty members to take up modes of research that depart from methods they have previously used, therefore the form the resulting scholarship takes should not prejudice its evaluation. Original works in new media forms, whether digital or other, should be evaluated as scholarship following best practices if so presented. Likewise, researchers should be encouraged to experiment with new forms when disseminating knowledge, confident that their experiments will be fairly evaluated.

The Guidelines have a final section on Documented Deposit:

Digital media have not only expanded the forms that research can take, but research practices are also changing in the face of digital distribution and open access publishing. In particular we are being called on to preserve research data and to share new knowledge openly. Universities that have the infrastructure should encourage faculty to deposit not only digital works, but also curated datasets and preprint versions of papers/monographs with documentation in an open access form. These can be deposited with an embargo in digital archives as part of good practice around research dissemination and preservation. The deposit of work, including online published work, even if it is available elsewhere, ensures the long-term preservation by ensuring that there are copies in more than one place. Further, libraries can then ensure that the work is not only preserved, but is discoverable in the long term as publications come and go.

 

Replication as a way of knowing in the digital humanities

Poster for Replication Talk

At the end of April I gave a talk at the University of Würzburg on Replication as a way of knowing in the digital humanities. This was sponsored by Dr. Fotis Jannidis, who holds the position of Chair of computer philology and modern German literature there. He and others have built a digital humanities program and an interesting research agenda around text mining and German literature. The talk tried out some new ideas Stéfan Sinclair and I are working on. The abstract read:

Much new knowledge in the digital humanities comes from the practices of encoding and programming not through discourse. These practices can be considered forms of modelling in the active sense of making by modelling or, as I like to call them, practices of thinking-through. Alas, these practices and the associated ways of knowing are not captured or communicated very well through the usual academic forms of publication which come out of discursive knowledge traditions. In this talk I will argue for “replication” as a way of thinking-through the making of code. I will give examples and conclude by arguing that such thinking-through replication is critical to the digital literacy needed in the age of big data and algorithms.

The Rise and Fall of Tool-Related Topics in CHum

Tool Network Image
Tool network with COCOA selected

I just found out that a paper we gave in 2014 has now been published. See The Rise and Fall of Tool-Related Topics in CHum. Here is the abstract:

What can we learn from the discourse around text tools? More than might be expected. The development of text analysis tools has been a feature of computing in the humanities since IBM supported Father Busa’s production of the Index Thomisticus (Tasman 1957). Despite the importance of tools in the digital humanities (DH), few have looked at the discourse around tool development to understand how the research agenda changed over the years. Recognizing the need for such an investigation a corpus of articles from the entire run of Computers and the Humanities (CHum) was analyzed using both distant and close reading techniques. By analyzing this corpus using traditional category assignments alongside topic modelling and statistical analysis we are able to gain insight into how the digital humanities shaped itself and grew as a discipline in what can be considered its “middle years,” from when the field professionalized (through the development of journals like CHum) to when it changed its name to “digital humanities.” The initial results (Simpson et al. 2013a; Simpson et al. 2013b), are at once informative and surprising, showing evidence of the maturation of the discipline and hinting at moments of change in editorial policy and the rise of the Internet as a new forum for delivering tools and information about them.

IBM to close Many Eyes

I just discovered that IBM is to close Many Eyes. This is a pity. It was a great environment that let people upload data and visualize it in different ways. I blogged about it ages ago (in computer ages, anyway). In particular I liked their Word Tree, which seems one of the best ways to explore language use.

It seems that some of the programmers moved on and that IBM is now focusing on Watson Analytics.

What’s in a number? William Shakespeare’s legacy analysed


The Guardian published an article on What’s in a number? William Shakespeare’s legacy analysed (April 22, 2016). This article is part of a Shakespeare 400 series in honour of the 400th anniversary of the bard’s death. The article is introduced thus:

Shakespeare’s ability to distil human nature into an elegant turn of phrase is rightly exalted – much remains vivid four centuries after his death. Less scrutiny has been given to statistics about the playwright and his works, which tell a story in their own right. Here we analyse the numbers behind the Bard.

The authors offer a series of visualizations of statistics about Shakespeare that are rather more of a tease than anything really interesting. They also ignore the long history of using quantitative methods to study Shakespeare going back to Mendenhall’s study of authorship using word lengths.

Mendenhall, T. C. (1901). “A Mechanical Solution of a Literary Problem.” The Popular Science Monthly. LX(7): 97-105.
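To give a feel for what a word-length study involves, here is a minimal sketch in Python that computes the share of words of each length in a text. It is only an illustration of the general idea, not Mendenhall's actual procedure: the function name `word_length_profile` and the simple letters-only tokenization are my own assumptions.

```python
import re
from collections import Counter


def word_length_profile(text):
    """Return {word length: relative frequency} for the words in `text`.

    Assumption for this sketch: a "word" is any unbroken run of ASCII
    letters, so apostrophes, hyphens and digits act as separators.
    """
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return {}
    counts = Counter(len(word) for word in words)
    total = len(words)
    # Sort by length so the profile reads like a histogram.
    return {length: count / total for length, count in sorted(counts.items())}


profile = word_length_profile("To be, or not to be")
for length, share in profile.items():
    print(f"{length}-letter words: {share:.2f}")
```

Comparing such profiles across texts of known and disputed authorship is the basic move; a real study would need far longer samples and some care about spelling and editions.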

Literature Measured

I finally got around to reading the latest of the Pamphlets of the Stanford Literary Lab. This pamphlet, 12. Literature Measured (PDF), written by Franco Moretti, is a reflection on the Lab’s research practices and why they chose to publish pamphlets. It is apparently the introduction to a French edition of the pamphlets. The pamphlet makes some important points about their work and the digital humanities in general.

Images come first, in our pamphlets, because – by visualizing empirical findings – they constitute the specific object of study of computational criticism; they are our “text”; the counterpart to what a well-defined excerpt is to close reading. (p. 3)

I take this to mean that the image shows the empirical findings or the model drawn from the data. That model is studied through the visualization. The visualization is not an illustration or supplement.

By frustrating our expectations, failed experiments “estrange” our natural habits of thought, offering us a chance to transform them. (p. 4)

The pamphlet has a good section on failure and how that is not just a rhetorical ploy, but important to research. I would add that only certain types of failure are important in this way. There are dumb failures too. He then moves on to the question of successes in the digital humanities and ends with an interesting reflection on how the digital humanities and Marxist criticism don’t seem to have much to do with each other.

But he (Bourdieu) also stands for something less obvious, and rather perplexing: the near-absence from digital humanities, and from our own work as well, of that other sociological approach that is Marxist criticism (Raymond Williams, in “A Quantitative Literary History”, being the lone exception). This disjunction – perfectly mutual, as the indifference of Marxist criticism is only shaken by its occasional salvo against digital humanities as an accessory to the corporate attack on the university – is puzzling, considering the vast social horizon which digital archives could open to historical materialism, and the critical depth which the latter could inject into the “programming imagination”. It’s a strange state of affairs; and it’s not clear what, if anything, may eventually change it. For now, let’s just acknowledge that this is how things stand; and that – for the present writer – something needs to be done. It would be nice if, one day, big data could lead us back to big questions. (p. 7)

Where Probability Meets Literature and Language: Markov Models for Text Analysis

3quarksdaily, one of my favourite sites to read, just posted a very nice essay by Sanjukta Paul on Where Probability Meets Literature and Language: Markov Models for Text Analysis. The essay starts with Markov, who in the early 20th century was doing linguistic analysis by hand, and goes on to authorship attribution by people like Fiona Tweedie (the image above is from a study she co-authored). It also explains Markov models on the way.
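For readers who want to see the idea in miniature, here is a rough sketch of a first-order Markov model over words: it estimates, for each word, the probability of each word that follows it. This is my own toy illustration, not code from the essay; the function name `transition_probabilities` and the whitespace tokenization are assumptions.

```python
from collections import Counter, defaultdict


def transition_probabilities(tokens):
    """Estimate P(next token | current token) from adjacent pairs.

    A first-order Markov model assumes the next token depends only on
    the current one, so counting adjacent pairs is all that is needed.
    """
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return {
        state: {
            nxt: n / sum(followers.values())
            for nxt, n in followers.items()
        }
        for state, followers in counts.items()
    }


# Toy example; a real study would use a full text, not one sentence.
tokens = "the cat sat on the mat".split()
model = transition_probabilities(tokens)
print(model["the"])
```

In this toy sentence "the" is followed once by "cat" and once by "mat", so each gets half the probability. Authorship attribution along these lines would compare how well models trained on different authors predict a disputed text.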