Show and Tell at CRIHN

Stéphane Pouyllau’s photo of me presenting

Michael Sinatra invited me to a “show and tell” workshop at the new Université de Montréal campus, where they have a long data wall. Sinatra is the Director of CRIHN (Centre de recherche interuniversitaire sur les humanités numériques) and kindly asked me to show what I am doing with Stéfan Sinclair and to see what others at CRIHN and in France are doing.

Conference notes for CSDH 2019

In early June I was at the Congress for the Humanities and Social Sciences. I took conference notes on the Canadian Society for Digital Humanities 2019 event and on the Canadian Game Studies Association conference, 2019. I was involved in a number of papers:

  • Exploring through Markup: Recovering COCOA. This paper looked at an experimental Voyant tool that allows one to use COCOA markup as a way of exploring a text in different ways. COCOA markup is a simple form of markup that was superseded by XML languages like those developed with the TEI. The paper recovered some of the history of markup and what we may have lost.

  • Designing for Sustainability: Maintaining TAPoR and This paper was presented by Holly Pickering and discussed the processes we have set up to maintain TAPoR and

  • Our team also had two posters, one on “Generative Ethics: Using AI to Generate” that showed a toy that generates statements about artificial intelligence and ethics. The other, “Discovering Digital Methods: An Exploration of Methodica for Humanists” showed what we are doing with

txtlab Multilingual Novels

This directory contains 450 novels that appeared between 1770 and 1930 in German, French and English. It is designed for use in teaching and research.

Andrew Piper mentioned a corpus that he put together, txtlab Multilingual Novels. This corpus contains some 450 novels from the late 18th century to the early 20th (1920s). It has a gender mix and is not limited to English novels. The corpus was supported by SSHRC through the Text Mining the Novel project.


Text Mining The Novel 2015


On Thursday and Friday (Oct. 22nd and 23rd) I was at the 2nd workshop for the Text Mining the Novel project. My conference notes are here: Text Mining The Novel 2015. We had a number of great papers on the issue of genre (this year’s topic). Here are some general reflections:

  • The obvious weakness of text mining is that it operates on the novel as text, specifically digital text (or string). We need to find ways to also study the novel as a material object (thing), as a social object, as a performance (of the reader), and as an economic object in a marketplace. Then we also have to find ways to connect these.
  • So many analytical and mining processes depend on bags of words, from dictionaries to topics. Is this a problem or a limitation? Can we try to abstract characters, plot, or argument?
  • I was interested in the philosophical discussions around the epistemological in novels and philosophical claims about language and literature.


NovelTM: Text Mining the Novel

This week SSHRC announced the new partnership grants awarded, including one I am a co-investigator on, NovelTM: Text Mining the Novel.

This project brings together researchers and partners from 21 different academic and non-academic institutions to produce the first large-scale quantitative history of the novel. Our aim is to bring new computational approaches in the field of text mining to the study of literature as well as bring the unique knowledge of literary studies to bear on larger debates about data mining and the place of information technology within society.

NovelTM is led by Andrew Piper at McGill University. At the University of Alberta I will be gathering a team that will share the resulting computing methods through TAPoR and develop recipes or tutorials so that others can try them.

Rap Game Riff Raff Textual Analysis

Tyler Trkowski has written a Feature for NOISEY (Music by Vice) on Rap Game Riff Raff Textual Analysis. It is a neat example of text analysis outside the academy. He used Voyant and Many Eyes to analyze Riff Raff’s lyrical canon. (Riff Raff, or Horst Christian Simco, is an eccentric rapper.) What is neat is that they embedded a Voyant word cloud right into their essay along with Word Trees from Many Eyes. Riff Raff apparently “might” like “diamonds” and “versace”.

Supporting Digital Scholarship

The Tri-Council Agencies (Research councils of Canada) and selected other institutions (going under the rubric TC3+) have released an important Consultation Document titled Capitalizing on Big Data: Toward a Policy Framework for Advancing Digital Scholarship in Canada. You can see a summary blog entry from the CommerceLab, How big data is reshaping the future of digital scholarship in Canada. The document suggests that we have many of the components of a “well-functioning digital infrastructure ecosystem for research and innovation”, but that these are not coordinated and Canada is not keeping up. They propose three initiatives:

  • Establishing a Culture of Stewardship
  • Coordination of Stakeholder Engagement
  • Developing Capacity and Future Funding Parameters

The first initiative is about research data management, something we have been working on in the digital humanities for some time. It is great to see a call from our funding agencies.

Tool Discourse

Character Density by Year in Tool Discourse

We are finally getting results in a long, slow process of trying to study tool discourse in the digital humanities. Amy Dyrbe and Ryan Chartier are building a corpus of discourse around tools that includes tool reviews, articles about what people are doing with tools, web pages about tools and so on. We took the first coherent chunk and Ryan has been analyzing it with R. The graph above shows which years have the most characters. My hypothesis was that tool reviews and discourse dropped off in the 1990s as the web became more important. This seems to be wrong.
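For readers who want to try a similar count on their own corpus, here is a minimal sketch in Python of summing characters by year. It is not the R code Ryan used. The function name `characters_per_year` and the assumption that each document arrives as a (year, text) pair are made up for illustration.

```python
from collections import defaultdict


def characters_per_year(documents):
    """Sum the number of characters in each document, grouped by year.

    `documents` is assumed to be an iterable of (year, text) pairs;
    that shape is a hypothetical choice for this sketch.
    """
    totals = defaultdict(int)
    for year, text in documents:
        totals[year] += len(text)
    # Return a plain dict ordered by year so it is easy to plot or print.
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    # Invented sample documents, not taken from the tool discourse corpus.
    sample = [
        (1988, "A review of a concordance program."),
        (1988, "Notes on text retrieval."),
        (1995, "Tools on the web."),
    ]
    for year, total in characters_per_year(sample).items():
        print(year, total)
```

The totals per year could then be handed to any plotting tool to draw a graph like the one described.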

Here are the high-frequency words (with stop words removed). Note the modal verbs “can”, “will”, and “may.” They indicate the potentiality of tools.

“can” 2305
“one” 1996
“text” 1940
“word” 1931
“words” 1859
“program” 1606
“ii” 1514 (Not sure why)
“will” 1361
“language” 1307
“data” 1285
“two” 1188
“system” 1183
“computer” 1116
“used” 1115
“use” 942
“user” 939
“file” 890
“first” 870
“may” 853
“also” 837
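
A list like this can be produced in a few lines. The following is a rough sketch in Python, not the R script used for the numbers above. The tiny stop word list, the simple letters-only tokenizer, and the name `top_words` are all assumptions for illustration; a real analysis would use a fuller stop word list.

```python
import re
from collections import Counter

# A deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "for", "with"}


def top_words(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, ignoring stop words."""
    # Lowercase the text and keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(token for token in tokens if token not in stop_words)
    return counts.most_common(n)


if __name__ == "__main__":
    # Invented sample sentence, not taken from the corpus.
    sample = "The tool can parse text and the tool can count words in a text."
    for word, count in top_words(sample, n=5):
        print(word, count)
```

One side effect of lowercasing is that a Roman numeral such as “II” is counted as “ii”, which may be one possible source of the odd entry in the list, though that is only a guess.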