Michael Sinatra invited me to a “show and tell” workshop at the new Université de Montréal campus where they have a long data wall. Sinatra is the Director of CRIHN (Centre de recherche interuniversitaire sur les humanités numériques) and kindly invited me to show what I am doing with Stéfan Sinclair and to see what others at CRIHN and in France are doing.
In early June I was at the Congress for the Humanities and Social Sciences. I took conference notes on the Canadian Society for Digital Humanities 2019 event and on the Canadian Game Studies Association conference, 2019. I was involved in a number of papers:
- Exploring through Markup: Recovering COCOA. This paper looked at an experimental Voyant tool that allows one to use COCOA markup as a way of exploring a text in different ways. COCOA markup is a simple form of markup that was superseded by XML languages like those developed with the TEI. The paper recovered some of the history of markup and what we may have lost.
- Our team also had two posters, one on “Generative Ethics: Using AI to Generate” that showed a toy that generates statements about artificial intelligence and ethics. The other, “Discovering Digital Methods: An Exploration of Methodica for Humanists” showed what we are doing with Methodi.ca.
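For readers unfamiliar with COCOA, the idea is that a tag like `<S Hamlet>` sets a category (here, a hypothetical speaker category “S”) that holds until the next tag with the same letter, which lets you slice a text by category values. Here is a minimal sketch of that kind of exploration; the sample text, tag letters, and parser are illustrative assumptions, not the Voyant tool's actual implementation:

```python
import re
from collections import defaultdict

# A tiny sample in COCOA-style markup (hypothetical text).
# Each <CATEGORY value> tag sets that category until the next
# tag with the same category letter appears.
sample = """<S Hamlet>
To be, or not to be, that is the question.
<S Ophelia>
Good my lord, how does your honour?
<S Hamlet>
I humbly thank you; well, well, well.
"""

def cocoa_segments(text):
    """Yield (state, line) pairs, where state maps category -> current value."""
    state = {}
    for line in text.splitlines():
        m = re.match(r"<(\w+)\s+([^>]+)>", line.strip())
        if m:
            state[m.group(1)] = m.group(2)  # update the running context
        elif line.strip():
            yield dict(state), line

# Collect the lines spoken under each value of the S (speaker) category.
by_speaker = defaultdict(list)
for state, line in cocoa_segments(sample):
    by_speaker[state.get("S")].append(line)

print(sorted(by_speaker))  # the distinct speakers seen in the sample
```

Because the tags carry state forward rather than nesting, a COCOA text can be explored along any category independently, which is part of what the paper argues was lost in the move to hierarchical XML.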
This directory contains 450 novels that appeared between 1770 and 1930 in German, French and English. It is designed for use in teaching and research.
Andrew Piper mentioned a corpus that he put together, txtlab Multilingual Novels. This corpus is of some 450 novels from the late 18th century to the early 20th (1920s). It has a gender mix and is not only English novels. This corpus was supported by SSHRC through the Text Mining the Novel project.
On Thursday and Friday (Oct. 22nd and 23rd) I was at the 2nd workshop for the Text Mining the Novel project. My conference notes are here: Text Mining The Novel 2015. We had a number of great papers on the issue of genre (this year’s topic). Here are some general reflections:
- The obvious weakness of text mining is that it operates on the novel as text, specifically digital text (or string). We need to find ways to also study the novel as material object (thing), as a social object, as a performance (of the reader), and as an economic object in a marketplace. Then we also have to find ways to connect these.
- So many analytical and mining processes depend on bags of words, from dictionaries to topics. Is this a problem or a limitation? Can we try to abstract characters, plot, or argument?
- I was interested in the philosophical discussions around the epistemological dimensions of novels and philosophical claims about language and literature.
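The bag-of-words reduction mentioned above can be illustrated with a minimal sketch (the example sentence is arbitrary): word order, character, and plot all disappear, and only token counts survive, which is precisely the limitation the reflection points at.

```python
import re
from collections import Counter

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness.")

# A bag of words discards order, plot, and character:
# only token counts survive the reduction.
bag = Counter(re.findall(r"[a-z']+", text.lower()))
print(bag.most_common(3))
```

Dictionaries, topic models, and most frequency-based mining all start from this representation, which is why abstracting plot or argument requires going beyond it.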
This project brings together researchers and partners from 21 different academic and non-academic institutions to produce the first large-scale quantitative history of the novel. Our aim is to bring new computational approaches in the field of text mining to the study of literature as well as bring the unique knowledge of literary studies to bear on larger debates about data mining and the place of information technology within society.
NovelTM is led by Andrew Piper at McGill University. At the University of Alberta I will be gathering a team that will share the resulting computing methods through TAPoR and developing recipes or tutorials so that others can try them.
Tyler Trkowski has written a feature for NOISEY (Music by Vice) on Rap Game Riff Raff Textual Analysis. It is a neat example of text analysis outside the academy. He used Voyant and Many Eyes to analyze Riff Raff’s lyrical canon. (Riff Raff, or Horst Christian Simco, is an eccentric rapper.) What is neat is that they embedded a Voyant word cloud right into their essay along with Word Trees from Many Eyes. Riff Raff apparently “might” like “diamonds” and “versace”.
The Tri-Council Agencies (Research councils of Canada) and selected other institutions (going under the rubric TC3+) have released an important Consultation Document titled Capitalizing on Big Data: Toward a Policy Framework for Advancing Digital Scholarship in Canada. You can see a summary blog entry from the CommerceLab, How big data is reshaping the future of digital scholarship in Canada. The document suggests that we have many of the components of a “well-functioning digital infrastructure ecosystem for research and innovation”, but that these are not coordinated and Canada is not keeping up. They propose three initiatives:
- Establishing a Culture of Stewardship
- Coordination of Stakeholder Engagement
- Developing Capacity and Future Funding Parameters
The first initiative is about research data management, something we in the digital humanities have been working on for some time. It is great to see such a call from our funding agencies.
We are finally getting results in a long slow process of trying to study tool discourse in the digital humanities. Amy Dyrbe and Ryan Chartier are building a corpus of discourse around tools that includes tool reviews, articles about what people are doing with tools, web pages about tools and so on. We took the first coherent chunk and Ryan has been analyzing it with R. The graph above shows which years have the most characters. My hypothesis was that tool reviews and discourse dropped off in the 1990s as the web became more important. This seems to be wrong.
Here are the high-frequency words (with stop words removed). Note the modal verbs “can”, “will”, and “may.” They indicate the potentiality of tools.
“ii” 1514 (Not sure why)
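As a rough illustration of the kind of counting behind a list like this, here is a minimal Python sketch; the sample sentence and the tiny stop list are assumptions for illustration, not the R code or stop list actually used in the study:

```python
import re
from collections import Counter

# Hypothetical stop list; the real analysis used a fuller one in R.
STOP = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it"}

text = ("This tool can display concordances, and it will also count "
        "word frequencies so that one may compare texts.")

# Tokenize, lowercase, and drop stop words before counting.
tokens = re.findall(r"[a-z]+", text.lower())
freqs = Counter(t for t in tokens if t not in STOP)
print(freqs.most_common(5))
```

Even in this toy sentence the modal verbs “can”, “will”, and “may” survive stop-word removal, which is why they can surface as high-frequency words signalling the potentiality of tools.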
Google has released a neat new tool that uses their Google Books database. The Google Ngram Viewer lets you plot the relative frequencies of words and phrases over time.
Information about the tool can be found at http://ngrams.googlelabs.com/info.
The graph above shows “truth” (blue) graphed against “false” (red).
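The “relative frequency” such a viewer plots is simply a word’s count in a given year divided by the total number of tokens published that year, so that growing corpus size does not inflate the curve. A minimal sketch with made-up counts (the numbers below are invented, not Google Books data):

```python
# Made-up yearly token counts standing in for the Google Books data.
counts = {
    1900: {"truth": 120, "false": 40, "_total": 1_000_000},
    1950: {"truth": 300, "false": 90, "_total": 2_500_000},
    2000: {"truth": 450, "false": 210, "_total": 5_000_000},
}

def relative_frequency(word, year):
    """Share of the year's tokens made up by `word`."""
    year_counts = counts[year]
    return year_counts.get(word, 0) / year_counts["_total"]

# One such series per word is what gets plotted against time.
series = {year: relative_frequency("truth", year) for year in sorted(counts)}
print(series)
```

Plotting one such series per word against the years yields exactly the kind of comparison graph described above.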