Text Mining The Novel 2015


On Thursday and Friday (Oct. 22nd and 23rd) I was at the 2nd workshop for the Text Mining the Novel project. My conference notes are here: Text Mining The Novel 2015. We had a number of great papers on the issue of genre (this year’s topic). Here are some general reflections:

  • The obvious weakness of text mining is that it operates on the novel as text, specifically digital text (or string). We need to find ways to also study the novel as a material object (thing), as a social object, as a performance (of the reader), and as an economic object in a marketplace. Then we also have to find ways to connect these.
  • So many analytical and mining processes depend on bags of words, from dictionaries to topics. Is this a problem or a limitation? Can we try to abstract characters, plot, or argument?
  • I was interested in the philosophical discussions around the epistemological in novels and philosophical claims about language and literature.


NovelTM: Text Mining the Novel

This week SSHRC announced the new partnership grants it has awarded, including one on which I am a co-investigator, NovelTM: Text Mining the Novel.

This project brings together researchers and partners from 21 different academic and non-academic institutions to produce the first large-scale quantitative history of the novel. Our aim is to bring new computational approaches in the field of text mining to the study of literature as well as bring the unique knowledge of literary studies to bear on larger debates about data mining and the place of information technology within society.

NovelTM is led by Andrew Piper at McGill University. At the University of Alberta I will be gathering a team that will share the resulting computing methods through TAPoR and develop recipes or tutorials so that others can try them.

Rap Game Riff Raff Textual Analysis

Tyler Trkowski has written a Feature for NOISEY (Music by Vice) on Rap Game Riff Raff Textual Analysis. It is a neat example of text analysis outside the academy. He used Voyant and Many Eyes to analyze Riff Raff’s lyrical canon. (Riff Raff, or Horst Christian Simco, is an eccentric rapper.) What is neat is that they embedded a Voyant word cloud right into their essay along with Word Trees from Many Eyes. Riff Raff apparently “might” like “diamonds” and “versace”.

Supporting Digital Scholarship

The Tri-Council Agencies (Research councils of Canada) and selected other institutions (going under the rubric TC3+) have released an important Consultation Document titled Capitalizing on Big Data: Toward a Policy Framework for Advancing Digital Scholarship in Canada. You can see a summary blog entry from the CommerceLab, How big data is reshaping the future of digital scholarship in Canada. The document suggests that we have many of the components of a “well-functioning digital infrastructure ecosystem for research and innovation”, but that these are not coordinated and Canada is not keeping up. They propose three initiatives:

  • Establishing a Culture of Stewardship
  • Coordination of Stakeholder Engagement
  • Developing Capacity and Future Funding Parameters

The first initiative is about research data management, something we have been working on in the digital humanities for some time. It is great to see a call from our funding agencies.

Tool Discourse

Figure: Character Density by Year in Tool Discourse

We are finally getting results in a long slow process of trying to study tool discourse in the digital humanities. Amy Dyrbe and Ryan Chartier are building a corpus of discourse around tools that includes tool reviews, articles about what people are doing with tools, web pages about tools and so on. We took the first coherent chunk and Ryan has been analyzing it with R. The graph above shows which years have the most characters. My hypothesis was that tool reviews and discourse dropped off in the 1990s as the web became more important. This seems to be wrong.
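For anyone who wants to try a similar count, here is a minimal sketch in Python of how characters per year might be totalled. Ryan's actual analysis was done in R and is not reproduced here; the function name `characters_per_year`, the `(year, text)` record layout, and the sample records are all assumptions made up for illustration.

```python
from collections import defaultdict


def characters_per_year(documents):
    """Sum the character length of each document's text, grouped by year.

    `documents` is an iterable of (year, text) pairs (a hypothetical layout).
    """
    totals = defaultdict(int)
    for year, text in documents:
        totals[year] += len(text)
    # Return a plain dict ordered by year so it is easy to plot or print.
    return dict(sorted(totals.items()))


# Hypothetical sample records, not taken from the real corpus.
sample = [
    (1988, "A review of a concordance program."),
    (1995, "Notes on a text analysis tool."),
    (1988, "Another tool review."),
]
totals = characters_per_year(sample)

for year, count in totals.items():
    print(year, count)
```

A raw character total like this favours years with a few very long documents, so it may be worth comparing it with a simple count of documents per year.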

Here are the high-frequency words (with stop words removed). Note the modal verbs “can”, “will”, and “may.” They indicate the potentiality of tools.

“can” 2305
“one” 1996
“text” 1940
“word” 1931
“words” 1859
“program” 1606
“ii” 1514 (Not sure why)
“will” 1361
“language” 1307
“data” 1285
“two” 1188
“system” 1183
“computer” 1116
“used” 1115
“use” 942
“user” 939
“file” 890
“first” 870
“may” 853
“also” 837
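
A list like this can be produced with a few lines of code. The sketch below is in Python rather than the R we actually used, and it is only an illustration: the tiny stop word list, the letters-only tokenizer, the function name `top_words`, and the sample sentence are all assumptions, not the settings behind the numbers above.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "with"}


def top_words(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, ignoring stop words."""
    # Lowercase everything and keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(token for token in tokens if token not in stop_words)
    return counts.most_common(n)


# Hypothetical sample text, not taken from the real corpus.
sample_text = (
    "The program can read a text file. The user can search the text, "
    "and the program will list words."
)
top = top_words(sample_text, n=5)

for word, count in top:
    print(f'"{word}" {count}')
```

Note that this tokenizer lowercases everything, so a token such as "II" would be counted as "ii", and that "word" and "words" are counted separately because nothing is stemmed.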

Digitization Day

The CIRCA Histories and Archives group I am part of is organizing the University of Alberta’s first Digitization Day.

This one-day event is a chance for research projects that are digitizing evidence to meet up with each other and with units on campus that provide relevant research services. Projects that are creating digital archives of different sorts will give short presentations as will units on campus that support research.

The idea is to bring a lot of digitization projects together to learn about each other and what is happening on campus. My sense is that we have hit a critical mass on campus, and now that we have a trusted digital repository, ERA (Education and Research Archive), it is time to start talking and sharing knowledge. Each project should not have to reinvent itself.

TAPoR portal has moved

The TAPoR Portal has moved to a new server at the University of Alberta. The new location will allow us here to start redesigning it and developing version 2.0. (Or is it now version 3.0?) I underestimated how much work it is to move something so complex. We had to work on bugs, we had to warn users, we had to set up hardware here. Kamal Ranaweera worked very hard to do this – Bravo!

Towards a Methods Commons

Well, my vacation is over and I’m facilitating a retreat on text methods across disciplines. (See Towards a Methods Commons.) With support from the ITST program at SSHRC we brought together 15 linguists, philosophers, historians, and literary scholars to discuss methods in a structured way. The goal is to sketch a commons that gathers “recipes” that show people how to do research things with electronic texts. Stay tuned for a draft web site in about 6 months.