Teaching machines to understand – and summarize – text is an article from The Conversation about the use of machine learning in text summarization. The example they give is how machines could summarize software licenses in ways that would make them more meaningful to us. While this seems a potentially useful application, I can’t help wondering why we don’t expect the licensors to summarize their licenses in ways that we can read. Or, barring that, why not make cartoon versions of the agreements like Terms and Conditions.
The issues raised by the use of computers in summarizing texts are many:
What is proposed would only work in a constrained situation like licenses where the machine can be trained to classify text following some sort of training set. It is unlikely to surprise you with poetry (not that it is meant to.)
The idea is introduced with the ultimate goal of reducing all the exabytes of data that we have to deal with. This is the “too much information” trope again. The proposed solution doesn’t really deal with the problems that have beguiled us since we started complaining, because part of the problem is too much information of unknown types. That is not to say that machine learning doesn’t have a place, but it won’t solve the underlying problem (again.)
How would the licensors react if we had tools to digest the text we have to deal with? The licensors will have to think about the legal liability (or advantage) of presenting text we won’t read, but which will be summarized for us. They might choose to be opaque to analytics to force us to read for ourselves.
Which raises the question of just what is the problem with too much information? Is it the expectation that we will consume it in some useful way? Is it that we have no time left for just thinking? Is it that we are constantly afraid that someone will have said something important already and we missed it?
A wise colleague asked what it would take for something to change us? Are we open to change when we think of too-much-information as something to be handled? Could machine learning become another wall in the interpretative ghetto we build around us?
The paper I gave discussed the surveillance software Palantir as a story-telling environment. Palantir is designed not to automate intelligence work, but to augment the analyst and provide them a sandbox where they can try stories about groups of people.
On Friday I delivered the opening keynote at a conference, Colloque ACFAS 2017 « La publication savante en contexte numérique », organized by CRIHN. The keynote was on “Hermeneutica: Le dialogue du texte et le jeu de l’interprétation,” presenting work Stéfan Sinclair and I have been doing on how to integrate text and tools. The context of the talk was a previous colloquium organized by CRIHN:
After a first colloquium at ACFAS by the Centre de Recherche Interuniversitaire sur les Humanités Numériques in 2014 (on the need to analyze the impact of the digital on the humanities), the objective of our colloquium in 2017 is to rethink, from a theoretical and practical point of view, scholarly publishing in the digital age.
In the talk I demonstrated a new tool based on Eliza that we call Veliza. Veliza implements Weizenbaum’s Eliza algorithm but adds the ability to pull a random sentence from the text you are analyzing and send that to the machine. The beta version (not the standard one yet) I was using had two other features.
First, it allows you to ask for things like “the occurrences of father” and it responds with a Voyant panel in the dialogue.
Second, it allows you to edit the script that controls Veliza so you can ask it to respond to different keywords.
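To make the idea concrete, here is a minimal sketch of an Eliza-style responder that can also pull a random sentence from a text and respond to it. It is not Veliza's actual code: the script rules, the function names, and the naive sentence splitter are all assumptions made for illustration, and it leaves out the pronoun reflection of Weizenbaum's original as well as the Voyant panel feature.

```python
import random
import re

# Hypothetical script: keyword pattern -> response templates.
# These rules are illustrative only; they are not Veliza's actual script.
SCRIPT = [
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), ["Why do you feel {0}?"]),
    (re.compile(r"\b(father|mother)\b", re.IGNORECASE), ["Tell me more about your {0}."]),
]
DEFAULT_RESPONSES = ["Please go on."]


def split_sentences(text):
    """Naively split a text into sentences on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def respond(utterance, script=SCRIPT, rng=random):
    """Return a response from the first script rule whose keyword matches."""
    for pattern, templates in script:
        match = pattern.search(utterance)
        if match:
            # Drop trailing punctuation so the captured text fits the template.
            captured = match.group(1).rstrip(".!?")
            return rng.choice(templates).format(captured)
    return rng.choice(DEFAULT_RESPONSES)


def respond_from_text(text, script=SCRIPT, rng=random):
    """Pick a random sentence from the text and send it to the responder."""
    sentence = rng.choice(split_sentences(text))
    return sentence, respond(sentence, script=script, rng=rng)
```

Because the script is just a list of patterns and templates, editing it so the responder reacts to different keywords amounts to passing a different list as `script`.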
This talk was actually the first time we have shown either Veliza or Spiral. Both are still in beta, but will be coming soon to the Voyant distribution.
Thanks to Humanist I came across this project that offers bwFLA: Emulation as a Service. This will become increasingly important in the digital humanities and game studies as more and more content-rich projects become unreadable on contemporary machines. Just think of the CD-ROM. How many of us still have a CD drive on our computer? I think I have a USB drive somewhere … not sure where it is though. Emulation projects like this and MAME are becoming more and more important to preservation and history.
Researchers in the humanities and social sciences are using digital infrastructure to help advance their research as well, and a Canadian-made tool called Voyant is allowing those who work with texts to do it with ease.
The story points out that Voyant may have more unique users than any other tool on Compute Canada, which is gratifying to read. This doesn’t mean more research is supported by Voyant, or more important research; comparisons are not really useful. What is more important is that the way humanists use infrastructure is different and being recognized. Humanists typically aren’t doing “big science.” They don’t need thousands of processors and batch interfaces. They want a more interactive and “always on” type of service. Compute Canada has listened and has been supporting our style/pace of infrastructure. Bravo!
Every year the University of Alberta Libraries organizes a Research Data Management Week to bring faculty, staff, students, and community data specialists together around data management. I was part of a panel session today on the subject. One of the issues we discussed was how to deal with a likely requirement from funding agencies like SSHRC that Research Data Management Plans be submitted with grants. Some thoughts on this:
Researchers will initially need help understanding what a DMP (Data Management Plan) is. The Portage Network DMP Assistant can help, but many will need an introduction to the issues.
Research universities and libraries will need to develop strategies for supporting projects to meet their new obligations. We will need the infrastructure to match.
There will be pushback from some scholarly associations. Others, like CSDH-SCHN, will welcome this as we have policies that support the idea.
There is a cost to properly curating, documenting and depositing research data. This cost comes typically at the end of projects when the funds are spent. We will need to do a better job budgeting for data management/deposit.
We need to develop small grants and services for projects to help them go the last mile in curating and depositing their content. At the Kule Institute we developed CRAfT grants in partnership with the UofA Libraries. These grants are meant for prototyping digital archives. Now we need to think about a program to help with the final archiving.
Reading Cartographies of Time by Rosenberg and Grafton, I was struck by one early visual presentation of time by Peter of Poitiers. It has the features of both a family tree or genealogy and a timeline. It is spread over pages in a manuscript with text in between vertically flowing lines. There are little portraits of the people. What can we learn from the imaginative designs of past designers of time charts?
This English manuscript was created in the early thirteenth century soon after the death of its author, Peter of Poitiers, theologian and Chancellor of the University of Paris from 1193 to 1205. It is an early copy of his text, the Compendium historiae in genealogia Christi. Intended as a teaching aid, the work provides a visual genealogy of Christ comprised of portraits in roundels, accompanied by a text discussing the historical background of Christ’s lineage.
On May 4th we will be running our annual online Around the World conference. This year the topic is Digital Media in the Post-Truth Era. Anyone can tune in to hear panels talking on this subject from around the world.