Google has announced some cool text projects. See Google AI experiment has you talking to books. One of them, Talk to Books, lets you ask questions or type statements and get answers that are passages from books. This strikes me as a useful research tool as it allows you to see some (book) references that might be useful for defining an issue. The project is somewhat similar to the Veliza tool that we built into Voyant. Veliza is given a particular text and then uses an Eliza-like algorithm to answer you with passages from the text. Needless to say, Talking to Books is far more sophisticated and is not based simply on word searches. Veliza, on the other hand can be reprogrammed and you can specify the text to converse with.
The New York Times has just published a story about How Trump Consultants Exploited the Facebook Data of Millions. The story is about how Cambridge Analytica, the US arm of SCL, a UK company, gathered a massive dataset from Facebook with which to do “psychometric modelling” in order to benefit Trump.
The Guardian has been reporting on Cambridge Analytica for some time – see their Cambridge Analytica Files. The service they are supposed to have provided with this massive dataset was to model types of people and their needs/desires/politics and then help political campaigns, like Trump’s, through microtargeting to influence voters. Using the models a campaign can create content tailored to these psychometrically modelled micro-groups to shift their opinions. (See articles by Paul-Olivier Dehaye about what Cambridge Analytica does and has.)
What is new is that there is a (Canadian) whistleblower from Cambridge Analytica, Christopher Wylie who was willing to talk to the Guardian and others. He is “the data nerd who came in from the cold” and he has a trove of documents that contradict what other said.
The Intercept has a earlier and related story about how Facebook Failed to Protect 30 Million Users From Having Their Data Harvested By Trump Campaign Affiliate. This tells how people were convinced to download a Facebook app that then took your data and that of their friends.
It is difficult to tell how effective the psychometric profiling with data is and if can really be used to sway voters. What is clear, however, is that Facebook is not really protecting their users’ data. To some extent their set up to monetize such psychometric data by convincing those who buy access to the data that you can use it to sway people. The problem is not that it can be done, but that Facebook didn’t get paid for this and are now getting bad press.
The question I want to explore today is this: what do we do about distant reading, now that we know that Franco Moretti, the man who coined the phrase “distant reading,” and who remains its most famous exemplar, is among the men named as a result of the #MeToo movement.
Lauren Klein has posted an important blog entry on Distant Reading after Moretti. This essay is based on a talk delivered at the 2018 MLA convention for a panel on Varieties of Digital Humanities. Klein asks about distant reading and whether it shelters sexual harassment in some way. She asks us to put not just the persons, but the structures of distant reading and the digital humanities under investigation. She suggests that it is “not a coincidence that distant reading does not deal well with gender, or with sexuality, or with race.” One might go further and ask if the same isn’t true of the digital humanities in general or the humanities, for that matter. Klein then suggests some thing we can do about it:
- We need more accessible corpora that better represent the varieties of human experience.
- We need to question our models and ask about what is assumed or hidden.
— DH at USF (@UsfDh) February 27, 2018
Last week I presented a paper based on work that Stéfan Sinclair and I are doing at the University of South Florida. The talk, titled, “Cooking Up Literature: Theorizing Statistical Approaches to Texts” looked at a neglected period of French innovation in the 1970s and 1980s. During this period the French were developing a national corpus, FRANTEXT, while there was also a developing school of exploratory statistics around Jean-Paul Benzécri. While Anglophone humanities computing was concerned with hypertext, the French were looking at using statistical methods like correspondence analysis to explore large corpora. This is long before Moretti and “distant reading.”
The talk was organized by Steven Jones who holds the DeBartolo Chair in Liberal Arts and is a Professor of Digital Humanities. Steven Jones leads a NEH funded project called RECALL that Stéfan and I are consulting on. Jones and colleagues at USF are creating a 3D model of Father Busa’s original factory/laboratory.
Last week I presented a keynote at the Digital Cultures, Big Data and Society conference. (You can seem my conference notes at Digital Cultures Big Data And Society.) The talk I gave was titled “Thinking-Through Big Data in the Humanities” in which I argued that the humanities have the history, skills and responsibility to engage with the topic of big data:
- First, I outlined how the humanities have a history of dealing with big data. As we all know, ideas have histories, and we in the humanities know how to learn from the genesis of these ideas.
- Second, I illustrated how we can contribute by learning to read the new genres of documents and tools that characterize big data discourse.
- And lastly, I turned to the ethics of big data research, especially as it concerns us as we are tempted by the treasures at hand.
Having just finished teaching a course on Big Data and Text Analysis where I taught students Python I can appreciate a well written tutorial on Python. Python Programming for the Humanities by Folgert Karsdorp is a great tutorial for humanists new to programming that takes the form of a series of Jupyter notebooks that students can download. As the tutorials are notebooks, if students have set up Python on their computers then they can use the tutorials interactively. Karsdorp has done a nice job of weaving in cells where the student has to code and Quizes which reinforce the materials which strikes me as an excellent use of the IPython notebook model.
I learned about this reading a more advanced set of tutorials from Allen Riddell for Dariah-DE, Text Analysis with Topic Models for the Humanities and Social Sciences. The title doesn’t do this collection of tutorials justice because they include a lot more than just Topic Models. There are advanced tutorials on all sorts of topics like machine learning and classification. See the index for the range of tutorials.
Text Analysis with Topic Models for the Humanities and Social Sciences (TAToM) consists of a series of tutorials covering basic procedures in quantitative text analysis. The tutorials cover the preparation of a text corpus for analysis and the exploration of a collection of texts using topic models and machine learning.
Stéfan Sinclair and I (mostly Stéfan) have also produced a textbook for teaching programming to humanists called The Art of Literary Text Analysis. These tutorials are also written as Jupyter notebooks so you can download them and play with them.
We are now reimplementing them with our own Voyant-based notebook environment called Spyral. See The Art of Literary Text Analysis with Spyral Notebooks. More on this in another blog entry.
This directory contains 450 novels that appeared between 1770 and 1930 in German, French and English. It is designed for us in teaching and research.
Andrew Piper mentioned a corpus that he put together, txtlab Multilingual Novels. This corpus is of some 450 novels from the late 18th century to the early 20th (1920s). It has a gender mix and is not only English novels. This corpus was supported by SSHRC through the Text Mining the Novel project.
The Common Crawl is a project that has been crawling the web and making an open corpus of web data from the last 7 years available for research. There crawl corpus is petabytes of data and available as WARCs (Web Archives.) For example, their 2013 dataset is 102TB and has around 2 billion web pages. Their collection is not as complete as the Internet Archive, which goes back much further, but it is available in large datasets for research.
The Naylor Report (PDF) about research funding in Canada is out and we put it in Voyant. Here are some different
- Here is the default Corpus View
- Here it is in the Topics (Topic Modelling) View
- Here is the Scatter Plot (Correspondence Analysis) View (see image above)
Domenico Fiormonte has recently blogged about an interesting document he has by Father Busa that relates to a difficult moment in the history of the digital humanities in Italy in 2002. The two page “Conditional Agreement”, which I translate below, was given to Domenico and explained the terms under which Busa would agree to sign a letter to the Minister (of Education and Research) Moratti in response to Moratti’s public statement about the uselessness of humanities informatics. A letter was being prepared to be signed by a large number of Italian (and foreign) academics explaining the value of what we now call the digital humanities. Busa had the connections to get the letter published and taken seriously for which reason Domenico visited him to get his help, which ended up being conditional on certain things being made clear, as laid out in the document. Domenico kept the two pages Busa wrote and recently blogged about them. As he points out in his blog, these two pages are a mini-manifesto of Father Busa’s later views of the place and importance of what he called textual informatics. Domenico also points out how political is the context of these notes and the letter eventually signed and published. Defining the digital humanities is often about positioning the field in the larger academic and public political spheres we operate in.