Zampolli Prize Awarded to Voyant Tools

Spyral Notebook Detail (showing code cell and stacked graphs)

Yesterday I gave the triennial Zampolli Prize lecture that honoured Voyant. The lecture is given at the annual ADHO Digital Humanities conference, which this year is being hosted by the University of Tokyo. The award notice is here: Zampolli Prize Awarded to Voyant Tools. Some of the things I touched on in the talk included:

  • The genius of Stéfan Sinclair, who passed in August 2020. Voyant was his vision from the time of his dissertation, for which he developed HyperPo.
  • The global team of people involved in Voyant including many graduate research assistants at the U of Alberta. See the About page of Voyant.
  • How Voyant built on ideas Stéfan and I developed in Hermeneutica about collaborative research as opposed to the inherited solitary paradigm.
  • How we have now developed an extension to Voyant called Spyral. Spyral is a notebook programming environment built on JavaScript. It allows you to document your Voyant explorations, save parameters for corpora and tools, preprocess texts, postprocess results, and create new visualizations. It is, in short, a full data analysis and visualization environment built into Voyant so you can easily call up and explore results in Voyant’s already rich tool set.
  • In the image above you can see a Spyral code cell that outputs two stacked graphs where the same pattern of words is graphed over two different, but synchronized, corpora. You can thus compare the use of the pattern over time between the two datasets.
  • Replication as a practice for recovering an understanding of innovative technologies now taken for granted like tokenization or the KWIC. I talked about how Stéfan and I have been replicating important text processing technologies as a way of understanding the history of computing and the digital humanities. Spyral was the environment we developed for documenting our replications.
  • I then backed up and talked about the epistemological questions about knowledge and knowledge things in the digital age that grew out of and then inspired our experiments in replication. These go back to attempts to think-through tools as knowledge things that bear knowledge in ways that discourse doesn’t. In this context I talked about the DIKW pyramid (data, information, knowledge, wisdom) that captures current views about the relationships between data and knowledge.
  • Finally I called for help to maintain and extend Voyant/Spyral. I announced the creation of a consortium to bring us together to sustain Voyant.
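To make the replication point concrete, here is a minimal sketch in Python of two of the taken-for-granted technologies named above, tokenization and the KWIC (keyword in context). It is not the Voyant or Spyral implementation; the function names, the naive tokenizing rule, and the sample sentence are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens.

    The rule (runs of letters, digits, and apostrophes) is a deliberately
    naive assumption; real tokenizers make many more decisions.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return a (left context, keyword, right context) tuple for each hit."""
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


# Hypothetical sample text, only for illustration.
sample = "The tool reads the text so that the reader can read the text again."

for left, keyword, right in kwic(tokenize(sample), "text", width=2):
    print(f"{left:>12} | {keyword} | {right}")
```

Even a toy like this forces decisions (what counts as a word, how wide the context should be) that are easy to overlook in a finished tool, which is part of what replication is meant to recover.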

It was an honour to be able to give the Zampolli lecture on behalf of all the people who have made Voyant such a useful tool.

GPT-3 on Searle’s Chinese room argument

AI: I am an AI created by OpenAI. How can I help you today? Human: What do you think about the use of the Chinese room argument to defend the claim that a chatbot can never really understand what it is saying? AI: The Chinese room argument is a thought experiment that was first proposed by John Searle.

Blake Myers has posted a number of conversations they have had with Open AI’s GPT-3, including one titled GPT-3 on Searle’s Chinese room argument. What is intriguing is that Myers has had discussions about specific philosophical issues around AI, including the Chinese room argument, and GPT-3 appears to have answered coherently. The transcripts or short dialogues are made available and in some cases are not edited.

I can’t help imagining how this could be used by a smart student to write a paper dialogically. One could ask questions, edit the responses, concatenate them, and write some bridging text to get a decent paper. Of course, it might be less work to just write the paper yourself.

Giant, free index to world’s research papers released online

Catalogue of billions of phrases from 107 million papers could ease computerized searching of the literature.

From Ian I learned about a Giant, free index to world’s research papers released online. The General Index, as it is called, makes ngrams of up to 5 words available with pointers to relevant journal articles.
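As a rough sketch of what such an index involves, the Python below extracts every n-gram from unigrams to five-grams from a few documents and records which documents each one appears in. This is not the General Index's actual pipeline or file format; the function names, the whitespace tokenization, and the two made-up paper titles are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def index_ngrams(docs, max_n=5):
    """Map each n-gram (1 to max_n words) to a Counter of document ids.

    `docs` is a dict of {doc_id: text}. Tokenization is a plain lowercase
    whitespace split, which is an assumption made to keep the sketch short.
    """
    index = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                index.setdefault(gram, Counter())[doc_id] += 1
    return index


# Hypothetical two-document corpus, only for illustration.
docs = {
    "paper-1": "protein folding in yeast",
    "paper-2": "protein folding in plants",
}
index = index_ngrams(docs)
print(index["protein folding"])   # which papers contain this bigram
print(index["folding in yeast"])  # which papers contain this trigram
```

The index holds pointers and counts rather than the running text, which is the sense in which this kind of release can be called non-consumptive.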

The massive index is available from the Internet Archive here. Here is how it is described.

Public Resource, a registered nonprofit organization based in California, has created a General Index to scientific journals. The General Index consists of a listing of n-grams, from unigrams to five-grams, extracted from 107 million journal articles.

The General Index is non-consumptive, in that the underlying articles are not released, and it is transformative in that the release consists of the extraction of facts that are derived from that underlying corpus. The General Index is available for free download with no restrictions on use. This is an initial release, and the hope is to improve the quality of text extraction, broaden the scope of the underlying corpus, provide more sophisticated metrics associated with terms, and other enhancements.

Access to the full corpus of scholarly journals is an essential facility to the practice of science in our modern world. The General Index is an invaluable utility for researchers who wish to search for articles about plants, chemicals, genes, proteins, materials, geographical locations, and other entities of interest. The General Index allows scholars and students all over the world to perform specialized and customized searches within the scope of their disciplines and research over the full corpus.

Access to knowledge is a human right and the increase and diffusion of knowledge depends on our ability to stand on the shoulders of giants. We applaud the release of the General Index and look forward to the progress of this worthy endeavor.

There must be some neat uses of this. I wonder if someone like Google might make a diachronic viewer similar to their Google Books Ngram Viewer available?

Replication, Repetition, or Revivification

A short essay I wrote with Stéfan Sinclair on “Recapitulation, Replication, Reanalysis, Repetition, or Revivification” is now up in preprint form. The essay is part of a longer work on “Anatomy of tools: A closer look at ‘textual DH’ methodologies.” The longer work is a set of interventions looking at text tools. These came out of an ADHO SIG-DLS (Digital Literary Studies) workshop that took place in Utrecht in July 2019.

Our intervention at the workshop had the original title “Zombies as Tools: Revivification in Computer Assisted Interpretation” and concentrated on practices of exploring old tools – a sort of revivification or bringing back to life of zombie tools.

The full paper should be published soon by DHQ.

Trump Tweet Archive

All 50,000+ of Trump’s tweets, instantly searchable

Thanks to Kaylin I found the Trump Twitter Archive: TTA – Search. It’s a really nice clean site that lets you search or filter Trump’s tweets from when he was elected to when his account was shut down on January 8th, 2021. You can also download the data if you want to try other tools.

I find reading his tweets now to be quite entertaining. Here are two back-to-back tweets that seem to almost contradict each other. First he boasts about the delivery of vaccines, and then talks about Covid as Fake News!

Jan 3rd 2021 – 8:14:10 AM EST: The number of cases and deaths of the China Virus is far exaggerated in the United States because of @CDCgov’s ridiculous method of determination compared to other countries, many of whom report, purposely, very inaccurately and low. “When in doubt, call it Covid.” Fake News!

Jan 3rd 2021 – 8:05:34 AM EST: The vaccines are being delivered to the states by the Federal Government far faster than they can be administered!

ImageGraph: a visual programming language for the Visual Digital Humanities

Leonardo Impett has a nice demonstration here of ImageGraph: a visual programming language for the Visual Digital Humanities. ImageGraph is a visual programming environment that works with Google Colab. When you have your visual program you can compile it into Python in a Colab notebook and then run that notebook. The visual program is stored in your Github account and the Python code can, of course, be used in larger projects.

The visual programming language has a number of functions for handling images and using artificial intelligence techniques on them. It also has text functions, but they are apparently not fully worked out.

I love the way Impett combines off the shelf systems while adding a nice visual development environment. Very clean.

AI Dungeon and the Ethics of Text Generation

> You talk to yourself.

Hi I’m Rameau. You probably don’t know me, but since you’re reading this I’m going to assume you do indeed exist.

> You [ask why] i’m wondering what consciousness is.

I have no idea. But it’s fun to think about. Is this a person? No, because it’s conscious. Is it alive? No, because it doesn’t respond to external stimuli.

> You [ask] why does a person have to respond to external stimuli.

Because then we can interact with it!!!!! But isn’t consciousness an internal phenomenon? Then how would the person interact with anything else? Oh wait, that’s right. We’re not limited by our perception of reality.

I’ve been trying Latitude’s text generation game AI Dungeon. The game is built on Open AI’s GPT-3. You can see a bit of conversation I had with it in a cyberpunk world.

Latitude has gotten into trouble with Open AI because it seems that the game was generating erotic content featuring children. A number of people turned to AI Dungeon precisely because it could be used to explore adult themes, and that would seem to be a good thing, but then some may have gone too far. See the Wired story It Began as an AI-Fueled Dungeon Game. It Got Much Darker. This raises interesting ethical issues about:

  • Why do so many players use it to generate erotic content?
  • Who is responsible for the erotic content? Open AI, Latitude, or the players?
  • Whether there are ethical reasons to generate erotic content featuring children? Do we forbid people from writing novels like Lolita?
  • How to prevent inappropriate content without crippling the AI? Are filters enough?

The problem of AIs generating toxic language is nicely shown by this web page on Evaluating Neural Toxic Degeneration in Language Models. The interactives and graphs on the page let you see how toxic language can be generated by many of the popular language generation AIs. The problem seems to be the data sets used to train the machines like those that include scrapes of Reddit.

This exploratory tool illustrates research reported on in a paper titled RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. You can see a neat visualization of the connected papers here.

Celebrating Stéfan Sinclair: A Dialogue from 2007

Sadly, last Thursday Stéfan Sinclair passed away. A group of us posted an obituary for CSDH-SCHN here: Stéfan Sinclair, In Memoriam. Boy, do I miss him already. While the obituary describes the arc of his career, I’ve been trying to think of how to celebrate how he loved to play with ideas and code. The obituary tells the what of his life but doesn’t show the how.

You see, Stéfan loved to toy with ideas of text through the development of software toys. The hermeneuti.ca project started with a one day text analysis vacation/hackathon. We decided to leave all the busy work of being an academic in our offices, and spend a day in the TAPoR lab at McMaster. We decided to mess around and try the analytical equivalent of extreme programming. That included a version of “pair programming” where we alternated one at the keyboard doing the analysis while the other would take notes and direct. We told ourselves we would just devote one day without interruptions to this folly and see if together we could take a project from conception to some sort of finished result in a day.

Little did we know we would still be at play right until a few weeks ago. We failed to finish that day, but we got far enough to know we enjoyed the fooling around enough to do it again and again. Those escapes into what we later called agile hermeneutics, to give it a serious name, eventually led to a monster of a project that reflected back on the play. The project culminated in the jointly authored book Hermeneutica (MIT Press, 2016) and Voyant 2.0, both of which tried to not only think-through some of the potential of the play, but also give others a way of making their own interpretative toys (which we called hermeneutica). But these too are perhaps too serious to commemorate Stéfan’s presence.

Which brings me to the dialogue we wrote and performed on “Reading Tools.” Thanks to Susan I was reminded of this script that we acted out at the University of Illinois, Urbana-Champaign in June of 2007. May it honour how Stéfan would want to be remembered. Imagine him smiling at the front of the room as he starts,

Sinclair: Why do we care so much for the opinions of other humanists? Why do we care so much whether they use computing in the humanities?

Rockwell: Let me tell you an old story. There was once a titan who invented an interpretative technology for his colleagues. No, … he wasn’t chained to a rock to have his liver chewed out daily. … Instead he did the smart thing and brought it to his dean, convinced the technology would free his colleagues from having to interpret texts and let them get back to the real work of thinking.

Sinclair: I imagine his dean told him that in the academy those who develop tools are not the best judges of their inventions and that he had to get his technology reviewed as if it were a book.

Rockwell: Exactly, and the dean said, “And in this instance, you who are the father of a text technology, from a paternal love of your own children have been led to attribute to them a quality which they cannot have; for this discovery of yours will create forgetfulness in the learners’ souls, because they will not study the old ways; they will trust to the external tools and not interpret for themselves. The technology which you have discovered is an aid not to interpretation, but to online publishing.”

Sinclair: Yes, Geoffrey, you can easily tell jokes about the academy, paraphrasing Socrates, but we aren’t outside the city walls of Athens, but in the middle of Urbana at a conference. We have a problem of audience – we are slavishly trying to please the other – that undigitized humanist – why don’t we build just for ourselves? …

Enjoy the full dialogue here: Reading Tools Script (PDF).

Google Developers Blog: Text Embedding Models Contain Bias. Here’s Why That Matters.

Human data encodes human biases by default. Being aware of this is a good start, and the conversation around how to handle it is ongoing. At Google, we are actively researching unintended bias analysis and mitigation strategies because we are committed to making products that work well for everyone. In this post, we’ll examine a few text embedding models, suggest some tools for evaluating certain forms of bias, and discuss how these issues matter when building applications.

On the Google Developers Blog there is an interesting post on Text Embedding Models Contain Bias. Here’s Why That Matters. The post talks about a technique for using Word Embedding Association Tests (WEAT) to compare different text embedding algorithms. The idea is to see whether groups of words like gendered words associate with positive or negative words. In the image above you can see the sentiment bias for female and male names for different techniques.

While Google is working on WEAT to try to detect and deal with bias, in our case this technique could be used to identify forms of bias in corpora.
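To show how such a test works, here is a minimal sketch of a WEAT-style effect size in Python. It follows the usual formulation (differences in mean cosine similarity, normalized by a standard deviation), but it is not Google's code. The function names are my own, the two-dimensional vectors are made up, and using the sample standard deviation is an assumption, since implementations differ on that detail.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, A, B):
    """How much closer w is to attribute set A than to attribute set B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])


def weat_effect_size(X, Y, A, B):
    """WEAT-style effect size for target sets X, Y and attribute sets A, B.

    Positive values mean X leans towards A (and Y towards B).
    The sample standard deviation is used here; this is an assumption.
    """
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)


# Made-up 2-d "embeddings", only for illustration. In practice these would
# be looked up in a trained embedding model for chosen word lists.
X = [(1.0, 0.1), (0.9, 0.2)]   # e.g. one group of names
Y = [(0.1, 1.0), (0.2, 0.9)]   # e.g. another group of names
A = [(1.0, 0.0)]               # e.g. pleasant words
B = [(0.0, 1.0)]               # e.g. unpleasant words

print(weat_effect_size(X, Y, A, B))
```

To use this on a corpus, as suggested above, one would train embeddings on the corpus itself and then compare word lists of interest.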