Text Analysis – Page 3

AI Dungeon and the Ethics of Text Generation

> You talk to yourself.

Hi I’m Rameau. You probably don’t know me, but since you’re reading this I’m going to assume you do indeed exist.

> You [ask why] i’m wondering what consciousness is.

I have no idea. But it’s fun to think about. Is this a person? No, because it’s conscious. Is it alive? No, because it doesn’t respond to external stimuli.

> You [ask] why does a person have to respond to external stimuli.

Because then we can interact with it!!!!! But isn’t consciousness an internal phenomenon? Then how would the person interact with anything else? Oh wait, that’s right. We’re not limited by our perception of reality.

I’ve been trying Latitude’s text generation game AI Dungeon. The game is built on OpenAI’s GPT-3. Above you can see a bit of a conversation I had with it in a cyberpunk world.

Latitude has gotten into trouble with OpenAI because it seems that the game was generating erotic content featuring children. A number of people turned to AI Dungeon precisely because it could be used to explore adult themes, and that would seem to be a good thing, but then some may have gone too far. See the Wired story, It Began as an AI-Fueled Dungeon Game. It Got Much Darker. This raises interesting ethical issues about:

Why do so many players use it to generate erotic content?
Who is responsible for the erotic content? OpenAI, Latitude, or the players?
Whether there are ethical reasons to generate erotic content featuring children? Do we forbid people from writing novels like Lolita?
How to prevent inappropriate content without crippling the AI? Are filters enough?

The problem of AIs generating toxic language is nicely shown by this web page on Evaluating Neural Toxic Degeneration in Language Models. The interactives and graphs on the page let you see how toxic language can be generated by many of the popular language generation AIs. The problem seems to lie in the data sets used to train the machines, such as those that include scrapes of Reddit.

This exploratory tool illustrates research reported on in a paper titled RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. You can see a neat visualization of the connected papers here.

Celebrating Stéfan Sinclair: A Dialogue from 2007

Sadly, last Thursday Stéfan Sinclair passed away. A group of us posted an obituary for CSDH-SCHN here, Stéfan Sinclair, In Memoriam, and boy do I miss him already. While the obituary describes the arc of his career, I’ve been trying to think of how to celebrate how he loved to play with ideas and code. The obituary tells the what of his life but doesn’t show the how.

You see, Stéfan loved to toy with ideas of text through the development of software toys. The hermeneuti.ca project started with a one day text analysis vacation/hackathon. We decided to leave all the busy work of being an academic in our offices, and spend a day in the TAPoR lab at McMaster. We decided to mess around and try the analytical equivalent of extreme programming. That included a version of “pair programming” where we alternated one at the keyboard doing the analysis while the other would take notes and direct. We told ourselves we would just devote one day without interruptions to this folly and see if together we could take a project from conception to some sort of finished result in a day.

Little did we know we would still be at play right until a few weeks ago. We failed to finish that day, but we got far enough to know we enjoyed the fooling around enough to do it again and again. Those escapes into what we later called agile hermeneutics, to give it a serious name, eventually led to a monster of a project that reflected back on the play. The project culminated in the jointly authored book Hermeneutica (MIT Press, 2016) and Voyant 2.0, both of which tried to not only think-through some of the potential of the play, but also give others a way of making their own interpretative toys (which we called hermeneutica). But these too are perhaps too serious to commemorate Stéfan’s presence.

Which brings me to the dialogue we wrote and performed on “Reading Tools.” Thanks to Susan I was reminded of this script that we acted out at the University of Illinois, Urbana-Champaign in June of 2007. May it honour how Stéfan would want to be remembered. Imagine him smiling at the front of the room as he starts,

Sinclair: Why do we care so much for the opinions of other humanists? Why do we care so much whether they use computing in the humanities?

Rockwell: Let me tell you an old story. There was once a titan who invented an interpretative technology for his colleagues. No, … he wasn’t chained to a rock to have his liver chewed out daily. … Instead he did the smart thing and brought it to his dean, convinced the technology would free his colleagues from having to interpret texts and let them get back to the real work of thinking.

Sinclair: I imagine his dean told him that in the academy those who develop tools are not the best judges of their inventions and that he had to get his technology reviewed as if it were a book.

Rockwell: Exactly, and the dean said, “And in this instance, you who are the father of a text technology, from a paternal love of your own children have been led to attribute to them a quality which they cannot have; for this discovery of yours will create forgetfulness in the learners’ souls, because they will not study the old ways; they will trust to the external tools and not interpret for themselves. The technology which you have discovered is an aid not to interpretation, but to online publishing.”

Sinclair: Yes, Geoffrey, you can easily tell jokes about the academy, paraphrasing Socrates, but we aren’t outside the city walls of Athens, but in the middle of Urbana at a conference. We have a problem of audience – we are slavishly trying to please the other – that undigitized humanist – why don’t we build just for ourselves? …

Enjoy the full dialogue here: Reading Tools Script (PDF).

Google Developers Blog: Text Embedding Models Contain Bias. Here’s Why That Matters.

Human data encodes human biases by default. Being aware of this is a good start, and the conversation around how to handle it is ongoing. At Google, we are actively researching unintended bias analysis and mitigation strategies because we are committed to making products that work well for everyone. In this post, we’ll examine a few text embedding models, suggest some tools for evaluating certain forms of bias, and discuss how these issues matter when building applications.

On the Google Developers Blog there is an interesting post on Text Embedding Models Contain Bias. Here’s Why That Matters. The post talks about a technique for using Word Embedding Association Tests (WEAT) to compare different text embedding algorithms. The idea is to see whether groups of words, like gendered words, associate with positive or negative words. In the image above you can see the sentiment bias for female and male names for different techniques.

While Google is working on WEAT to try to detect and deal with bias, in our case this technique could be used to identify forms of bias in corpora.
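To make the idea concrete, here is a minimal sketch of a WEAT-style effect size. It is not the code from the Google post: the function names and the tiny two-dimensional "embeddings" are invented for illustration, and I am assuming the usual formulation where the difference in mean association between two target sets is divided by the (sample) standard deviation over both sets. With a real corpus you would substitute vectors trained on that corpus.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """How much closer w sits to attribute set A than to attribute set B."""
    return mean([cosine(w, a) for a in attrs_a]) - mean([cosine(w, b) for b in attrs_b])


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference in mean association of X and Y, scaled by the spread over X and Y."""
    sx = [association(x, attrs_a, attrs_b) for x in targets_x]
    sy = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)


# Toy vectors, made up for illustration only (not from any trained model).
names_x = [[0.9, 0.1], [0.8, 0.3]]   # hypothetical first group of names
names_y = [[0.1, 0.9], [0.3, 0.8]]   # hypothetical second group of names
pleasant = [[1.0, 0.0]]              # hypothetical "positive" attribute words
unpleasant = [[0.0, 1.0]]            # hypothetical "negative" attribute words

effect = weat_effect_size(names_x, names_y, pleasant, unpleasant)
print(f"effect size: {effect:.2f}")
```

A positive value means the first group of names sits closer to the "positive" words than the second group does; a value near zero suggests no measurable difference on this test.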

260,000 Words, Full of Self-Praise, From Trump on the Virus

The New York Times has a nice content analysis study of Trump’s Coronavirus briefings, 260,000 Words, Full of Self-Praise, From Trump on the Virus. They tagged the corpus for different types of utterances including:

Self-congratulations
Exaggerations and falsehoods
Displays of empathy or appeals to national unity
Blaming others
Credits others

Needless to say they found he spent a fair amount of time congratulating himself.

They then created neat visualizations with colour-coded sections showing where he shows empathy or congratulates himself.

According to the article they looked at 42 briefings or other remarks from March 9 to April 17, 2020 giving them a total of 260,000 words.

I decided to replicate their study with Voyant and I gathered 29 Coronavirus Task Force Briefings (and one Press Conference) from February 29 to April 17. These are all the Task Force Briefings I could find at the White House web site. The corpus has 418,775 words, but those include remarks by people other than Trump, questions, and metadata.

Some of the things that struck me are the absence of medical terminology in the high frequency words. I was also intrigued by the prominence of “going to”. Trump spends a fair amount of time talking about what he and others are going to be doing rather than what is done. Here you have a Contexts panel from Voyant.
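For readers who want to try this outside Voyant, a Contexts panel is essentially a keyword-in-context (KWIC) listing. The sketch below is my own rough approximation, not how Voyant implements it; the function name `kwic`, the `width` parameter, and the sample sentences are all invented for illustration (the sentences are not quotations from the briefings).

```python
import re


def kwic(text, phrase, width=30):
    """Return (left, match, right) triples for each occurrence of phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    rows = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


# Invented sample text standing in for a transcript.
sample = (
    "We are going to test more people next week. "
    "The supplies are Going to arrive soon, and they are going to be used."
)

rows = kwic(sample, "going to", width=20)
for left, match, right in rows:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {match} | {right}")
```

Reading down the keyword column makes it easy to see what follows "going to", which is how the pattern of promises about the future stood out to me.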

Embedded Voyant panel

This post is a demonstration of how a Voyant panel or hermeneutica can be embedded in a WordPress post. See our Voyant tutorials at dialogi.ca.

To embed the panel I created a custom HTML block. In it I pasted the <iframe> element exported from the Voyant panel I wanted. While editing I see the HTML code; when I Preview (either the block or the whole post) or publish, I see the Voyant panel in place. Try playing with it!

Welcome to Dialogica: Thinking-Through Voyant!

Do you need online teaching ideas and materials? Dialogica was supposed to be a text book, but instead we are adapting it for use in online learning and self-study. It is shared here under a CC BY 4.0 license so you can adapt as needed.

Stéfan Sinclair and I have put up a web site with tutorial materials for learning Voyant. See Dialogi.ca: Thinking-Through Voyant!.

Dialogica (http://dialogi.ca) plays with the idea of learning through a dialogue. A dialogue with the text; a dialogue mediated by the tool; and a dialogue with instructors like us.

Dialogica is made up of a set of tutorials that students should be able to work through alone or with minimal support. These are Word documents that you (instructors) can edit to suit your teaching and we are adding to them. We have added a gloss of teaching notes. Later we plan to add Spyral notebooks that go into greater detail on technical subjects, including how to program in Spyral.

Dialogica is made available with a CC BY 4.0 license so you can do what you want with it as long as you give us some sort of credit.

Show and Tell at CRIHN

Stéphane Pouyllau’s photo of me presenting

Michael Sinatra invited me to a “show and tell” workshop at the new Université de Montréal campus where they have a long data wall. Sinatra is the Director of CRIHN (Centre de recherche interuniversitaire sur les humanités numériques) and kindly invited me to show what I am doing with Stéfan Sinclair and to see what others at CRIHN and in France are doing.


Conference notes for CSDH 2019

In early June I was at the Congress for the Humanities and Social Sciences. I took conference notes on the Canadian Society for Digital Humanities 2019 event and on the Canadian Game Studies Association conference, 2019. I was involved in a number of papers:

Exploring through Markup: Recovering COCOA. This paper looked at an experimental Voyant tool that allows one to use COCOA markup as a way of exploring a text in different ways. COCOA markup is a simple form of markup that was superseded by XML languages like those developed with the TEI. The paper recovered some of the history of markup and what we may have lost.
Designing for Sustainability: Maintaining TAPoR and Methodi.ca. This paper was presented by Holly Pickering and discussed the processes we have set up to maintain TAPoR and Methodi.ca.
Our team also had two posters, one on “Generative Ethics: Using AI to Generate” that showed a toy that generates statements about artificial intelligence and ethics. The other, “Discovering Digital Methods: An Exploration of Methodica for Humanists” showed what we are doing with Methodi.ca.
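To give a sense of what exploring through COCOA markup (the subject of the first paper above) can mean, here is a small sketch. It is not the experimental Voyant tool. I am assuming a simplified reading of COCOA-style tags of the form `<CODE value>`, where each tag sets a category that stays in force until the same code appears again; the function name `parse_cocoa` and the sample text are invented for illustration.

```python
import re

# Simplified assumption: a tag is <CODE value>, e.g. <S Alice> for a speaker.
TAG = re.compile(r"<(\w+)\s+([^>]*)>")


def parse_cocoa(text):
    """Split text into (categories, chunk) pairs using the tags in force."""
    state = {}      # current value for each category code
    segments = []
    pos = 0
    for m in TAG.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            segments.append((dict(state), chunk))
        state[m.group(1)] = m.group(2).strip()  # a new tag overrides the old value
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((dict(state), tail))
    return segments


# Invented sample: T = title, S = speaker.
sample = "<T Sample Play><S Alice>Good morrow. <S Bob>And to you."

segments = parse_cocoa(sample)
for categories, chunk in segments:
    print(categories.get("S"), "->", chunk)
```

Because every chunk carries the full set of categories in force, the same text can be regrouped by speaker, by title, or by any other code without changing the markup.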

JSTOR Text Analyzer

JSTOR, and some other publishers of electronic research, have started building text analysis tools into their publishing tools. I came across this at the end of a JSTOR article where there was a link to “Get more results on Text Analyzer”, which leads to a beta of the JSTOR Labs Text Analyzer environment.

This analyzer environment provides simple analytical tools for surveying an issue of a journal or an article. The emphasis is on extracting keywords and entities so that one can figure out if an article or journal is useful. One can also use this to find other similar things.

What intrigues me is this embedding of tools into reading environments which is different from the standard separate data and tools model. I wonder how we could instrument Voyant so that it could be more easily embedded in other environments.