Google Developers Blog: Text Embedding Models Contain Bias. Here’s Why That Matters.

Human data encodes human biases by default. Being aware of this is a good start, and the conversation around how to handle it is ongoing. At Google, we are actively researching unintended bias analysis and mitigation strategies because we are committed to making products that work well for everyone. In this post, we’ll examine a few text embedding models, suggest some tools for evaluating certain forms of bias, and discuss how these issues matter when building applications.

On the Google Developers Blog there is an interesting post, Text Embedding Models Contain Bias. Here’s Why That Matters. The post describes a technique that uses Word Embedding Association Tests (WEAT) to compare different text embedding algorithms. The idea is to see whether groups of words, such as gendered words, associate with positive or negative words. In the image above you can see the sentiment bias for female and male names for different techniques.

While Google is working on WEAT to try to detect and deal with bias, in our case this technique could be used to identify forms of bias in corpora.
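To make the idea concrete, here is a minimal sketch of the WEAT effect size in plain Python. It follows the commonly published definition (difference in mean association between two target sets, divided by the standard deviation over all target words). The two-dimensional vectors are invented toy values, not real embeddings, and the function names are my own, so treat this as an illustration rather than the code used in the Google post.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def association(w, A, B):
    """How much closer w is to attribute set A than to attribute set B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Effect size: difference of mean associations over a pooled std deviation."""
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    # Sample standard deviation over all target words (an assumption;
    # some implementations use the population version).
    return (mean(sx) - mean(sy)) / stdev(sx + sy)

# Toy vectors (hypothetical): X and Y stand in for two groups of names,
# A and B for "pleasant" and "unpleasant" attribute words.
X = [(1.0, 0.1), (0.9, 0.2)]
Y = [(0.1, 1.0), (0.2, 0.9)]
A = [(1.0, 0.0)]
B = [(0.0, 1.0)]

effect = weat_effect_size(X, Y, A, B)
print(f"WEAT effect size: {effect:.3f}")
```

A positive value means the X words sit closer to A than the Y words do. To look for bias in a corpus one would train embeddings on that corpus and substitute the real word vectors for the toy ones.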

The Viral Virus

Relative Frequency of word “test*” over time

Analyzing the Twitter Conversation Surrounding COVID-19

From Twitter I found out about this excellent visual essay on The Viral Virus by Kate Appel from May 6, 2020. Appel used Voyant to study highly retweeted tweets from January 20th to April 23rd. She divided the tweets into weeks and then used the distinctive words (tf-idf) tool to tell a story about the changing discussion of Covid-19. As you scroll down you see lists of distinctive words and supporting images. At the end she shows some of the topics produced by topic modelling. It is a remarkably simple but effective use of Voyant.
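For readers who want to see what a distinctive words measure does, here is a minimal tf-idf sketch in Python. The weekly "documents" are invented placeholders, not Appel's tweets, and the formula is the textbook one (term frequency times the log of inverse document frequency); Voyant's exact weighting may differ.

```python
import math
from collections import Counter

def distinctive_words(docs, top_n=3):
    """Return the top_n highest tf-idf words for each named document."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    n_docs = len(tokenized)
    result = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
        # Sort by score (descending), then alphabetically to break ties.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[name] = ranked[:top_n]
    return result

# Invented toy data standing in for weekly batches of tweets.
weeks = {
    "week 1": "virus wuhan virus outbreak china",
    "week 8": "virus lockdown masks lockdown home",
    "week 12": "virus test test masks reopen",
}

result = distinctive_words(weeks)
for week, words in result.items():
    print(week, words)
```

A word like "virus" that appears in every week gets a score of zero, which is why the lists surface what is particular to each week rather than what is frequent overall.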

260,000 Words, Full of Self-Praise, From Trump on the Virus

The New York Times has a nice content analysis study of Trump’s Coronavirus briefings, 260,000 Words, Full of Self-Praise, From Trump on the Virus. They tagged the corpus for different types of utterances including:

  • Self-congratulations
  • Exaggerations and falsehoods
  • Displays of empathy or appeals to national unity
  • Blaming others
  • Credits others

Needless to say they found he spent a fair amount of time congratulating himself.

They then created neat visualizations with colour-coded sections showing where he shows empathy or congratulates himself.

According to the article they looked at 42 briefings or other remarks from March 9 to April 17, 2020 giving them a total of 260,000 words.

I decided to replicate their study with Voyant and I gathered 29 Coronavirus Task Force Briefings (and one Press Conference) from February 29 to April 17. These are all the Task Force Briefings I could find at the White House web site. The corpus has 418,775 words, but those include remarks by people other than Trump, questions, and metadata.

One of the things that struck me is the absence of medical terminology among the high-frequency words. I was also intrigued by the prominence of “going to”. Trump spends a fair amount of time talking about what he and others are going to be doing rather than what has been done. Here you have a Contexts panel from Voyant.
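A Contexts panel is essentially a keyword-in-context (KWIC) listing. As a rough sketch of the idea, and not of how Voyant implements it, the following Python finds each occurrence of a phrase and shows a few words on either side. The sample sentence is made up for illustration and is not from the briefings.

```python
def contexts(text, phrase, width=3):
    """Return (left, match, right) rows for each occurrence of phrase."""
    tokens = text.split()
    target = phrase.lower().split()
    rows = []
    for i in range(len(tokens) - len(target) + 1):
        # Compare case-insensitively, ignoring surrounding punctuation.
        window = [t.lower().strip('.,;:!?"') for t in tokens[i:i + len(target)]]
        if window == target:
            left = " ".join(tokens[max(0, i - width):i])
            match = " ".join(tokens[i:i + len(target)])
            right = " ".join(tokens[i + len(target):i + len(target) + width])
            rows.append((left, match, right))
    return rows

# Made-up sample text.
sample = "We are going to test the tool. Then we are going to read the results."

rows = contexts(sample, "going to")
for left, match, right in rows:
    print(f"{left:>15} | {match} | {right}")
```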

Embedded Voyant panel

This post is a demonstration of how a Voyant panel or hermeneutica can be embedded in a WordPress post. See our Voyant tutorials at dialogi.ca.

To embed the panel I created a custom HTML block. In it I pasted the <iframe> element exported from the Voyant panel I wanted. While editing I see the HTML code; when I preview (either the block or the whole post) or publish, I see the Voyant panel in place. Try playing with it!

Welcome to Dialogica: Thinking-Through Voyant!

Do you need online teaching ideas and materials? Dialogica was supposed to be a textbook, but instead we are adapting it for use in online learning and self-study. It is shared here under a CC BY 4.0 license so you can adapt it as needed.

Stéfan Sinclair and I have put up a web site with tutorial materials for learning Voyant. See Dialogi.ca: Thinking-Through Voyant!.

Dialogica (http://dialogi.ca) plays with the idea of learning through a dialogue. A dialogue with the text; a dialogue mediated by the tool; and a dialogue with instructors like us.

Dialogica is made up of a set of tutorials that students should be able to do alone or with minimal support. These are Word documents that you (instructors) can edit to suit your teaching, and we are adding to them. We have added a gloss of teaching notes. Later we plan to add Spyral notebooks that go into greater detail on technical subjects, including how to program in Spyral.

Dialogica is made available with a CC BY 4.0 license so you can do what you want with it as long as you give us some sort of credit.

Show and Tell at CRIHN


Stéphane Pouyllau’s photo of me presenting

Michael Sinatra invited me to a “show and tell” workshop at the new Université de Montréal campus where they have a long data wall. Sinatra is the Director of CRIHN (Centre de recherche interuniversitaire sur les humanités numériques) and kindly invited me to show what I am doing with Stéfan Sinclair and to see what others at CRIHN and in France are doing.


Conference notes for CSDH 2019

In early June I was at the Congress for the Humanities and Social Sciences. I took conference notes on the Canadian Society for Digital Humanities 2019 event and on the Canadian Game Studies Association conference, 2019. I was involved in a number of papers:

  • Exploring through Markup: Recovering COCOA. This paper looked at an experimental Voyant tool that allows one to use COCOA markup as a way of exploring a text in different ways. COCOA markup is a simple form of markup that was superseded by XML languages like those developed with the TEI. The paper recovered some of the history of markup and what we may have lost.

  • Designing for Sustainability: Maintaining TAPoR and Methodi.ca. This paper was presented by Holly Pickering and discussed the processes we have set up to maintain TAPoR and Methodi.ca.

  • Our team also had two posters, one on “Generative Ethics: Using AI to Generate” that showed a toy that generates statements about artificial intelligence and ethics. The other, “Discovering Digital Methods: An Exploration of Methodica for Humanists” showed what we are doing with Methodi.ca.

JSTOR Text Analyzer

JSTOR, and some other publishers of electronic research, have started building text analysis tools into their publishing tools. I came across this at the end of a JSTOR article where there was a link to “Get more results on Text Analyzer”, which leads to a beta of the JSTOR Labs Text Analyzer environment.

JSTOR Labs Text Analyzer

This analyzer environment provides simple analytical tools for surveying an issue of a journal or an article. The emphasis is on extracting keywords and entities so that one can figure out if an article or journal is useful. One can also use this to find other similar things.

Results of Text Analyzer

What intrigues me is this embedding of tools into reading environments, which is different from the standard model of separate data and tools. I wonder how we could instrument Voyant so that it could be more easily embedded in other environments.

The Secret History of Women in Coding

Computer programming once had much better gender balance than it does today. What went wrong?

The New York Times has a nice long article on The Secret History of Women in Coding. We know a lot of the story from books like Campbell-Kelly’s From Airline Reservations to Sonic the Hedgehog: a History of the Software Industry (2003), Chang’s Brotopia (2018), and Rankin’s A People’s History of Computing in the United States (2018).

The history is not the heroic story of personal computing that I was raised on. It is a story of how women were driven out of computing (both the academy and businesses) starting in the 1960s.

A group of us at the U of Alberta are working on archiving the work of Sally Sedelow, one of the forgotten pioneers of humanities computing. Dr. Sedelow got her PhD in English in 1960 and did important early work on text analysis systems.

Word2Vec Vis of Pride and Prejudice

Paolo showed me a neat demonstration, Word2Vec Vis of Pride and Prejudice. Lynn Cherny trained a Word2Vec model using Jane Austen’s novels and then used that to find close matches for key words. She then shows the text of a novel with the words replaced by their match in the language of Austen. It serves as a sort of demonstration of how Word2Vec works.
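The replacement step can be sketched in a few lines of Python. This is not Cherny's code and it does not train a Word2Vec model: the tiny two-dimensional vectors below are invented stand-ins for a trained model, and the function names are hypothetical. It only illustrates the idea of swapping each known word for its nearest neighbour by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(word, vectors):
    """Return the word (other than itself) whose vector is most similar."""
    best, best_sim = None, -2.0
    for other, vec in vectors.items():
        if other == word:
            continue
        sim = cosine(vectors[word], vec)
        if sim > best_sim:
            best, best_sim = other, sim
    return best

def replace_with_matches(text, vectors):
    """Swap every word found in the model for its closest match."""
    out = []
    for token in text.split():
        out.append(nearest(token, vectors) if token in vectors else token)
    return " ".join(out)

# Invented toy vectors standing in for a model trained on Austen.
toy_vectors = {
    "man": (1.0, 0.1),
    "gentleman": (0.9, 0.2),
    "wife": (0.1, 1.0),
    "lady": (0.2, 0.9),
}

rewritten = replace_with_matches("a man wants a wife", toy_vectors)
print(rewritten)
```

Words that are not in the model are left unchanged, which is one simple way to handle vocabulary gaps.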