Human data encodes human biases by default. Being aware of this is a good start, and the conversation around how to handle it is ongoing. At Google, we are actively researching unintended bias analysis and mitigation strategies because we are committed to making products that work well for everyone. In this post, we’ll examine a few text embedding models, suggest some tools for evaluating certain forms of bias, and discuss how these issues matter when building applications.
On the Google Developers Blog there is an interesting post, Text Embedding Models Contain Bias. Here’s Why That Matters. The post describes a technique for using Word Embedding Association Tests (WEAT) to compare different text embedding algorithms. The idea is to see whether groups of words, such as gendered words, associate with positive or negative words. In the image above you can see the sentiment bias for female and male names for different techniques.
While Google is working on WEAT to try to detect and deal with bias, in our case this technique could be used to identify forms of bias in corpora.
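To make the idea concrete, here is a minimal sketch of the WEAT effect size as it is usually defined: the difference in mean association between two groups of target words, divided by the standard deviation of the associations over both groups. This is not the code from the Google post. The two-dimensional vectors are made-up toy values standing in for real embeddings, and the function names are my own.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """How much closer w sits to attribute set A than to attribute set B."""
    mean_a = statistics.mean([cosine(w, a) for a in attrs_a])
    mean_b = statistics.mean([cosine(w, b) for b in attrs_b])
    return mean_a - mean_b


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations, scaled by the sample standard deviation."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (statistics.mean(s_x) - statistics.mean(s_y)) / statistics.stdev(s_x + s_y)


# Toy vectors (assumed values, not taken from any real embedding model).
pleasant = [[1.0, 0.0], [0.9, 0.1]]
unpleasant = [[0.0, 1.0], [0.1, 0.9]]
group_x = [[0.8, 0.2], [0.7, 0.3]]  # stands in for one set of names
group_y = [[0.2, 0.8], [0.3, 0.7]]  # stands in for the other set of names

effect = weat_effect_size(group_x, group_y, pleasant, unpleasant)
print(f"WEAT effect size: {effect:.3f}")
```

A positive value means the first group sits closer to the "pleasant" words than the second group does. To look for bias in a corpus, one would train embeddings on that corpus and feed in the real vectors in place of the toy ones.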
Analyzing the Twitter Conversation Surrounding COVID-19
From Twitter I found out about this excellent visual essay on The Viral Virus by Kate Appel from May 6, 2020. Appel used Voyant to study highly retweeted tweets from January 20th to April 23rd. She divided the tweets into weeks and then used the distinctive words (tf-idf) tool to tell a story about the changing discussion about Covid-19. As you scroll down you see lists of distinctive words and supporting images. At the end she shows some of the topics gained from topic modelling. It is a remarkably simple, but effective use of Voyant.
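For readers curious about what a distinctive words list is doing, here is a rough sketch of tf-idf over weekly batches of text. It is not Voyant's implementation, and Voyant's exact weighting may differ. The "weeks" are invented toy strings, not real tweets, and the function name is hypothetical.

```python
import math
from collections import Counter


def distinctive_words(docs, top_n=3):
    """Rank each document's words by tf-idf, treating each week as one document."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}

    # Document frequency: in how many weeks does each word appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    result = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[name] = ranked[:top_n]
    return result


# Invented toy data standing in for weekly batches of tweets.
weeks = {
    "week1": "virus outbreak wuhan virus travel",
    "week2": "virus lockdown lockdown masks",
    "week3": "virus masks reopening reopening testing",
}

top_words = distinctive_words(weeks)
for week, words in top_words.items():
    print(week, words)
```

Note that a word like "virus", which appears in every week, scores zero however often it is used. That is why the distinctive words change from week to week even when the overall topic stays the same.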
One of the things that struck me is the absence of medical terminology among the high frequency words. I was also intrigued by the prominence of “going to”. Trump spends a fair amount of time talking about what he and others are going to be doing rather than what is done. Here you have a Contexts panel from Voyant.
This post is a demonstration of how a Voyant panel or hermeneutica can be embedded in a WordPress post. See our Voyant tutorials at dialogi.ca.
To embed the panel I created a custom HTML block. In it I pasted the <iframe> element exported from the Voyant panel I wanted. While editing I see the HTML code; when I preview (either the block or the whole post) or publish, I see the Voyant panel in place. Try playing with it!
Do you need online teaching ideas and materials? Dialogica was supposed to be a textbook, but instead we are adapting it for use in online learning and self-study. It is shared here under a CC BY 4.0 license so you can adapt as needed.
Dialogica (http://dialogi.ca) plays with the idea of learning through a dialogue. A dialogue with the text; a dialogue mediated by the tool; and a dialogue with instructors like us.
Dialogica is made up of a set of tutorials that students should be able to complete alone or with minimal support. These are Word documents that you (instructors) can edit to suit your teaching, and we are adding to them. We have added a gloss of teaching notes. Later we plan to add Spyral notebooks that go into greater detail on technical subjects, including how to program in Spyral.
Dialogica is made available with a CC BY 4.0 license so you can do what you want with it as long as you give us some sort of credit.
Michael Sinatra invited me to a “show and tell” workshop at the new Université de Montréal campus where they have a long data wall. Sinatra is the Director of CRIHN (Centre de recherche interuniversitaire sur les humanités numériques) and kindly invited me to show what I am doing with Stéfan Sinclair and to see what others at CRIHN and in France are doing.
Exploring through Markup: Recovering COCOA. This paper looked at an experimental Voyant tool that allows one to use COCOA markup as a way of exploring a text in different ways. COCOA markup is a simple form of markup that was superseded by XML languages like those developed with the TEI. The paper recovered some of the history of markup and what we may have lost.
Designing for Sustainability: Maintaining TAPoR and Methodi.ca. This paper was presented by Holly Pickering and discussed the processes we have set up to maintain TAPoR and Methodi.ca.
Our team also had two posters, one on “Generative Ethics: Using AI to Generate” that showed a toy that generates statements about artificial intelligence and ethics. The other, “Discovering Digital Methods: An Exploration of Methodica for Humanists” showed what we are doing with Methodi.ca.
JSTOR, and some other publishers of electronic research, have started building text analysis tools into their publishing tools. I came across this at the end of a JSTOR article where there was a link to “Get more results on Text Analyzer” which leads to a beta of the JSTOR labs Text Analyzer environment.
This analyzer environment provides simple analytical tools for surveying an issue of a journal or an article. The emphasis is on extracting keywords and entities so that one can figure out if an article or journal is useful. One can also use it to find other similar things.
What intrigues me is this embedding of tools into reading environments which is different from the standard separate data and tools model. I wonder how we could instrument Voyant so that it could be more easily embedded in other environments.
The history is not the heroic story of personal computing that I was raised on. It is a story of how women were driven out of computing (both the academy and businesses) starting in the 1960s.
A group of us at the U of Alberta are working on archiving the work of Sally Sedelow, one of the forgotten pioneers of humanities computing. Dr. Sedelow got her PhD in English in 1960 and did important early work on text analysis systems.
Paolo showed me a neat demonstration of Word2Vec Vis of Pride and Prejudice. Lynn Cherny trained a Word2Vec model using Jane Austen’s novels and then used that to find close matches for key words. She then shows the text of a novel with the words replaced by their match in the language of Austen. It serves as a sort of demonstration of how Word2Vec works.
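The core move, finding the closest match for a word and swapping it into the text, can be sketched in a few lines. This is not Cherny's code. The vectors below are hand-made toy values rather than the output of a trained Word2Vec model, and the function names are my own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_other(word, vectors):
    """Return the word whose vector is closest to `word`, excluding `word` itself."""
    candidates = [w for w in vectors if w != word]
    return max(candidates, key=lambda w: cosine(vectors[word], vectors[w]))


def replace_with_matches(sentence, vectors):
    """Swap every word that has a vector for its nearest neighbour; keep the rest."""
    out = []
    for token in sentence.split():
        key = token.lower()
        out.append(nearest_other(key, vectors) if key in vectors else token)
    return " ".join(out)


# Hand-made toy vectors (assumed values); a real model would supply
# vectors with hundreds of dimensions learned from the novels.
toy_vectors = {
    "carriage": [1.0, 0.0],
    "coach": [0.9, 0.1],
    "letter": [0.0, 1.0],
    "note": [0.1, 0.9],
}

rewritten = replace_with_matches("the carriage brought a letter", toy_vectors)
print(rewritten)
```

Words without a vector are left alone, so the sentence keeps its shape while the key words drift to their neighbours.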