Crowdsourcing – Theoreti.ca

Declaration of Independence – First E-Text

Project Gutenberg and the Declaration of Independence

I came across a blog post about how Michael S. Hart, the founder of Project Gutenberg, started in 1971 by typing the Declaration of Independence into the ARPANET and sending it to others. See 50 Years at Project Gutenberg.

Forty-Five Years of Digitizing Ebooks: Project Gutenberg’s Practices by Gregory B. Newby is a longer piece on the history of Project Gutenberg’s processes.

Hart passed away in 2011. Gregory B. Newby just passed away this October. The Project, however, seems to be in good hands with a foundation and board.

The Lives of Literary Characters

The goal of this project is to generate knowledge about the behaviour of literary characters at large scale and make this data openly available to the public. Characters are the scaffolding of great storytelling. This Zooniverse project will allow us to crowdsource data to train AI models to better understand who characters are and what they do within diverse narrative worlds to answer one very big question: why do human beings tell stories?

Today we are going live on Zooniverse with our Citizen Science (crowdsourcing) project, The Lives of Literary Characters. The goal of the project is to offer micro-tasks that allow volunteers to annotate literary passages and so help create training data. It will be interesting to see if we get a decent number of volunteers.

Before setting this up we did some serious reading around the ethics of crowdsourcing as we didn’t want to just exploit readers.

A Bored Chinese Housewife Spent Years Falsifying Russian History on Wikipedia

She “single-handedly invented a new way to undermine Wikipedia,” says a Wikipedian.

From Vice comes a rather funny story about how A Bored Chinese Housewife Spent Years Falsifying Russian History on Wikipedia. User Zhemao wrote hundreds of linked articles in the Chinese version of Wikipedia about fictional events, peoples and places in Russian history. Only recently did someone notice. It shows a vulnerability of such crowdsourced resources: a fabulist can create a network of consistent fictions that, by supporting each other, look true.

MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs

Vinay Prabhu, chief scientist at UnifyID, a privacy startup in Silicon Valley, and Abeba Birhane, a PhD candidate at University College Dublin in Ireland, pored over the MIT database and discovered thousands of images labelled with racist slurs for Black and Asian people, and derogatory terms used to describe women. They revealed their findings in a paper undergoing peer review for the 2021 Workshop on Applications of Computer Vision conference.

Another one of those “what were they thinking when they created the dataset” stories, this one from The Register, tells how MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs. The MIT Tiny Images dataset was created automatically using scripts that drew on the WordNet database of terms, which itself held derogatory terms. Nobody thought to check either the terms taken from WordNet or the resulting images scraped from the net. As a result there are not only lots of images for which permission was not secured, but also racist, sexist, and otherwise derogatory labels on the images, which in turn means that if you train an AI on these it will generate racist or sexist results.

The article also mentions a general problem with academic datasets. Companies like Facebook can afford to hire actors to pose for images and can thus secure permissions to use the images for training. Academic datasets (and some commercial ones like the Clearview AI database) tend to be scraped and therefore will not have the explicit permission of the copyright holders or people shown. In effect, academics are resorting to mass surveillance to generate training sets. One wonders if we could crowdsource a training set by and for people?

Digital Synergies Launch Event

Today I gave a short talk at the Digital Synergies Launch Event. The launch included neat talks by colleagues including:

Nicolás Arnaez talked about and showed The Lost Garden, an audio game project led by Scott Smallwood.
Dr. Rob McMahon & Amanda Almond talked about a neat augmented reality project that engaged with Indigenous relations titled We Are All Related AR.
Dr. Astrid Ensslin talked about her recently published book, Approaches to Videogame Discourse.
Yourui Guo talked about the Sounding the Garden app for the multisensory Aga Khan Garden.

I showed and talked about Lexigraphi.ca – The Dictionary of Words in the Wild. This is a social site where people can upload pictures of text outside of books and documents and tag the words: text like tattoos, graffiti, store signs and other forms of public textuality.

On the Benefits of Failure 2

Lynne Siemens and Ray Siemens gave the final keynote of the On the Benefits of Failure conference. Their talk was titled “Training Ground for Success? Perspectives on Failure in Several Contexts.”


Transcrire: Crowdsourced Transcription

I just came across a great French project called Transcrire. The Huma-Num Very Large Facility has built a system for the crowdsourcing of transcription of archival materials. It looks like they have built infrastructure for crowdsourcing (or citizen science) in the humanities. Having played around with it, I found it looks very professional.

Science 2.0 and Citizen Research

This week I attended the second Science 2.0 conference held in Hamburg, Germany. (You can see my research notes here.) The conference dealt with issues around open access, open data, citizen science, and network-enabled science. I was one of two Canadian digital humanists presenting. Matthew Hiebert from the University of Victoria talked about the social edition and work from the Electronic Textual Cultures Lab and Iter. It should be noted that in Europe the word “science” is more inclusive and can include the humanities. This conference wasn’t just about how open data and crowdsourcing could help the natural sciences – it was about how research across the disciplines could be supported with virtual labs and infrastructure.

I gave a paper on “New Publics for the Humanities” that started by noting that the humanities no longer engage the public. The social contract with the public that supports us has been neglected. I worry that if the university is disaggregated and the humanities unbundled from the other faculties (the way newspapers have been hit by the internet and the unbundling of services) then people will stop paying for the humanities and much of the research we do. We will end up with cheaper, research-poor colleges that provide lots of higher education without the research, or climbing walls. Only in the elite private universities will the humanities survive, and in those they will survive as a marker of class status. You will be able to study ancient languages at elite schools because any degree from an elite school is good.

Of course, the humanities will survive outside the university, and may become healthier with the downsizing of the professional (or professorial) humanities, but we run the danger of unthinkingly losing a long tradition of thinking critically and ethically. An irony to be sure – losing thinking traditions through the lack of public reflection on the consequences of disruptive change.

Drawing on Greg Crane, I then argued that citizen research (forms of crowdsourcing) can re-engage the publics we need to support us and reflect with us. Citizen research can provide an alternative way of structuring research in anticipation of the defunding of the humanities research function. I illustrated my point by showing a number of examples of humanities crowdsourcing projects, from the OED (pre-computer volunteer research) to the Dictionary of Words in the Wild. If I can find the time I will write up the argument to see where it goes.

My talk was followed by a thorough one on citizen science in environmental studies by Professor Aletta Bonn of the Citizens create knowledge project, a German platform for citizen science. We need to learn from people like Dr. Bonn who are studying and experimenting with the deployment of citizen research. One point she made was the importance of citizen co-design. Most projects enlist citizens in repetitive micro-tasks designed by researchers. What if the research project were designed from the beginning with citizens? What would that mean? How would that work?