The goal of this project is to generate knowledge about the behaviour of literary characters at large scale and make this data openly available to the public. Characters are the scaffolding of great storytelling. This Zooniverse project will allow us to crowdsource data to train AI models to better understand who characters are and what they do within diverse narrative worlds to answer one very big question: why do human beings tell stories?
Today we are going live on Zooniverse with our Citizen Science (crowdsourcing) project, The Lives of Literary Characters. The goal of the project is to offer micro-tasks that allow volunteers to annotate literary passages, which in turn gives us annotated training data. It will be interesting to see if we get a decent number of volunteers.
Before setting this up we did some serious reading around the ethics of crowdsourcing as we didn’t want to just exploit readers.
A top concern for the Times is that ChatGPT is, in a sense, becoming a direct competitor with the paper by creating text that answers questions based on the original reporting and writing of the paper’s staff.
It remains to be seen what the legalities are. Does using a text in order to train a model constitute the making of a copy in violation of copyright? Does the model contain something equivalent to a copy of the original? These issues are being explored in the AI image generating space where Stability AI is being sued by Getty Images. I hope the New York Times doesn’t just settle quietly before there is a public airing of the issues around the exploitation/ownership of written work. I also note that the Authors Guild is starting to advocate on behalf of authors:
“It says it’s not fair to use our stuff in your AI without permission or payment,” said Mary Rasenberger, CEO of The Authors Guild. The non-profit writers’ advocacy organization created the letter, and sent it out to the AI companies on Monday. “So please start compensating us and talking to us.”
This could also have repercussions in academia as many of us scrape the web and social media when studying contemporary issues. For that matter what do we think about the use of our work? One could say that our work, supported as it is by the public, should be fair game for gathering, training and innovative reuse. Aren’t we supported for the public good? Perhaps we should assert that academic prose is available for training models?
Upload datasets, generate reports, and download them in seconds!
OpenAI has just released a plug-in called Code Interpreter which is truly impressive. You need to have ChatGPT Plus to be able to turn it on. It then allows you to upload data and to use plain English to analyze it. You write requests/prompts like:
What are the top 20 content words in this text?
It then interprets your request and describes what it will try to do in Python. Then it generates the Python and runs it. When it has finished, it shows the results. You can see examples in this Medium article.
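To give a sense of what this looks like, here is a minimal sketch of the sort of Python one might expect for the "top 20 content words" prompt. It is my own illustration, not output captured from Code Interpreter. The stop-word list is a short hand-picked assumption and the function name top_content_words is hypothetical.

```python
import re
from collections import Counter

# A small, hand-picked stop-word list. This is an assumption for the sketch;
# a real analysis would use a fuller list.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "from",
    "had", "has", "have", "he", "her", "his", "i", "in", "is", "it", "its",
    "not", "of", "on", "or", "she", "that", "the", "their", "they", "this",
    "to", "was", "we", "were", "which", "with", "you",
}


def top_content_words(text, n=20):
    """Return the n most frequent words that are not stop words."""
    # Lowercase the text and pull out runs of letters (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    # Trim stray apostrophes, then drop empty strings and stop words.
    words = [t.strip("'") for t in tokens]
    words = [w for w in words if w and w not in STOP_WORDS]
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = (
        "Nature is the art of God. The study of nature is the study of "
        "the art by which nature works."
    )
    for word, count in top_content_words(sample, n=5):
        print(word, count)
```

What counts as a "content word" depends entirely on the stop-word list, which is one reason it is worth reading the generated code rather than just the results.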
I’ve been trying to see how I can use it to analyze a text. Here are some of the limitations:
It can’t handle large texts. It can be used to study a book-length text, but not a collection of books.
It frequently tries to load NLTK or other libraries and then fails. What is interesting is that it then tries other ways of achieving the same goal. For example, I asked for adjectives near the word “nature” and when it couldn’t load the NLTK POS library it then accessed a list of top adjectives in English and searched for those.
It can generate graphs of different sorts, but not interactives.
It is difficult to get the full transcript of an experiment, where by “full” I mean the Python code, the prompts, the responses, and any graphs generated. You can ask for an iPython notebook with the code, which you can download. Perhaps I can also get a PDF with the images.
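The fallback behaviour described above, where a failed attempt to load a part-of-speech tagger is replaced by a search against a list of common adjectives, can be sketched roughly as follows. This is my reconstruction of the idea, not the code that Code Interpreter produced. The adjective list is a tiny made-up sample, and the helper names (tag_with_nltk, tag_with_word_list, adjectives_near) are hypothetical. The only NLTK call used is nltk.pos_tag.

```python
import re

# A tiny sample of common English adjectives (an assumption for this sketch;
# the list Code Interpreter actually used is not known).
COMMON_ADJECTIVES = {
    "good", "new", "old", "great", "little", "wild", "beautiful", "human",
    "pure", "simple", "dark", "true", "strange", "free", "whole",
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def tag_with_nltk(tokens):
    """Tag tokens with NLTK; raises ImportError or LookupError if unavailable."""
    import nltk
    return nltk.pos_tag(tokens)


def tag_with_word_list(tokens):
    """Fallback: mark a token as an adjective (JJ) if it is in the list."""
    return [(t, "JJ" if t in COMMON_ADJECTIVES else "??") for t in tokens]


def adjectives_near(text, keyword="nature", span=3, tagger=None):
    """Return adjectives within `span` tokens of each occurrence of keyword."""
    tokens = tokenize(text)
    if tagger is not None:
        tagged = tagger(tokens)
    else:
        try:
            tagged = tag_with_nltk(tokens)
        except (ImportError, LookupError):
            # NLTK or its tagger model is missing, so use the word list.
            tagged = tag_with_word_list(tokens)
    found = []
    for i, (tok, _) in enumerate(tagged):
        if tok == keyword:
            window = tagged[max(0, i - span):i] + tagged[i + 1:i + 1 + span]
            found.extend(w for w, tag in window if tag.startswith("JJ"))
    return found


if __name__ == "__main__":
    text = "The wild and beautiful nature of the old forest"
    print(adjectives_near(text, tagger=tag_with_word_list))
```

The word-list approach will miss any adjective that is not on the list, so the results of the fallback are not equivalent to what a tagger would give.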
The Code Interpreter is in beta so I expect they will be improving it. It is nonetheless very impressive how it can translate prompts into processes. Particularly impressive is how it tries different approaches when things fail.
Code Interpreter could make data analysis and manipulation much more accessible. Without learning to code you can interrogate a data set and potentially run other processes. It is possible to imagine an unshackled Code Interpreter that could access the internet and do all sorts of things (like running a paper-clip business).
On Making in the Digital Humanities fills a gap in our understanding of digital humanities projects and craft by exploring the processes of making as much as the products that arise from it. The volume draws focus to the interwoven layers of human and technological textures that constitute digital humanities scholarship.
On Making in the Digital Humanities is finally out from UCL Press. The book honours the work of John Bradley and those in the digital humanities who share their scholarship through projects. Stéfan Sinclair and I first started work on it years ago and were soon joined by Julianne Nyhan and later Alexandra Ortolja-Baird. It is a pleasure to see it finished.
I co-wrote the Introduction with Nyhan and wrote a final chapter on “If Voyant then Spyral: Remembering Stéfan Sinclair: A discourse on practice in the digital humanities.” Stéfan passed during the editing of this.
The genius of Stéfan Sinclair, who passed in August 2020. Voyant was his vision from the time of his dissertation, for which he developed HyperPo.
The global team of people involved in Voyant including many graduate research assistants at the U of Alberta. See the About page of Voyant.
How Voyant built on ideas Stéfan and I developed in Hermeneutica about collaborative research as opposed to the inherited solitary paradigm.
In the image above you can see a Spyral code cell that outputs two stacked graphs where the same pattern of words is graphed over two different, but synchronized, corpora. You can thus compare the use of the pattern over time between the two datasets.
Replication as a practice for recovering an understanding of innovative technologies now taken for granted like tokenization or the KWIC. I talked about how Stéfan and I have been replicating important text processing technologies as a way of understanding the history of computing and the digital humanities. Spyral was the environment we developed for documenting our replications.
I then backed up and talked about the epistemological questions about knowledge and knowledge things in the digital age that grew out of and then inspired our experiments in replication. These go back to attempts to think through tools as knowledge things that bear knowledge in ways that discourse doesn’t. In this context I talked about the DIKW pyramid (data, information, knowledge, wisdom) that captures current views about the relationships between data and knowledge.
Finally I called for help to maintain and extend Voyant/Spyral. I announced the creation of a consortium to bring us together to sustain Voyant.
It was an honour to be able to give the Zampolli lecture on behalf of all the people who have made Voyant such a useful tool.
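For readers who have not met the technologies mentioned in the talk, here is a minimal sketch of tokenization and a keyword-in-context (KWIC) display in Python. It is only an illustration of the general idea. It is not a replication of any historical system and not Spyral code, and the function names and the default width of four words are my own assumptions.

```python
import re


def tokenize(text):
    """Split a text into word tokens (runs of letters, digits or underscores)."""
    return re.findall(r"\w+", text)


def kwic(text, keyword, width=4):
    """Return (left context, keyword, right context) for each hit.

    `width` is the number of tokens of context kept on each side.
    Matching ignores case.
    """
    tokens = tokenize(text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    text = "The cat sat on the mat. Another cat sat by the door."
    for left, key, right in kwic(text, "sat", width=2):
        # Right-align the left context so the keywords line up in a column.
        print(f"{left:>20}  {key}  {right}")
```

Even a toy like this forces decisions about what counts as a word and how much context to show, which is the sort of thing replication makes visible.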
AI: I am an AI created by OpenAI. How can I help you today?
Human: What do you think about the use of the Chinese room argument to defend the claim that a chatbot can never really understand what it is saying?
AI: The Chinese room argument is a thought experiment that was first proposed by John Searle.
I can’t help imagining how this could be used by a smart student to write a paper dialogically. One could ask questions, edit the responses, concatenate them, and write some bridging text to get a decent paper. Of course, it might be less work to just write the paper yourself.
Public Resource, a registered nonprofit organization based in California, has created a General Index to scientific journals. The General Index consists of a listing of n-grams, from unigrams to five-grams, extracted from 107 million journal articles.
The General Index is non-consumptive, in that the underlying articles are not released, and it is transformative in that the release consists of the extraction of facts that are derived from that underlying corpus. The General Index is available for free download with no restrictions on use. This is an initial release, and the hope is to improve the quality of text extraction, broaden the scope of the underlying corpus, provide more sophisticated metrics associated with terms, and other enhancements.
Access to the full corpus of scholarly journals is an essential facility to the practice of science in our modern world. The General Index is an invaluable utility for researchers who wish to search for articles about plants, chemicals, genes, proteins, materials, geographical locations, and other entities of interest. The General Index allows scholars and students all over the world to perform specialized and customized searches within the scope of their disciplines and research over the full corpus.
Access to knowledge is a human right and the increase and diffusion of knowledge depends on our ability to stand on the shoulders of giants. We applaud the release of the General Index and look forward to the progress of this worthy endeavor.
There must be some neat uses of this. I wonder if someone like Google might make a diachronic viewer similar to their Google Books Ngram Viewer available?
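To make concrete what a listing of n-grams "from unigrams to five-grams" involves, here is a minimal sketch in Python that counts every n-gram of length one to five in a text. It is an illustration only. I do not know how the General Index was actually extracted, and the function name ngram_counts and the simple tokenization are my own assumptions.

```python
import re
from collections import Counter


def ngram_counts(text, max_n=5):
    """Count all n-grams of length 1 to max_n in a text."""
    # Lowercase and split into word tokens.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    counts = ngram_counts("to be or not to be")
    for gram, count in counts.most_common(5):
        print(count, gram)
```

A table of counts like this is what makes such an index non-consumptive: you can see which phrases occur and how often, but you cannot read the articles from it.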
A short essay I wrote with Stéfan Sinclair on “Recapitulation, Replication, Reanalysis, Repetition, or Revivification” is now up in preprint form. The essay is part of a longer work on “Anatomy of tools: A closer look at ‘textual DH’ methodologies.” The longer work is a set of interventions looking at text tools. These came out of an ADHO SIG-DLS (Digital Literary Studies) workshop that took place in Utrecht in July 2019.
Our intervention at the workshop had the original title “Zombies as Tools: Revivification in Computer Assisted Interpretation” and concentrated on practices of exploring old tools – a sort of revivification or bringing back to life of zombie tools.
All 50,000+ of Trump’s tweets, instantly searchable
Thanks to Kaylin I found the Trump Twitter Archive: TTA – Search. It’s a really nice clean site that lets you search or filter Trump’s tweets from when he was elected to when his account was shut down on January 8th, 2021. You can also download the data if you want to try other tools.
I find reading his tweets now to be quite entertaining. Here are two back-to-back tweets that seem to almost contradict each other. First he boasts about the delivery of vaccines, and then talks about Covid as Fake News!
Jan 3rd 2021 – 8:14:10 AM EST: The number of cases and deaths of the China Virus is far exaggerated in the United States because of @CDCgov’s ridiculous method of determination compared to other countries, many of whom report, purposely, very inaccurately and low. “When in doubt, call it Covid.” Fake News!
Jan 3rd 2021 – 8:05:34 AM EST: The vaccines are being delivered to the states by the Federal Government far faster than they can be administered!