Word2Vec Vis of Pride and Prejudice

Paolo showed me a neat demonstration, Word2Vec Vis of Pride and Prejudice. Lynn Cherny trained a Word2Vec model on Jane Austen’s novels and then used it to find close matches for key words. She then shows the text of a novel with the words replaced by their matches in the language of Austen. It serves as a sort of demonstration of how Word2Vec works.
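To make the replacement step concrete, here is a minimal sketch. It is not Cherny’s code: the tiny hand-made vectors stand in for a trained Word2Vec model, and the names VECTORS, AUSTEN_VOCAB, closest_austen_word and austenize are all hypothetical. It only illustrates the idea of swapping a word for its nearest neighbour by cosine similarity.

```python
import math

# Toy 3-dimensional vectors standing in for a trained model.
# These values are invented for illustration, not learned from any corpus.
VECTORS = {
    "car": (0.9, 0.1, 0.0),
    "carriage": (0.8, 0.2, 0.1),
    "phone": (0.0, 0.9, 0.1),
    "letter": (0.1, 0.8, 0.2),
    "ball": (0.1, 0.1, 0.9),
}

# Assumed target vocabulary: words we pretend belong to the "language of Austen".
AUSTEN_VOCAB = {"carriage", "letter", "ball"}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_austen_word(word):
    """Return the target-vocabulary word nearest to `word`; unknown words pass through."""
    if word not in VECTORS:
        return word
    return max(AUSTEN_VOCAB, key=lambda cand: cosine(VECTORS[word], VECTORS[cand]))


def austenize(text):
    """Replace every word that has a vector with its closest target-vocabulary match."""
    return " ".join(closest_austen_word(w) for w in text.split())


print(austenize("she took the car and checked her phone"))
```

With a real model the vectors would have hundreds of dimensions and the vocabulary would come from the training corpus, but the lookup would work in the same way.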

Every Noise at Once

Ted Underwood, in a talk at the Novel Worlds conference, talked about a fascinating project, Every Noise at Once. This project tries to map the genres of music so that you can explore them by clicking and listening. You should, in theory, be able to tell the difference between “german techno” and “diva house” by listening. (I’m not musically literate enough to.)

The structure of recent philosophy (II) · Visualizations

In this codebook we will investigate the macro-structure of philosophical literature. As a base for our investigation I have collected about fifty-thousand reco

Stéfan sent me a link to this interesting post, The structure of recent philosophy (II) · Visualizations. Maximilian Noichl has done a fascinating job using the Web of Science to develop a model of the field of Philosophy since the 1950s. In this post he describes his method and the resulting visualization of clusters (see above). In a later post (version III of the project) he gets a more nuanced visualization that seems truer to the breadth of what people do in philosophy. The version above is heavily weighted to Anglo-American analytic philosophy, while version III has more history of philosophy and continental philosophy.

Here is the final poster (PDF) for version III.

I can’t help wondering if his snowball approach doesn’t bias the results. What if one used full text of major journals?

Writing with the machine

“…it’s like writing with a deranged but very well-read parrot on your shoulder.”

Robin Sloan, of Mr. Penumbra’s 24-hour Bookstore fame, has been talking about Writing with the machine. He was inspired by work like Andrej Karpathy’s blog post on The Unreasonable Effectiveness of Recurrent Neural Networks and Bowman et al.’s Generating Sentences from a Continuous Space to try developing a neural net that could generate text. He used a collection of early science fiction from the Internet Archive as a training corpus and created different text generation tools, like the one in the short video that you can see above and hear explained in this Eyeo video.

One of the points he emphasizes is that he didn’t do this just for the fun of seeing strange phrases generated, but wants to use it seriously as a writing aid.

I can’t help wondering if this could be used philosophically. Could we generate philosophical or ethical phrases in response to questions?

CSDH and CGSA 2018

This year we had busy CSDH and CGSA meetings at Congress 2018 in Regina. My conference notes are here. Some of the papers I was involved in include:

CSDH-SCHN:

  • “Code Notebooks: New Tools for Digital Humanists” was presented by Kynan Ly and made the case for notebook-style programming in the digital humanities.
  • “Absorbing DiRT: Tool Discovery in the Digital Age” was presented by Kaitlyn Grant. The paper made the case for tool discovery registries and explained the merger of DiRT and TAPoR.
  • “Splendid Isolation: Big Data, Correspondence Analysis and Visualization in France” was presented by me. The paper talked about FRANTEXT and correspondence analysis in France in the 1970s and 1980s. I made the case that the French were doing big data and text mining long before we were in the Anglophone world.
  • “TATR: Using Content Analysis to Study Twitter Data” was a poster presented by Kynan Ly, Robert Budac, Jason Bradshaw and Anthony Owino. It showed IPython notebooks for analyzing Twitter data.
  • “Climate Change and Academia – Joint Panel with ESAC” was a panel I was on that focused on alternatives to flying for academics.

CGSA:

  • “Archiving an Untold History” was presented by Greg Whistance-Smith. He talked about our project to archive John Szczepaniak’s collection of interviews with Japanese game designers.
  • “Using Salience to Study Twitter Corpora” was presented by Robert Budac who talked about different algorithms for finding salient words in a Twitter corpus.
  • “Political Mobilization in the GG Community” was presented by ZP who talked about a study of a Twitter corpus that looked at the politics of the community.

Also, a PhD student I’m supervising, Sonja Sapach, won the CSDH-SCHN (Canadian Society for Digital Humanities) Ian Lancashire Award for Graduate Student Promise at CSDHSCHN18 at Congress. The Award “recognizes an outstanding presentation at our annual conference of original research in DH by a graduate student.” She won the award for a paper on “Tagging my Tears and Fears: Text-Mining the Autoethnography.” She is completing an interdisciplinary PhD in Sociology and Digital Humanities. Bravo Sonja!

Re-Imagining Education In An Automating World conference at George Brown

On May 25th I had a chance to attend a gem of a conference organized by the Philosophy of Education (POE) committee at George Brown. They put together a conference with different modalities, from conversations to formal talks to group work. The topic was Re-Imagining Education in An Automating World (see my conference notes here) and this conference is a seed for a larger one next year.

I gave a talk on Digital Citizenship at the end of the day where I tried to convince people that:

  • Data analytics are now a matter of citizenship (we all need to understand how we are being manipulated).
  • We therefore need to teach data literacy in the arts and humanities, so that
  • Students are prepared to contribute to and critique the ways analytics are used and deployed.
  • This can be done by integrating data and analytical components in any course using field-appropriate data.

 

Too Much Information and the KWIC

A paper that Stéfan Sinclair and I wrote about Peter Luhn and the Keyword-in-Context (KWIC) has just been published by the Fudan Journal of the Humanities and Social Sciences, Too Much Information and the KWIC | SpringerLink. The paper is part of a series that replicates important innovations in text technology, in this case, the development of the KWIC by Peter Luhn at IBM. We use that as a moment to reflect on the datafication of knowledge after WW II, drawing on Lyotard.
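For readers who haven’t met a KWIC, here is a minimal sketch of the general idea in Python: each occurrence of a keyword is listed with a few words of context on either side. This is not the replication from the paper and not Luhn’s original procedure; the function name kwic, the width parameter and the sample sentence are assumptions made for illustration.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of keyword.

    Context is measured in words; matching is case-insensitive.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    rows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Invented sample text, just to show the aligned display.
sample = (
    "Too much information is a problem. "
    "Information retrieval needs an index of information."
)

for left, kw, right in kwic(sample, "information"):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} | {kw} | {right}")
```

Lining the keyword up in a central column is what lets a reader scan many contexts quickly, which is the point of the display.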

Google AI experiment has you talking to books

Google has announced some cool text projects. See Google AI experiment has you talking to books. One of them, Talk to Books, lets you ask questions or type statements and get answers that are passages from books. This strikes me as a useful research tool as it allows you to see some (book) references that might be useful for defining an issue. The project is somewhat similar to the Veliza tool that we built into Voyant. Veliza is given a particular text and then uses an Eliza-like algorithm to answer you with passages from the text. Needless to say, Talk to Books is far more sophisticated and is not based simply on word searches. Veliza, on the other hand, can be reprogrammed and you can specify the text to converse with.


How Trump Consultants Exploited the Facebook Data of Millions

Cambridge Analytica harvested personal information from a huge swath of the electorate to develop techniques that were later used in the Trump campaign.

The New York Times has just published a story about How Trump Consultants Exploited the Facebook Data of Millions. The story is about how Cambridge Analytica, the US arm of SCL, a UK company, gathered a massive dataset from Facebook with which to do “psychometric modelling” in order to benefit Trump.

The Guardian has been reporting on Cambridge Analytica for some time – see their Cambridge Analytica Files. The service they are supposed to have provided with this massive dataset was to model types of people and their needs/desires/politics and then help political campaigns, like Trump’s, through microtargeting to influence voters. Using the models a campaign can create content tailored to these psychometrically modelled micro-groups to shift their opinions. (See articles by Paul-Olivier Dehaye about what Cambridge Analytica does and has.)

What is new is that there is a (Canadian) whistleblower from Cambridge Analytica, Christopher Wylie, who was willing to talk to the Guardian and others. He is “the data nerd who came in from the cold” and he has a trove of documents that contradict what others said.

The Intercept has an earlier and related story about how Facebook Failed to Protect 30 Million Users From Having Their Data Harvested By Trump Campaign Affiliate. This tells how people were convinced to download a Facebook app that then took their data and that of their friends.

It is difficult to tell how effective psychometric profiling with data is and if it can really be used to sway voters. What is clear, however, is that Facebook is not really protecting their users’ data. To some extent they’re set up to monetize such psychometric data by convincing those who buy access to the data that they can use it to sway people. The problem is not that it can be done, but that Facebook didn’t get paid for this and are now getting bad press.

Distant Reading after Moretti

The question I want to explore today is this: what do we do about distant reading, now that we know that Franco Moretti, the man who coined the phrase “distant reading,” and who remains its most famous exemplar, is among the men named as a result of the #MeToo movement.

Lauren Klein has posted an important blog entry on Distant Reading after Moretti. This essay is based on a talk delivered at the 2018 MLA convention for a panel on Varieties of Digital Humanities. Klein asks about distant reading and whether it shelters sexual harassment in some way. She asks us to put not just the persons, but the structures of distant reading and the digital humanities under investigation. She suggests that it is “not a coincidence that distant reading does not deal well with gender, or with sexuality, or with race.” One might go further and ask if the same isn’t true of the digital humanities in general or the humanities, for that matter. Klein then suggests some things we can do about it:

  • We need more accessible corpora that better represent the varieties of human experience.
  • We need to question our models and ask about what is assumed or hidden.