The Common Crawl is a project that has been crawling the web and making an open corpus of web data from the last 7 years available for research. Their crawl corpus is petabytes of data and is available as WARCs (Web Archives). For example, their 2013 dataset is 102TB and has around 2 billion web pages. Their collection is not as complete as the Internet Archive, which goes back much further, but it is available in large datasets for research.
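To give a feel for what a WARC looks like, here is a minimal sketch of reading records from an uncompressed WARC stream using only the Python standard library. It assumes the basic record layout (a version line, header lines, a blank line, then a payload of Content-Length bytes). The names read_warc_records and make_record are hypothetical helpers written for this illustration, not part of any Common Crawl tooling, and real crawl files are compressed and far larger, so a dedicated library would be the sensible choice in practice.

```python
import io


def make_record(uri, payload):
    """Build one tiny WARC record for demonstration (hypothetical helper)."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record is followed by two CRLF pairs.
    return header + payload + b"\r\n\r\n"


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


if __name__ == "__main__":
    data = make_record("http://example.com/", b"<html>hello</html>")
    for headers, payload in read_warc_records(io.BytesIO(data)):
        print(headers["WARC-Target-URI"], len(payload), "bytes")
```

Reading by Content-Length rather than scanning for a delimiter matters because a page's payload can itself contain blank lines.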
The Naylor Report (PDF) about research funding in Canada is out and we put it in Voyant. Here are some different views:
- Here is the default Corpus View
- Here it is in the Topics (Topic Modelling) View
- Here is the Scatter Plot (Correspondence Analysis) View (see image above)
Domenico Fiormonte has recently blogged about an interesting document he has by Father Busa that relates to a difficult moment in the history of the digital humanities in Italy in 2002. The two page “Conditional Agreement”, which I translate below, was given to Domenico and explained the terms under which Busa would agree to sign a letter to the Minister (of Education and Research) Moratti in response to Moratti’s public statement about the uselessness of humanities informatics. A letter was being prepared to be signed by a large number of Italian (and foreign) academics explaining the value of what we now call the digital humanities. Busa had the connections to get the letter published and taken seriously for which reason Domenico visited him to get his help, which ended up being conditional on certain things being made clear, as laid out in the document. Domenico kept the two pages Busa wrote and recently blogged about them. As he points out in his blog, these two pages are a mini-manifesto of Father Busa’s later views of the place and importance of what he called textual informatics. Domenico also points out how political is the context of these notes and the letter eventually signed and published. Defining the digital humanities is often about positioning the field in the larger academic and public political spheres we operate in.
Arianne Mayer has posted a thorough review of our book Hermeneutica on Sens Public under the title, Hermeneutica, une expérience numérique de l’interprétation (in French). She notes the centrality of dialogue and, in the spirit of dialogue, ends with some good questions about silence to keep the dialogue going:
To continue the dialogue, it would be worthwhile to put Hermeneutica in conversation with theories of reading such as Umberto Eco’s, or with the aesthetics of reception represented by Hans Robert Jauss and Wolfgang Iser. In Umberto Eco’s view (Lector in fabula), there is something to interpret only where the text falls silent. It is all the places of ambivalence, the implicit propositions and the gaps in the work, prompting the cooperation of a reader who puts something of their own into the text to fill in the blanks, that are distinctive of how literature functions. Wolfgang Iser (L’Appel du texte), for his part, asserts that, far from deducing the meaning of a work from its most frequently used words, “what is essential in a text is what it passes over in silence”.
How can we analyze the gaps, the silences, or that which has not been written?
From the BBC a story about US start-up Geofeedia ‘allowed police to track protesters’. Geofeedia is apparently using social media data from Twitter, Facebook and Instagram to monitor activists and protesters for law enforcement. Access to these social media was changed once the ACLU reported on the surveillance product. The ACLU discovered the agreements with Geofeedia when they requested public records of California law enforcement agencies. Geofeedia was boasting to law enforcement about their access. The ACLU has released some of the documents of interest including a PDF of a Geofeedia Product Update email discussing “sentiment” analytics (May 18, 2016).
From the Geofeedia web site I was surprised to see that they are offering solutions for education too.
An article about authorship attribution led me to this nice site on Common Errors in English Usage. The site is for a book with that title, but the author Paul Brians has organized all the errors into a hypertext here. For example, here is the entry on why you shouldn’t use enjoy to.
What does this have to do with authorship attribution? In a paper on Authorship Identification on the Large Scale the authors try using common errors as features to discriminate potential authors.
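As a rough illustration of the idea, here is a minimal sketch that turns a text into a vector of common-error rates. The error list is an assumption made for this example: only “enjoy to” comes from the entry mentioned above, the other phrases are illustrative, and the paper’s actual feature set is not reproduced here. The function name error_features is also hypothetical.

```python
import re

# Illustrative list only. "enjoy to" is the entry mentioned above;
# the remaining phrases are assumed examples, not the paper's features.
COMMON_ERRORS = ["enjoy to", "could of", "alot", "irregardless"]


def error_features(text, errors=COMMON_ERRORS):
    """Return the rate of each error per 1,000 words of text."""
    lowered = text.lower()
    # Count words so that long and short texts are comparable.
    n_words = max(len(re.findall(r"[a-z']+", lowered)), 1)
    features = []
    for err in errors:
        # Word boundaries stop "alot" from matching inside other words.
        pattern = r"\b" + re.escape(err) + r"\b"
        count = len(re.findall(pattern, lowered))
        features.append(1000.0 * count / n_words)
    return features


if __name__ == "__main__":
    sample = "I enjoy to read alot. I could of gone."
    for err, rate in zip(COMMON_ERRORS, error_features(sample)):
        print(f"{err!r}: {rate:.1f} per 1,000 words")
```

Vectors like these could then be passed to whatever classifier one prefers, alongside more conventional features such as function-word frequencies.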
The Canadian Writing Research Collaboratory (CWRC) today launched its Collaboratory. The Collaboratory is a distributed editing environment that allows projects to edit scholarly electronic texts (using CWRC Writer), manage editorial workflows, and publish collections. There are also links to other tools like CWRC Catalogue and Voyant (which I am involved in). There is an impressive set of projects already featured in CWRC, but it is open to new projects and designed to help them.
Susan Brown deserves a lot of credit for imagining this, writing the CFI (and other) proposals, leading the development and now managing the release. I hope it gets used as it is a fabulous layer of infrastructure designed by scholars for scholars.
One important component in CWRC is CWRC-Writer, an in-browser XML editor that can be hooked into content management systems like the CWRC back-end. It allows for stand-off markup and connects to entity databases for tagging entities in standardized ways.
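For readers unfamiliar with stand-off markup, here is a minimal sketch of the general idea: the text is left untouched and annotations are stored separately as character offsets that point into it. This is not CWRC-Writer’s actual data model. The Annotation class, the helper functions, the sample sentence, and the “example:” entity identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # offset of the first character, inclusive
    end: int    # offset just past the last character, exclusive
    tag: str    # kind of entity, e.g. "person"
    ref: str    # identifier in some entity database (hypothetical here)


def annotate(text, phrase, tag, ref):
    """Create an annotation for the first occurrence of phrase in text."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Annotation(start, start + len(phrase), tag, ref)


def resolve(text, annotations):
    """Return (tag, covered text, ref) triples in document order."""
    ordered = sorted(annotations, key=lambda a: (a.start, a.end))
    return [(a.tag, text[a.start:a.end], a.ref) for a in ordered]


TEXT = "Ada Lovelace wrote to Charles Babbage."
ANNOTATIONS = [
    annotate(TEXT, "Charles Babbage", "person", "example:person/2"),
    annotate(TEXT, "Ada Lovelace", "person", "example:person/1"),
    # Overlapping spans are unproblematic because nothing is nested inline.
    annotate(TEXT, "Lovelace", "surname", "example:name/1"),
]

if __name__ == "__main__":
    for tag, covered, ref in resolve(TEXT, ANNOTATIONS):
        print(f"{tag:8} {covered!r:20} -> {ref}")
```

Because the annotations live outside the text, overlapping or competing layers of tagging can coexist, which is awkward to express with inline XML elements.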
At the European Summer University in Digital Humanities 2016 I was lucky to be able to attend some sessions on Stylometry run by Maciej Eder. In his historical review he mentioned people like Valla and Mendenhall, but also mentioned a fellow Pole, Wincenty Lutoslawski, whose book The origin and growth of Plato’s logic; with an account of Plato’s style and of the chronology of his writings (1897) is the first to use the term “stylometry”. Lutoslawski develops a Theory of Stylometry and reviews “500 peculiarities of Plato’s style” as part of his work on Plato’s logic. The nice thing is that the book is available through the Internet Archive.
Eder has a nice page about the work he and others in the Computational Stylistics Group are doing. In the workshop sessions I was able to attend he showed us how to set up and run his “stylo” package (PDF) that provides a simple user interface over R for doing stylometry. He also showed us how to then use Gephi for network visualization.
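To make the kind of computation concrete, here is a minimal sketch of one widely used stylometric distance, Burrows’s Delta, written in plain Python rather than R. It is an illustration only, not the stylo package’s code and not necessarily what was covered in the workshop. The function names, the default of 50 words, and the use of the population standard deviation are all assumptions of this sketch.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def delta_matrix(texts, n_words=50):
    """Return a matrix of pairwise Delta distances between the texts."""
    token_lists = [tokenize(t) for t in texts]
    corpus_counts = Counter()
    for tokens in token_lists:
        corpus_counts.update(tokens)
    # Use the most frequent words of the whole corpus as features.
    vocab = [w for w, _ in corpus_counts.most_common(n_words)]
    if not vocab:
        return [[0.0] * len(texts) for _ in texts]

    # Relative frequency of each vocabulary word in each text.
    profiles = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = max(len(tokens), 1)
        profiles.append([counts[w] / total for w in vocab])

    # z-score each word's frequency across the corpus.
    z_profiles = [[0.0] * len(vocab) for _ in texts]
    for j in range(len(vocab)):
        column = [p[j] for p in profiles]
        mu, sigma = mean(column), pstdev(column)
        for i, value in enumerate(column):
            z_profiles[i][j] = (value - mu) / sigma if sigma > 0 else 0.0

    # Delta is the mean absolute difference between two z-score profiles.
    return [
        [mean(abs(a - b) for a, b in zip(za, zb)) for zb in z_profiles]
        for za in z_profiles
    ]


if __name__ == "__main__":
    samples = ["the the the of of a", "the the the of of a", "a a a of of the"]
    for row in delta_matrix(samples):
        print(["%.3f" % d for d in row])
```

A distance matrix of this kind is the sort of thing one could export as a weighted edge list and load into Gephi to draw a network of texts.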
They know is a must-see design project by Christian Gross from the Interface Design Programme at University of Applied Sciences in Potsdam (FHP), Germany. The idea behind the project, described in the They Know showcase for FHP, is:
I could see in my daily work how difficult it was to inform people about their privacy issues. Nobody seemed to care. My hypothesis was that the whole subject was too complex. There were no examples, no images that could help the audience to understand the process behind the mass surveillance.
The answer is to mock up a design fiction of an NSA surveillance dashboard based on what we know and then a video describing a fictional use of it to track an architecture student from Berlin. It seems to me the video and mock designs nicely bring together a number of things we can infer about the tools they have.
As I get ready to fly back to Germany I’m finishing my conference notes on Congress 2016 (CSDH and CGSA). Calgary was nice and not too hot for Congress, and we were welcomed by a malware attack on Congress that meant that many employees couldn’t use their machines. Nevertheless, the conference seemed very well organized and the campus lovely.
My conference notes cover mostly the Canadian Society for Digital Humanities, but also DHSI at Congress, where I presented CWRC for Susan Brown, and the last day of the Canadian Game Studies Association. Here are some general reflections.
- I am impressed by how the CGSA is growing and how vital it is. It has as many attendees as CSDH, but they are younger and enthusiastic rather than tired. Much of the credit goes to the long-term leadership of people like Jen Jensen.
- CSDH had some terrific keynotes this year, starting with Ian Milligan, then Tara McPherson, and finally Diane Jakacki.
- It was great to see people coming up from the USA as CSDH/SCHN gets a reputation for being a welcoming conference in North America.
- Stéfan Sinclair and I had a book launch for Hermeneutica: Computer-Assisted Interpretation in the Humanities at which Chad Gaffield said a few words. It was gratifying that so many friends came out for this.
At the CSDH AGM we passed a motion to adopt Guidelines on Digital Scholarship in the Humanities (Google Doc). The Guidelines discuss the value of digital work and provide guidelines for evaluation:
Programs of research, which are by nature exploratory, may require faculty members to take up modes of research that depart from methods they have previously used, therefore the form the resulting scholarship takes should not prejudice its evaluation. Original works in new media forms, whether digital or other, should be evaluated as scholarship following best practices if so presented. Likewise, researchers should be encouraged to experiment with new forms when disseminating knowledge, confident that their experiments will be fairly evaluated.
The Guidelines have a final section on Documented Deposit:
Digital media have not only expanded the forms that research can take, but research practices are also changing in the face of digital distribution and open access publishing. In particular we are being called on to preserve research data and to share new knowledge openly. Universities that have the infrastructure should encourage faculty to deposit not only digital works, but also curated datasets and preprint versions of papers/monographs with documentation in an open access form. These can be deposited with an embargo in digital archives as part of good practice around research dissemination and preservation. The deposit of work, including online published work, even if it is available elsewhere, ensures the long-term preservation by ensuring that there are copies in more than one place. Further, libraries can then ensure that the work is not only preserved, but is discoverable in the long term as publications come and go.