The Intercept and CBC have been collaborating on stories based on documents leaked by Edward Snowden. One recent story is about how Canadian Spies Collect Domestic Emails in Secret Security Sweep. CSE is collecting email going to the government and flagging suspect emails for analysts.
An earlier story, titled CSE’s Levitation project: Expert says spy agencies ‘drowning in data’ and unable to follow leads, describes the LEVITATION project, which monitors file uploads to free file hosting sites. The idea is to identify questionable uploads and then to figure out who is uploading the materials.
Glenn Greenwald (see the embedded video) questions the value of this sort of mass surveillance. He suggests that mass surveillance impedes the ability to find terrorist attacks. The problem is not getting more information, but connecting the dots of what one has. In fact, the slides linked from both stories show that CSE is struggling with too much information and with analytical challenges.
Stéfan Sinclair and I just finished a workshop on My Very Own Voyant. The workshop focused on VoyantServer, which lets you run Voyant on your local machine. There are all sorts of reasons to run locally:
- It runs faster
- You can upload large texts faster
- It can process larger text corpora
- You can control the server
- You can keep your corpora confidential
You can download VoyantServer and read instructions here.
From Geoff I learned about The Isolator, A Bizarre Helmet For Encouraging Concentration (1925). The Isolator was developed in 1925 by Hugo Gernsback, a science fiction pioneer and editor of Science and Invention magazine. The idea is to force you to focus on your writing (with lots of oxygen).
One wonders if it works. Could it be even more useful now?
The New York Times has an interesting way of visualizing fashion that you can see in their article Front Row to Fashion Week – Interactive Feature. They have abstracted the colour hues to create small swatches of different designers who showed at the New York Fashion Week. These “sparklines” or sparkboxes are an interesting way to compare the shows by designers.
On July 11th and 12th I was at a conference in Saskatoon on Social Digital Scholarly Editing. This conference was organized by Peter Robinson and colleagues at the University of Saskatchewan. I kept conference notes here.
I gave a paper on “Social Texts and Social Tools.” My paper argued for text analysis tools as a “reader” of editions. I took the extreme case of big data text mining and what scraping/mining tools want in a text and don’t want in a text. I took this extreme view to challenge the scholarly editing view that the more interpretation you put into an edition the better. Big data wants to automate the process of gathering and mining texts – big data wants “clean” texts that don’t have markup, annotations, metadata and other interventions that can’t be easily removed. The variety of markup in digital humanities projects makes it very hard to clean them.
The response was appreciative of the provocation, but (thankfully) not convinced that big data was the audience of scholarly editors.
We are finally getting results in a long slow process of trying to study tool discourse in the digital humanities. Amy Dyrbe and Ryan Chartier are building a corpus of discourse around tools that includes tool reviews, articles about what people are doing with tools, web pages about tools and so on. We took the first coherent chunk and Ryan has been analyzing it with R. The graph above shows which years have the most characters. My hypothesis was that tool reviews and discourse dropped off in the 1990s as the web became more important. This seems to be wrong.
Here are the high-frequency words (with stop words removed). Note the modal verbs “can”, “will”, and “may.” They indicate the potentiality of tools.
“ii” 1514 (Not sure why)
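A word list like this takes only a few lines to produce. Here is a minimal sketch in Python (not the R that Ryan is using). The stop list and the sample sentence are made up for illustration and are not our corpus or our stop list, and the function name `high_frequency_words` is hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "for", "with"}

def high_frequency_words(text, stop_words=STOP_WORDS, top_n=10):
    """Return the top_n (word, count) pairs in text, ignoring stop words."""
    # Lowercase, then keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(top_n)

# Made-up sample text, only to show the call.
sample = "The tool can index the text. A tool may fail, and the tool will retry."
print(high_frequency_words(sample, top_n=3))
```

Whether modal verbs like “can” and “will” show up at all depends on the stop list. Some common lists include them, so it is worth checking the list before reading anything into their presence or absence.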
I have been working for a while on archiving the Globalization Compendium, a project I contributed to. Yesterday I got it archived in two institutional repositories:
In both cases there is a Zip of a BagIt bag with the XML files, code and other documentation from the site. My first major deposit.
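For anyone curious what a zipped BagIt bag involves, here is a rough sketch in Python using only the standard library. It is not the script I used for the deposit. The function names (`make_bag`, `zip_bag`), the file names, the choice of SHA-256 and the version line are all assumptions for illustration. A bag is just a `data/` directory holding the payload, a `bagit.txt` declaration, and a manifest listing a checksum for every payload file.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def make_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write minimal BagIt tag files."""
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # also creates bag_dir

    # Bag declaration (version chosen here is an assumption).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "checksum  relative/path" line per file.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir

def zip_bag(bag_dir):
    """Zip the bag so that the archive contains the bag directory itself."""
    bag_dir = Path(bag_dir)
    return shutil.make_archive(
        str(bag_dir), "zip", root_dir=bag_dir.parent, base_dir=bag_dir.name
    )

# Demo on a throwaway directory with one made-up file.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "site"
    site.mkdir()
    (site / "index.xml").write_text("<doc/>", encoding="utf-8")
    bag = make_bag(site, Path(tmp) / "compendium_bag")
    print(zip_bag(bag))
```

For a real deposit you would probably want an existing BagIt tool rather than a hand-rolled script, since a repository may expect extra tag files such as `bag-info.txt`.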
The New York Times now has an article on the Criminal Intent project I was part of. See Old Bailey Trials Are Tabulated for Scholars Online. They quote a historian who is sceptical of the results of mining, though he appreciates the resource.
“The Old Bailey Online project has done a great service in making those sources widely (and costlessly) available,” Mr. Langbein wrote in an e-mail. But he complained that the claims about data mining have “a breathless quality: ‘you can expect big things from us,’ but as yet it’s all method and no results.” He said that the new findings belittle the work of a generation of scholars who focused on the 18th century as the turning point in the evolution of the criminal justice system.
Alas, it seems he didn’t read our report, only the summary in the Chronicle. It is easy to use cute phrases like “breathless quality”, but is he right? Time will tell, but I think the historians on our team have backed up the results found with mining, and they never belittled the work of previous scholars; we saw ourselves building on it.
What can mining do? I think mining can give you a big picture so that you see the forest rather than the trees in a way that no one could before. Conclusions about the shape of the forest have to be checked against other evidence, but the results of mining are evidence that is not breathless even if it takes your breath away. As Bill Turkel put it,
Mr. Turkel, who developed some of the digital tools, said that data mining reveals unexpected trends and connections that no one would have thought to look for before. Previous scholars “tended to cherry-pick anecdotes without having a sense that it was possible to measure all of that text and treat the whole archive as a single unit,” he said.
Of course, if you then leverage traditional evidence to buttress your argument then the mining is forgotten or trivialized.
I had heard about Bill Turkel’s ‘super secret’ project and how he had decided to keep the idea of the project secret but share the method, which is the opposite of what we usually do. As I am now on research leave (sabbatical) and working on 5 books (ha!) I thought I should learn from Bill. Here is the link to his excellent research workflow, How To « William J Turkel. What I like is that it is all stuff you can do with off-the-shelf tools, though not necessarily free ones.
The CIRCA Histories and Archives group I am part of is organizing the University of Alberta’s first Digitization Day.
This one-day event is a chance for research projects that are digitizing evidence to meet up with each other and with units on campus that provide relevant research services. Projects that are creating digital archives of different sorts will give short presentations as will units on campus that support research.
The idea is to bring a lot of digitization projects together to learn about each other and what is happening on campus. My sense is that we have hit a critical mass on campus and, now that we have a trusted digital repository, ERA (Education and Research Archive), it is time to start talking and sharing knowledge. Each project should not have to reinvent itself.