Digital Humanities Talks at the 2013 MLA Convention

The ACH has put together a useful Guide to Digital-Humanities Talks at the 2013 MLA Convention. I will be presenting at various events including:

Juxta Commons

Vis-icons

From Humanist I just learned about Juxta Commons. This is a web version of the earlier downloadable Java tool. The new version still has the lovely interface that shows the differences between variants. The commons, however, builds on the personal computer tool by being a place where collations can be kept. Others can find and explore your collations. You can search the commons and find collation projects.
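To give a sense of what a collation involves, here is a minimal sketch in Python of comparing two witnesses word by word. It uses the standard library's difflib, not whatever Juxta actually does under the hood, and the function name collate and the two sample witnesses are made up for illustration.

```python
import difflib


def collate(base: str, witness: str) -> list[tuple[str, str, str]]:
    """Return (operation, base words, witness words) for each span that differs."""
    a, b = base.split(), witness.split()
    # autojunk=False keeps difflib from discarding very frequent words in long texts
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Two invented witnesses of the same line (not from any real edition)
base = "the quick brown fox jumps over the lazy dog"
witness = "the quick red fox leaps over the lazy dog"
for change in collate(base, witness):
    print(change)
```

A real collation tool also has to handle tokenization, punctuation, transpositions and more than two witnesses, which this sketch ignores.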

Another interesting feature is that they have Google ads if you search the commons. The search is “powered by Google” so perhaps that comes with the service.

D3.js – Data-Driven Documents

Stéfan pointed me to this new visualization library, D3.js – Data-Driven Documents. The image above is from their Co-Occurrence Matrix (of characters in Les Misérables.) Here is what they say in the About:

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.

Take a look at the examples Gallery. There are lots of ideas here for text visualization.

FBI quietly forms secretive Net-surveillance unit

From Slashdot another story hinting at how government agencies are organizing to intercept and interpret Internet data. See FBI quietly forms secretive Net-surveillance unit.

My guess is that data mining large amounts of data produces so many false positives that organizations like the NSA and FBI have to set up large units to follow up on results. There is an interesting policy paper by Jeff Jonas and Jim Harper on Effective Counterterrorism and the Limited Role of Predictive Data Mining that argues that predictive mining isn’t worth it. The cost of false positives for industry when they use predictive data mining (predicting who might buy your product) is acceptable. The costs of false positives for counterterrorism are prohibitive as it takes trained agents away from better uses of their time. I doubt anyone in this climate is willing to give up on mining, which is why The NSA is Building the Country’s Biggest Spy Center.
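The false positive problem is easy to see with some back-of-the-envelope arithmetic. Here is a small Python sketch. Every number in it is an assumption I picked for illustration (population size, number of real targets, and a classifier that is right 99% of the time); none of it comes from the Jonas and Harper paper or from any agency.

```python
def follow_up_load(population, true_targets, sensitivity, false_positive_rate):
    """Return (people flagged, share of flagged people who are real targets)."""
    true_hits = true_targets * sensitivity
    false_alarms = (population - true_targets) * false_positive_rate
    flagged = true_hits + false_alarms
    return flagged, true_hits / flagged


# Hypothetical inputs: 300 million people, 3,000 real targets,
# a 99% detection rate and a 1% false positive rate.
flagged, precision = follow_up_load(
    population=300_000_000,
    true_targets=3_000,
    sensitivity=0.99,
    false_positive_rate=0.01,
)
print(f"{flagged:,.0f} people flagged; {precision:.2%} of them are real targets")
```

Even with a classifier this good, the rare real targets are swamped by false alarms, and every one of those false alarms needs someone to follow up.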

I wonder if we will ever know if money spent on voice and text mining is useful in counterintelligence? Perhaps the rumour of the possibility of it working is enough?

Twitter hands your data to the highest bidder, but not to you

The Globe and Mail had a very interesting article on how Twitter hands your data to the highest bidder, but not to you. The article talks about how Twitter is archiving your data, selling it, but not letting you access your old tweets. The article mentions that DataSift is one company that has been licensed to mine the Twitter archives. DataSift presents itself as “the world’s most powerful and scalable platform for managing large volumes of information from a variety of social data sources.” In effect they do real-time text analysis for industry. Here is what they say in What we do:

DataSift offers the most powerful and sophisticated tools for extracting value from Social Data. The amount of content that Internet users are creating and sharing through Social Media is exploding. DataSift offers the best tools for collecting, filtering and analyzing this data.

Social Data is more complicated to process and analyze because it is unstructured. DataSift’s platform has been built specifically to process large volumes of this unstructured data and derive value from it.

One thing that DataSift has is a curation language called CSDL (Curated Stream Definition Language) for querying the cloud of data they gather. They provide an example of what you can do with it:

Here’s an example, just for illustration, of a complex filter that you could build with only four lines of CSDL code: imagine that you want to look at information from Twitter that mentions the iPad. Suppose you want to include content written in English or Spanish but exclude any other languages, select only content written within 100 kilometers of New York City, and exclude Tweets that have been retweeted fewer than five times. You can write that in just four lines of CSDL!
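I don't have their four lines of CSDL to show, so I won't guess at the syntax. Instead, here is a sketch in plain Python of the logic the filter describes. The Post record, its field names and the matches function are hypothetical and have nothing to do with DataSift's actual schema; the New York coordinates are approximate.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

NYC = (40.7128, -74.0060)  # approximate latitude/longitude of New York City


@dataclass
class Post:  # hypothetical record, not DataSift's data model
    text: str
    language: str
    lat: float
    lon: float
    retweets: int


def km_between(a, b):
    """Great-circle distance in kilometres using the haversine formula."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km is the mean Earth radius


def matches(post):
    """Mentions the iPad, English or Spanish, within 100 km of NYC, at least 5 retweets."""
    return (
        "ipad" in post.text.lower()
        and post.language in {"en", "es"}
        and km_between((post.lat, post.lon), NYC) <= 100
        and post.retweets >= 5
    )


posts = [
    Post("Love my new iPad", "en", 40.73, -73.99, 12),   # passes every test
    Post("Mi iPad nuevo", "es", 34.05, -118.24, 40),     # Los Angeles, too far away
    Post("Waiting in the iPad queue", "en", 40.73, -73.99, 2),  # too few retweets
]
print([p.text for p in posts if matches(p)])
```

The Python version is a good deal longer than four lines, which is presumably the point of having a dedicated filtering language.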

It would be interesting to develop an academic alternative similar to Archive-It, but for real-time social media tracking.

The Old Bailey Datawarehousing Interface

The latest version of our Old Bailey Datawarehousing Interface is up. This was the Digging Into Data project that got TAPoR, Zotero and Old Bailey working together. One of the things we built was an advanced visualization environment for the Old Bailey. This was programmed by John Simpson following ideas from Joerg Sanders. Milena Radzikowska did the interface design work and I wrote emails.

One feature we have added is the broaDHcast widget that allows projects like Criminal Intent to share announcements. This was inspired partly by the issues of keeping distributed projects like TAPoR, Zotero and Old Bailey informed.

Perlin: Interactive Map of Pride and Prejudice

As I mentioned in my post on the GRAND conference, Ken Perlin showed a number of interesting Java apps that illustrated visual ideas. One was an Interactive Map of Pride and Prejudice. This interactive map is a rich prospect of the whole text which you can move around to see particular parts. You can search for words (or strings) and see where they appear in the text. You can select some text and it searches. The interface is simple and intuitive. You can see how Perlin talks about it in his blog. I also recommend you look at his other experiments.

Prism: Collaborative Interpretation

Prism is the coolest idea I have come across in a long time. Coming from the University of Virginia Scholars’ Lab, Prism is a collaborative interpretation environment. Someone comes up with categories like “Rhetoric”, “Orientalism” and “Social Darwinism” for a text like Notes on the State of Virginia. Then people (with accounts, which you can get freely) go through and mark passages. This creates overlapping interpretative markup of the sort you used to get with COCOA in TACT, but unlike TACT, many people can do the interpretation, so it can be crowdsourced.

They are planning some visualizations of the results including what look like the types of visualizations that TACT gave where you can see words distributed over tagged areas.
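To think through what such visualizations would need, here is a small Python sketch that tallies, word by word, how many readers marked each category. I don't know how Prism actually stores its markings, so the (reader, category, start, end) tuples, the tally_markings function and the sample sentence are all my own assumptions.

```python
from collections import Counter


def tally_markings(markings, n_words):
    """Count, for each word position, how often each category was marked.

    markings: iterable of (reader, category, start, end), with end exclusive.
    """
    per_word = [Counter() for _ in range(n_words)]
    for _reader, category, start, end in markings:
        # Clamp spans to the text so a stray marking cannot raise an IndexError
        for i in range(max(start, 0), min(end, n_words)):
            per_word[i][category] += 1
    return per_word


# A made-up six-word text and three made-up markings by two readers
words = "a made up sentence for marking".split()
markings = [
    ("reader_a", "Rhetoric", 0, 3),
    ("reader_b", "Rhetoric", 1, 5),
    ("reader_b", "Orientalism", 4, 6),
]
for word, counts in zip(words, tally_markings(markings, len(words))):
    print(word, dict(counts))
```

Overlap falls out naturally: a word can carry counts for several categories at once, which is what a distribution display would plot.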

Bethany Nowviskie explains the background to the project in this Scholars’ Lab post.

Robo-Readers Used to Grade Test Essays

A nice story from the New York Times by Michael Winerip, Robo-Readers Used to Grade Test Essays (April 22, 2012), talks about automated essay scoring (AES) software. The story first reports a study from the University of Akron that showed that AES software is comparable to human graders (see A Win for the Robo-Readers by Steve Kolowich from Inside Higher Ed). The NYT story then goes on to report how Les Perelman, a director of writing at MIT, has shown how you can game AES tools. Among other things, they don’t check facts or truth, so you can write all sorts of outrageous things and still get a good score from AES. The story discusses some of the patterns that get good scores, like lexical variety and long sentences. The story ends with the possibility that AES could be matched by essay writing software:

Two former students who are computer science majors told him (Perelman) that they could design an Android app to generate essays that would receive 6’s from e-Rater. He says the nice thing about that is that smartphones would be able to submit essays directly to computer graders, and humans wouldn’t have to get involved.

Particularly interesting is an essay Perelman wrote to show how poor essays can game the system. I wish I could say that I never saw writing like this and that therefore there was no danger of AES systems rewarding the poor writing found in real essays:

In today’s society, college is ambiguous. We need it to live, but we also need it to love. Moreover, without college most of the world’s learning would be egregious. College, however, has myriad costs. One of the most important issues facing the world is how to reduce college costs. Some have argued that college costs are due to the luxuries students now expect. Others have argued that the costs are a result of athletics. In reality, high college costs are the result of excessive pay for teaching assistants.
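To see why an essay like this can score well, it helps to look at how shallow features like lexical variety and sentence length are to compute. The Python sketch below is only my guess at the kind of surface measure involved; it is not how e-Rater or any other AES product works, and the function name and sample text are invented.

```python
import re


def surface_features(essay):
    """Return type-token ratio and mean sentence length (in words) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[a-z']+", essay.lower())
    if not words or not sentences:
        return {"type_token_ratio": 0.0, "mean_sentence_length": 0.0}
    return {
        # Distinct words divided by total words: a crude measure of lexical variety
        "type_token_ratio": len(set(words)) / len(words),
        "mean_sentence_length": len(words) / len(sentences),
    }


# An invented sample, not Perelman's essay
print(surface_features("The cat sat. The cat ran away!"))
```

Nothing in these measures asks whether a sentence is true or even makes sense, which is exactly the gap Perelman exploits.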