Thanks to John, I learned about a gem of a concordance tool for the Mac, PC and Linux called AntConc. It runs on your computer and you can download the tool from the author’s site, Laurence Anthony’s Software. If it is stable it could be a great tool to introduce students to text analysis. Judging from the screenshots, it has some nice features for finding n-grams and can handle a set of texts.
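For students who have not met these terms, here is a minimal Python sketch of what a concordance (keyword in context) and an n-gram count do. It is only an illustration and not how AntConc itself works; the function names `tokenize`, `kwic` and `ngrams` and the sample sentence are my own assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def ngrams(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample text, just for illustration.
sample = "The cat sat on the mat. The cat saw the dog."
tokens = tokenize(sample)

for line in kwic(tokens, "cat", width=2):
    print(line)
print(ngrams(tokens, 2).most_common(1))
```

To handle a set of texts, as AntConc can, you would tokenize each file and merge the resulting counters.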
Category: Text Analysis
Analysis of 250,000 hacker conversations
From Slashdot, a story about the text Analysis of 250,000 hacker conversations. A security company, Imperva, has been analyzing hacker forums to understand trends, how people learn about hacking, and which strategies are popular.
In the Imperva report, Hacker Intelligence Initiative, Monthly Trends Report #5 (PDF) they describe their methodology as “content analysis” (their quotations) but it mostly involves searching for threads and reading. The report has great examples of the types of discussions.
A good example of how simple text analysis can help industry understanding.
Every story has a beginning
Every story has a beginning is the text of a keynote by Tim Sherratt that nicely weaves individual stories together as an example of what we can do with information technology. I highly recommend it; he quotes Steve Ramsay and Tim Hitchcock to the effect that what is important are the stories of individuals like those he paints through the digital archives he has access to. He sets this humanistic view of how we can use the technology against the Culturomics approach, which is trying to turn history and its archives into grist for cultural science. Sherratt calls the culturomic vision “barren” and I tend to agree. He ends by asking,
But who defines the problems?
His answer is Linked Data which “gives us a way to present an alternative to Google’s version of the world. We can argue back against the search engines, defining our own criteria for relevance, and building our own discovery networks.” (And his talk has a link for those who want to view the triples…) I would say that we can also build tools like Voyant (formerly Voyeur, which he uses) to help us begin to tell the stories.
Canadian Writing Research Collaboratory Launch
I am at the Canadian Writing Research Collaboratory (CWRC) launch. CWRC is building a collaborative editing environment that will allow editorial projects to manage the editing of electronic scholarly editions. Among other things, CWRC is developing an online XML editor, editorial workflow management tools, and an integrated repository.
The keynote speakers for the event include Shawna Lemay and Aritha Van Herk.
Happy Words Trump Negativity in the English Language
Happy Words Trump Negativity in the English Language is an interesting story about a study by Kloumann and colleagues on Positivity of the English Language. They used Mechanical Turk to get people to assess whether the high frequency words used in Twitter, Books, the New York Times and Music Lyrics were positive. Their study showed that overwhelmingly English is a positive language. Thanks to Stan for this.
The Fight Over the Future of Digital Books
Dan Cohen has written a good summary of the latest fuss over electronic books, The Fight Over the Future of Digital Books. He explains the latest suit by the Authors Guild against the HathiTrust. This suit is the companion to the suit by the Authors Guild against Google that has still not been resolved.
Old Bailey Trials Are Tabulated for Scholars Online
The New York Times now has an article on the Criminal Intent project I was part of. See, Old Bailey Trials Are Tabulated for Scholars Online. They quote a historian who is sceptical of the results of mining, though he appreciates the resource.
“The Old Bailey Online project has done a great service in making those sources widely (and costlessly) available,” Mr. Langbein wrote in an e-mail. But he complained that the claims about data mining have “a breathless quality: ‘you can expect big things from us,’ but as yet it’s all method and no results.” He said that the new findings belittle the work of a generation of scholars who focused on the 18th century as the turning point in the evolution of the criminal justice system.
Alas, it seems he didn’t read our report, only the summary in the Chronicle. It is easy to use cute phrases like “breathless quality”, but is he right? Time will tell, but I think the historians on our team have backed up the results found with mining and they never belittled the work of previous scholars – we saw ourselves building on it.
What can mining do? I think mining can give you a big picture so that you see the forest rather than the trees in a way that no one could before. Conclusions about the shape of the forest have to be checked against other evidence, but the results of mining are evidence that is not breathless even if it takes your breath away. As Bill Turkel put it,
Mr. Turkel, who developed some of the digital tools, said that data mining reveals unexpected trends and connections that no one would have thought to look for before. Previous scholars “tended to cherry-pick anecdotes without having a sense that it was possible to measure all of that text and treat the whole archive as a single unit,” he said.
Of course, if you then leverage traditional evidence to buttress your argument then the mining is forgotten or trivialized.
The Garden of Error and Decay
The Garden of Error and Decay is a real-time visualization of disasters mentioned in Twitter and other feeds. The text about the interactive says “this innovative moving image format is something like a real-time data driven narrative. This project is not a film, not a game, and not a nonlinear interactive story.” The visualization uses pictograms that represent the type of disaster. You can also see the original Twitter text.
Thanks to Scott for this.
Father Busa is dead
From Humanist I just found out that Father Roberto Busa has died. See Stop the reader, Fr. Busa has died in L’Osservatore Romano (English) or Morto padre Busa, è stato il pioniere dell’informatica linguistica from the Corriere del Veneto (Italian). Father Busa was a pioneer in humanities computing who started a project in the 1940s with help from IBM to create a complete concordance of Aquinas. The Index Thomisticus was arguably the first (big) humanities project to benefit from computing methods. For that reason the author of Stop the reader argues that,
If you surf the Internet, you owe it to him and if you use a PC to write emails and documents, you owe it to him. And if you can read this article, you owe it to him, we owe it to him
While it may be an exaggeration to say that we owe hypertext and the web to Father Busa, he was certainly one of the first to use computers to manipulate texts on a large scale. He saw the
Father Busa was also involved in developing the humanities computing field which is why we have named a prize after him. (See ADHO Roberto Busa Award). He wrote articles for journals like CHUM and Literary and Linguistic Computing. He was generous with his time and ideas. He was influential in Italy; others will know more about this. I met him in 1998 at the ACH/ALLC conference in Debrecen, Hungary where he was awarded the first Busa Award. As I speak Italian I was asked to join an executive dinner and had a pleasant evening talking about his ideas about hermeneutical text analysis which he delivered in his Award talk and which were later published in “Picture a Man …” in Literary and Linguistic Computing (14:1, 1999). At the end of his talk he played with the Cinderella metaphor for interpretative text analysis,
Metaphor is a linguistic phenomenon: when the name of one reality is chosen to signify another and different reality, because of some similarity between the two. I in fact applied the name of Cinderella to hermeneutical informatics, the two having in common youth, health, beauty, and poverty. Cinderella eventually got married to a prince. (p. 8)
Busa was a prince or perhaps a Cinderella who has now left the party.
Crime’s Digital Past – Science News
Tim sent me a link to another news story on the Criminal Intent project that I am part of. This one is in Science News and is titled, Crime’s Digital Past. The article is by Bruce Bower and dated July 30th, 2011 (which, I know, is in the future.) One of the better stories.





