Big Data – Page 17 – Theoreti.ca

IBM to close Many Eyes

I just discovered that IBM is to close Many Eyes. This is a pity. It was a great environment that let people upload data and visualize it in different ways. I blogged about it ages ago (in computer ages anyway). In particular I liked their Word Tree, which seems one of the best ways to explore language use.

It seems that some of the programmers moved on and that IBM is now focusing on Watson Analytics.

Where Probability Meets Literature and Language: Markov Models for Text Analysis

3quarksdaily, one of my favourite sites to read, just posted a very nice essay by Sanjukta Paul on Where Probability Meets Literature and Language: Markov Models for Text Analysis. The essay starts with Markov, who in the 19th century was doing linguistic analysis by hand, and goes on to authorship attribution by people like Fiona Tweedie (the image above is from a study she co-authored). It also explains Markov models along the way.
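For readers who want a feel for what a Markov model of text is, here is a minimal sketch in Python. It is not taken from the essay: the function names and the sample sentence are my own illustrative assumptions. It estimates a first-order model, that is, the probability of each word given only the word before it.

```python
from collections import Counter, defaultdict


def transition_counts(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    # Pair every word with its successor: (w0, w1), (w1, w2), ...
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def transition_probabilities(counts):
    """Turn raw follower counts into conditional probabilities."""
    probabilities = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        probabilities[word] = {nxt: n / total for nxt, n in followers.items()}
    return probabilities


# Hypothetical sample text, chosen only to keep the numbers easy to check.
sample = "the cat sat on the mat and the cat slept"
probs = transition_probabilities(transition_counts(sample))
print(probs["the"])
```

Authorship attribution of the kind the essay describes compares such transition tables across texts; this sketch only shows how one table is built.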

Paolo Sordi: I blog therefore I am

On the ethos of digital presence: I participated today in a panel launching the Italian version of Paolo Sordi’s book I Am: Remix Your Web Identity. (The Italian title is Bloggo Con WordPress Dunque Sono.) The panel included people like Domenico Fiormonte, Luisa Capelli, Daniela Guardamangna, Raul Mordenti, and, of course, Paolo Sordi.

LOTRProject: Visualizing the Lord of the Rings

Emil Johansson, a student in Gothenburg, has created a fabulous site called the LOTRProject (or Lord Of The Rings Project). The site provides different types of visualizations about Tolkien’s world (Silmarillion, Hobbit, and LOTR), from maps to family trees to character mentions (see image above).

Is it Pokemon or Big Data ?

Is it Pokemon or Big Data ? is a simple game where you are presented with a name and you have to guess if it is a big data company or a Pokemon creature. My thanks to Jane for this.

NSA to shut down bulk phone surveillance program by Sunday

NSA to shut down bulk phone surveillance program by Sunday. A first step.

Literary Analysis and the Wolfram Language

Lately I’ve been trying Wolfram Mathematica more and more for analytics. I was introduced to Mathematica by Bill Turkel and Ian Graham, who have done some impressive stuff with it. Bill Turkel has now created an open access, open content, and open source textbook, Digital Research Methods with Mathematica. The text is itself a Mathematica notebook so, if you have Mathematica, you can actually use the text to do analytics on the spot.

Wolfram has also posted an interesting blog entry on Literary Analysis and the Wolfram Language: Jumping Down a Reading Rabbit Hole. They show how you can generate word clouds and sentiment analysis graphs easily.

While I am still learning Mathematica, some of the features that make it attractive include:

It uses a “literate programming” model where you write notebooks meant to be read by humans with embedded code rather than writing code with awkward comments embedded.
It has a lot of convenient Web, Language, and Visualization functions that let you do things we want to do in the digital humanities.
You can call on Wolfram Alpha in a notebook to get real world knowledge like capital cities or maps or language information.

Text Mining The Novel 2015

On Thursday and Friday (Oct. 22nd and 23rd) I was at the 2nd workshop for the Text Mining the Novel project. My conference notes are here: Text Mining The Novel 2015. We had a number of great papers on the issue of genre (this year’s topic). Here are some general reflections:

The obvious weakness of text mining is that it operates on the novel as text, specifically digital text (or string.) We need to find ways to also study the novel as material object (thing), as a social object, as a performance (of the reader), and as an economic object in a market place. Then we also have to find ways to connect these.
So many analytical and mining processes depend on bags of words, from dictionaries to topics. Is this a problem or a limitation? Can we try to abstract characters, plot, or argument?
I was interested in the philosophical discussions around the epistemological in novels and philosophical claims about language and literature.
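The bag-of-words limitation raised in these reflections can be illustrated with a small Python sketch. The helper name and the two sentences are my own assumptions, not anything presented at the workshop. Because a bag of words keeps only word counts, two sentences with opposite meanings can look identical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Reduce a text to word counts, discarding word order entirely."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical example sentences: same words, opposite agents.
first = bag_of_words("The dog bit the man.")
second = bag_of_words("The man bit the dog.")

# Who did what to whom is lost once order is thrown away.
print(first == second)
```

Anything like character, plot, or argument depends on exactly the ordering information that this representation discards, which is why abstracting them needs something beyond word counts.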

Data Management Plan Recommendation

Today I deposited a Data Management Plan Recommendation for Social Science and Humanities Funding Agencies (http://hdl.handle.net/10402/era.42201) in our institutional repository ERA. This report/recommendation was written by Sonja Sapach with help from me and Catherine Middleton. We recommended that:

Agencies that fund social science and humanities (SSH) research should move towards requiring a Data Management Plan (DMP) as part of their application processes in cases where research data will be gathered, generated, or curated. In developing policies, funding agencies should consult the community on the values of stewardship and research that would be strengthened by requiring DMPs. Funding agencies should also gather examples and data about reuse of archived data in the social sciences and humanities and encourage due diligence among researchers to make themselves aware of reusable data.

On the surface the recommendation seems rather bland. SSHRC has required the deposit of research data they fund for decades. The problem, however, is that few of us pay attention because it is one more thing to do, and something that shares hard-won data with others that you may want to continue milking for research. What we lack is a culture of thinking of the deposit of research data as a scholarly contribution the way the translation and edition of important cultural texts is. We need a culture of stewardship as a TC3+ (tri-council) document put it. See Capitalizing on Big Data: Toward a Policy Framework for Advancing Digital Scholarship in Canada (PDF).

Given the potential resistance of colleagues, it is important that we understand the arguments for requiring planning around data management, and that is one of the things we do in this report. Another issue is how to effectively require, at the funding proposal stage, something (like a Data Management Plan) that shows how the researchers are thinking through the issue. To that end we document the approaches of other funding bodies. The point is that this is not actually that new and some research communities are further ahead.

At the end of the day, what we really need is a recognition that depositing data so that it can be used by other researchers is a form of scholarship. Such scholarship can be assessed like any other scholarship. What is the data deposited and what is its quality? How is the data deposited? How is it documented? Can it have an impact?

You can find this document also at Catherine Middleton’s web site and Sonja Sapach’s web site.