Big Buzz about Big Data: Does it really have to be analyzed?

The Guardian has a story by John Burn-Murdoch, Study: less than 1% of the world’s data is analysed, over 80% is unprotected.

This Guardian article reports on a Digital Universe Study which found that the “global data supply reached 2.8 zettabytes (ZB) in 2012” and that “just 0.5% of this is used for analysis”. The industry study emphasizes that the promise of “Big Data” is in its analysis,

First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and so on. However, even with a generous estimate, the amount of information in the digital universe that is “tagged” accounts for only about 3% of the digital universe in 2012, and that which is analyzed is half a percent of the digital universe. Herein is the promise of “Big Data” technology — the extraction of value from the large untapped pools of data in the digital universe. (p. 3)

I can’t help wondering if industry studies aren’t trying to stampede us into thinking that there is lots of money to be made in analytics. These studies often seem to come from the entities that benefit from investment in analytics. What if the value of Big Data turns out to be in getting people to buy into analytical tools and services (or be left behind)? Has there been any critical analysis (as opposed to anecdotal evidence) of whether analytics really do warrant the effort? A good article I came across on the need for analytical criticism is Trevor Butterworth’s Goodbye Anecdotes! The Age of Big Data Demands Real Criticism. He starts with,

Every day, we produce 2.5 exabytes of information, the analysis of which will, supposedly, make us healthier, wiser, and above all, wealthier—although it’s all a bit fuzzy as to what, exactly, we’re supposed to do with 2.5 exabytes of data—or how we’re supposed to do whatever it is that we’re supposed to do with it, given that Big Data requires a lot more than a shiny MacBook Pro to run any kind of analysis.

Of course the Digital Universe Study is not only about the opportunities for analytics. It also points out:

  • That data security is going to become more and more of a problem
  • That more and more data is coming from emerging markets
  • That we could get a lot more useful analysis done if there were more metadata (tagging), especially at the source. They are calling for more intelligence in the gathering devices – the surveillance cameras, for example. They could add metadata at the point of capture like time, place, and then stuff like whether there are faces.
  • That the promising types of data that could generate value start with surveillance and medical data.

Reading about Big Data I also begin to wonder what it is. Fortunately IDC (who are behind the Digital Universe Study) have a definition,

Last year, Big Data became a big topic across nearly every area of IT. IDC defines Big Data technologies as a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis. There are three main characteristics of Big Data: the data itself, the analytics of the data, and the presentation of the results of the analytics. Then there are the products and services that can be wrapped around one or all of these Big Data elements. (p. 9)

Big Data is not really about data at all. It is about technologies and services. It is about the opportunity that comes with “a big topic across nearly every area of IT.” Big Data is more like Big Buzz. Now we know what follows Web 2.0 (and it was never going to be Web 3.0.)

For a more academic and interesting perspective on Big Data I recommend (following Butterworth) Martin Hilbert’s “How much information is there in the ‘information society’?” (Significance, 9:4, 8-12, 2012.) One of the more interesting points he makes is the growing importance of text,

Despite the general perception that the digital age is synonymous with the proliferation of media-rich audio and videos, we find that text and still images capture a larger share of the world’s technological memories than they did before. In the early 1990s, video represented more than 80% of the world’s information stock (mainly stored in analogue VHS cassettes) and audio almost 15% (on audio cassettes and vinyl records). By 2007, the share of video in the world’s storage devices had decreased to 60% and the share of audio to merely 5%, while text increased from less than 1% to a staggering 20% (boosted by the vast amounts of alphanumerical content on internet servers, hard disks and databases.) The multimedia age actually turns out to be an alphanumeric text age, which is good news if you want to make life easy for search engines. (p. 9)

One of the points that Hilbert makes that would support the importance of analytics is that our capacity to store data is catching up with the amount of data broadcast and communicated. In other words we are getting closer to being able to store most of what is broadcast and communicated. Even more dramatic is the growth in computation. In short, available computation is growing faster than storage, and storage faster than transmission. With excess comes experimentation, and with excess computation and storage, why not experiment with what is communicated? We are, after all, all humanists who are interested primarily in ourselves. The opportunity to study ourselves in real time is too tempting to give up. There may be little commercial value in the Big Reflection, but that doesn’t mean it isn’t the Big Temptation. The Delphic oracle told us to Know Thyself and now we can in a new way. Perhaps it would be more accurate to say that the value in Big Data is in our narcissism. The services that will do well are those that feed our Big Desire to know more and more (recently) about ourselves, both individually and collectively. Privacy will be trumped by the desire for analytic celebrity where you become your own spectacle.

This could be good news for the humanities. I’m tempted to announce that this will be the century of the BIG BIG HUMAN. With Big Reflection we will turn on ourselves and consume more and more about ourselves. The humanities could claim that we are the disciplines that reflect on the human and analytics are just another practice for doing so, but to do so we might have to look at what is written in us or start writing in DNA.

In 2007, the DNA in the 60 trillion cells of one single human body would have stored more information than all of our technological devices together. (Hilbert, p. 11)

Reacting to the Past

Reacting to the Past is the name of a set of games designed to get students thinking about historical moments. Students play out the games that are set in the past and use texts to inform their play. The instructor then just facilitates the class and grades their work. It reminds me of role playing events like the Model United Nations, but with history and ideas being modeled. Now I have to find a workshop to go to to learn more because the materials are behind a password.

Sample on Randomness

Mark Sample has posted his gem of an MLA paper on An Account of Randomness in Literary Computing. I wish I could write papers quite so clear and evocative. He brings interesting historical examples to a question that crosses all sorts of disciplines – that of randomness. He shows how the importance of randomness connects to poetic experiments in computing.

I would recommend reading the article immediately, but I discovered, as with many good works, I ended up spending a lot of time following up the links and reading stuff on sites like the MIT 150 Exhibition which has a section on Analog/Digital MIT with online exhibits on subjects like the MIT Project Athena and the TX-0. Instead I will warn – beware of reading interesting things!

Lack of guidelines create ethical dilemmas in social network-based research

e! Science News has a story about an article in Science about how a Lack of guidelines create ethical dilemmas in social network-based research.

The full article by Shapiro and Ossorio, Regulation of Online Social Network Studies, can be found in the 11 January 2013 issue of Science (Vol. 339 no. 6116, pp. 144-45.)

The Internet has been a godsend for all sorts of research as it lets us scrape large amounts of data representing discourse about a subject without having to pay for interviews or other forms of data gathering. It has been a boon for those of us using text analysis or those in computational linguistics. At the same time, much of what we gather can be written by people that we would not be allowed to interview without careful ethics review. Vulnerable people and youth can leave a trail of information on the Internet that we wouldn’t normally be allowed to gather directly without careful protections.

I participated many years ago in a symposium on this issue. The case we were considering involved scraping breast cancer survivor blogs. In addition to the issue of the vulnerability of the authors we discussed whether they understood that their blogs were public, or if they considered posting on a blog like talking to a friend in a public space. At the time it seemed that many bloggers didn’t realize how they could be searched, found, and scraped. A final issue discussed was the veracity of the blogs. How would a researcher know they were actually reading a blog by a cancer survivor? How would they know the posts were authentic without being able to question the writer? Like all symposia we left with more questions than answers.

In the end an ethics board authorized the study. (I was on neither the study nor the board – just part of a symposium to discuss the issue.)

Digital Humanities Pedagogy: Practices, Principles and Politics

Open Book Publishers has just published Digital Humanities Pedagogy: Practices, Principles and Politics online. Stéfan Sinclair and I have two chapters in the collection, one on “Acculturation and the Digital Humanities Community” and one on “Teaching Computer-Assisted Text Analysis.”

The Acculturation chapter sets out the ways in which we try to train students by involving them in project teams rather than only through courses. This approach I learned watching Jerome McGann and Johanna Drucker at the University of Virginia. My goal has always been to create the sort of project culture they did (and that the Scholar’s Lab now continues.)

The editor Brett D. Hirsch deserves a lot of credit for gently seeing this through.

MLA 2013 Conference Notes

I’ve just posted my MLA 2013 convention notes on philosophi.ca (my wiki). I participated in a workshop on getting started with DH organized by DHCommons, gave a paper on “thinking through theoretical things”, and participated in a panel on “Open Sesame” (interoperability for literary study.)

The sessions seemed full, even the theory one which started at 7pm! (MLA folk are serious about theorizing.)

At the convention the MLA announced and promoted a new digital MLA Commons. I’ve been poking around and trying to figure out what it will become. They say it is “a developing network linking members of the Modern Language Association.” I’m not sure I need one more venue to link to people, but it could prove an important forum if promoted.

Tasman: Literary Data Processing

I came across a 1957 article by an IBM scientist, P. Tasman, on the methods used in Roberto Busa’s Index Thomisticus project, with the title Literary Data Processing (IBM Journal of Research and Development, 1(3): 249-256.) The article, which is in the third issue of the IBM Journal of Research and Development, has an illustration of how they used punch cards for this project.

Image of Punch Card

While the reproduction is poor, you can read the things encoded on the card for each word:

  • Location in text
  • Special reference mark
  • Word
  • Number of word in text
  • First letter of preceding word
  • First letter of following word
  • Form card number
  • Entry card number
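To get a feel for what one card per word means, here is a rough sketch in Python of such a record. It is my own illustration, not a reconstruction of the actual card layout: the class name `WordCard`, the function `make_cards`, and all the field names are hypothetical, and I am reading "number of word in text" as the word's running position, which is an assumption. The two card-number fields are left empty because the list above does not say how they were assigned.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordCard:
    """One record per word, loosely mirroring the fields listed above (names are hypothetical)."""
    location: str                 # location in text (here just a label passed in by the caller)
    reference_mark: str           # special reference mark (left blank in this sketch)
    word: str                     # the word itself
    number_in_text: int           # assumed to mean the word's running position, starting at 1
    prev_initial: str             # first letter of preceding word ("" for the first word)
    next_initial: str             # first letter of following word ("" for the last word)
    form_card_number: Optional[int] = None   # not modelled here
    entry_card_number: Optional[int] = None  # not modelled here


def make_cards(text: str, location: str) -> List[WordCard]:
    """Split a passage on whitespace and build one card per word."""
    words = text.split()
    cards = []
    for i, word in enumerate(words):
        prev_initial = words[i - 1][0] if i > 0 else ""
        next_initial = words[i + 1][0] if i + 1 < len(words) else ""
        cards.append(WordCard(location, "", word, i + 1, prev_initial, next_initial))
    return cards
```

Calling `make_cards("in principio erat verbum", "Jn 1:1")` gives four cards, each carrying a little of its neighbours' context. Even this toy version suggests why the first letters of the adjacent words were worth recording: they let you sort and check cards without re-reading the text.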

At the end Tasman speculates on how these methods developed on the project could be used in other areas:

Apart from literary analysis, it appears that other areas of documentation such as legal, chemical, medical, scientific, and engineering information are now susceptible to the methods evolved. It is evident, of course, that the transcription of the documents in these other fields necessitates special sets of ground rules and codes in order to provide for information retrieval, and the results will depend entirely upon the degree and refinement of coding and the variety of cross referencing desired.

The indexing and coding techniques developed by this method offer a comparatively fast method of literature searching, and it appears that the machine-searching application may initiate a new era of language engineering. It should certainly lead to improved and more sophisticated techniques for use in libraries, chemical documentation, and abstract preparation, as well as in literary analysis.

Busa’s project may have been more than just the first humanities computing project. It seems to be one of the first projects to use computers in handling textual information and a project that showed the possibilities for searching any sort of literature. I should note that in the issue after the one in which Tasman’s article appears there is an article by H. P. Luhn (developer of the KWIC index) on A Statistical Approach to Mechanized Encoding and Searching of Literary Information. (IBM Journal of Research and Development 1(4): 309-317.) Luhn specifically mentions the Tasman article and the concording methods developed as being useful to the larger statistical text mining that he proposes. For IBM researchers Busa’s project was an important first experiment handling unstructured text.
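For readers who have not met a KWIC (keyword in context) display, the idea is simple enough to sketch in a few lines of Python. This is only a minimal illustration of the general technique and makes no claim to follow Luhn's or Busa's procedures; the function name `kwic`, the `width` parameter, and the bracket notation are my own choices.

```python
from typing import List


def kwic(text: str, keyword: str, width: int = 3) -> List[str]:
    """Return one line per occurrence of keyword, with up to `width` words of context on each side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively and ignore trailing/leading punctuation on the token.
        if word.lower().strip(".,;:!?") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines
```

Running `kwic("the cat sat on the mat", "the", width=2)` returns one line for each of the two occurrences of "the", with the keyword in brackets and its neighbours on either side. A concordance is, roughly, this repeated for every word in a text and sorted.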

I learned about the Tasman article in a journal paper deposited by Thomas Nelson Winter on Roberto Busa, S.J., and the Invention of the Machine-Generated Concordance. The paper gives an excellent account of Busa’s project and its significance to concording. Well worth the read!

Digital Humanities Talks at the 2013 MLA Convention

The ACH has put together a useful Guide to Digital-Humanities Talks at the 2013 MLA Convention. I will be presenting at various events including:

GAME THEORY in the NYTimes

Just in time for Christmas, the New York Times has started an interesting ArtsBeat Blog called GAME THEORY. It is interesting that this multi-authored blog is in the “Arts Beat” area as opposed to under the Technology tab where most of the game stories are. Game Theory seems to want to take a broader view of games and culture, as the second post on Caring About Make-Believe Body Counts illustrates. This post begins by addressing the other blog columnists (as if this were a dialogue) and then turns to Wayne LaPierre’s speech about how to deal with the Connecticut school killings, which blames, among other things, violent games. The column then looks at the discourse around violence in games, including voices within the gaming industry that were critical of ultraviolence.

Those familiar with games who debate the medium’s violence now commonly assume that games may have become too violent. But they don’t assume that games should be free of violence. That is because of fake violence’s relationship with interactivity, which is a defining element of video games.

Stephen Totilo ends the column with his list of the best games of 2012 which includes Super Hexagon, Letterpress, Journey, Dys4ia, and Professor Layton and the Miracle Mask.

As I mentioned above, the blog column has a dialogical side with authors addressing each other. It also brings culture and game culture together, which reminds me of McLuhan, who argued that games reflect society, providing a form of catharsis. This column promises to theorize culture through the lens of games rather than just theorize games.