Big Buzz about Big Data: Does it really have to be analyzed?

The Guardian has a story by John Burn-Murdoch titled Study: less than 1% of the world’s data is analysed, over 80% is unprotected.

The article draws on a Digital Universe Study which found that the “global data supply reached 2.8 zettabytes (ZB) in 2012” and that “just 0.5% of this is used for analysis”. The industry study emphasizes that the promise of “Big Data” lies in its analysis:

First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and so on. However, even with a generous estimate, the amount of information in the digital universe that is “tagged” accounts for only about 3% of the digital universe in 2012, and that which is analyzed is half a percent of the digital universe. Herein is the promise of “Big Data” technology — the extraction of value from the large untapped pools of data in the digital universe. (p. 3)
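To make the study’s percentages concrete, here is a quick back-of-the-envelope calculation using only the figures quoted above (2.8 ZB total supply, 0.5% analyzed, roughly 3% tagged, 25% potentially valuable today):

```python
# Figures quoted from the Digital Universe Study (2012).
TOTAL_ZB = 2.8          # global data supply, in zettabytes
ZB_TO_EB = 1000         # 1 zettabyte = 1000 exabytes

analyzed_eb = TOTAL_ZB * 0.005 * ZB_TO_EB   # 0.5% actually analyzed
tagged_eb = TOTAL_ZB * 0.03 * ZB_TO_EB      # ~3% tagged with metadata
valuable_eb = TOTAL_ZB * 0.25 * ZB_TO_EB    # 25% potentially valuable today

print(f"Analyzed: {analyzed_eb:.0f} EB")    # 14 EB
print(f"Tagged:   {tagged_eb:.0f} EB")      # 84 EB
print(f"Valuable: {valuable_eb:.0f} EB")    # 700 EB
```

So for all the talk of zettabytes, the analyzed slice in 2012 comes to roughly 14 exabytes.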

I can’t help wondering if industry studies aren’t trying to stampede us into thinking that there is lots of money to be made in analytics. These studies often seem to come from the very entities that benefit from investment in analytics. What if the value of Big Data turns out to be in getting people to buy into analytical tools and services (or be left behind)? Has there been any critical analysis (as opposed to anecdotal evidence) of whether analytics really warrant the effort? A good article I came across on the need for analytical criticism is Trevor Butterworth’s Goodbye Anecdotes! The Age of Big Data Demands Real Criticism. He starts with,

Every day, we produce 2.5 exabytes of information, the analysis of which will, supposedly, make us healthier, wiser, and above all, wealthier—although it’s all a bit fuzzy as to what, exactly, we’re supposed to do with 2.5 exabytes of data—or how we’re supposed to do whatever it is that we’re supposed to do with it, given that Big Data requires a lot more than a shiny MacBook Pro to run any kind of analysis.

Of course the Digital Universe Study is not only about the opportunities for analytics. It also points out:

  • That data security is going to become more and more of a problem
  • That more and more data is coming from emerging markets
  • That we could get a lot more useful analysis done if there were more metadata (tagging), especially at the source. The study calls for more intelligence in the gathering devices – surveillance cameras, for example – which could add metadata at the point of capture, like time, place, and whether there are faces in the frame.
  • That the promising types of data that could generate value start with surveillance and medical data.

Reading about Big Data I also begin to wonder what it is. Fortunately IDC (who are behind the Digital Universe Study) have a definition:

Last year, Big Data became a big topic across nearly every area of IT. IDC defines Big Data technologies as a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis. There are three main characteristics of Big Data: the data itself, the analytics of the data, and the presentation of the results of the analytics. Then there are the products and services that can be wrapped around one or all of these Big Data elements. (p. 9)

Big Data is not really about data at all. It is about technologies and services. It is about the opportunity that comes with “a big topic across nearly every area of IT.” Big Data is more like Big Buzz. Now we know what follows Web 2.0 (and it was never going to be Web 3.0.)

For a more academic and interesting perspective on Big Data I recommend (following Butterworth) Martin Hilbert’s “How much information is there in the ‘information society’?” (Significance, 9:4, 8-12, 2012.) One of the more interesting points he makes is the growing importance of text,

Despite the general perception that the digital age is synonymous with the proliferation of media-rich audio and videos, we find that text and still images capture a larger share of the world’s technological memories than they did before. In the early 1990s, video represented more than 80% of the world’s information stock (mainly stored in analogue VHS cassettes) and audio almost 15% (on audio cassettes and vinyl records). By 2007, the share of video in the world’s storage devices had decreased to 60% and the share of audio to merely 5%, while text increased from less than 1% to a staggering 20% (boosted by the vast amounts of alphanumerical content on internet servers, hard disks and databases.) The multimedia age actually turns out to be an alphanumeric text age, which is good news if you want to make life easy for search engines. (p. 9)

One of the points that Hilbert makes that would support the importance of analytics is that our capacity to store data is catching up with the amount of data broadcast and communicated. In other words, we are getting closer to being able to store most of what is broadcast and communicated. Even more dramatic is the growth in computation. In short, available computation is growing faster than storage, and storage faster than transmission. With excess comes experimentation, and with excess computation and storage, why not experiment with what is communicated? We are, after all, all humanists who are interested primarily in ourselves. The opportunity to study ourselves in real time is too tempting to give up. There may be little commercial value in the Big Reflection, but that doesn’t mean it isn’t the Big Temptation. The Delphic oracle told us to Know Thyself and now we can in a new way. Perhaps it would be more accurate to say that the value in Big Data is in our narcissism. The services that will do well are those that feed our Big Desire to know more and more about ourselves, both individually and collectively. Privacy will be trumped by the desire for analytic celebrity where you become your own spectacle.

This could be good news for the humanities. I’m tempted to announce that this will be the century of the BIG BIG HUMAN. With Big Reflection we will turn on ourselves and consume more and more about ourselves. The humanities could claim that we are the disciplines that reflect on the human and analytics are just another practice for doing so, but to do so we might have to look at what is written in us or start writing in DNA.

In 2007, the DNA in the 60 trillion cells of one single human body would have stored more information than all of our technological devices together. (Hilbert, p. 11)

MLA 2013 Conference Notes

I’ve just posted my MLA 2013 convention notes on philosophi.ca (my wiki). I participated in a workshop on getting started with DH organized by DHCommons, gave a paper on “thinking through theoretical things”, and participated in a panel on “Open Sesame” (interoperability for literary study.)

The sessions seemed full, even the theory one which started at 7pm! (MLA folk are serious about theorizing.)

At the convention the MLA announced and promoted a new digital MLA Commons. I’ve been poking around and trying to figure out what it will become. They say it is “a developing network linking members of the Modern Language Association.” I’m not sure I need one more venue to link to people, but it could prove an important forum if promoted.

GAME THEORY in the NYTimes

Just in time for Christmas, the New York Times has started an interesting ArtsBeat blog called GAME THEORY. It is interesting that this multi-authored blog is in the “Arts Beat” area as opposed to under the Technology tab where most of the game stories are. Game Theory seems to want to take a broader view of games and culture, as the second post on Caring About Make-Believe Body Counts illustrates. This post opens by addressing the other blog columnists (as if this were a dialogue) and then turns to Wayne LaPierre’s speech about how to deal with the Connecticut school killings, which blames, among other things, violent games. The column then looks at the discourse around violence in games, including voices within the gaming industry that were critical of ultraviolence.

Those familiar with games who debate the medium’s violence now commonly assume that games may have become too violent. But they don’t assume that games should be free of violence. That is because of fake violence’s relationship with interactivity, which is a defining element of video games.

Stephen Totilo ends the column with his list of the best games of 2012 which includes Super Hexagon, Letterpress, Journey, Dys4ia, and Professor Layton and the Miracle Mask.

As I mentioned above, the blog column has a dialogical side, with authors addressing each other. It also brings culture and game culture together, which reminds me of McLuhan, who argued that games reflect society, providing a form of catharsis. This column promises to theorize culture through the lens of games rather than just theorize games.

Short Guide To Evaluation Of Digital Work

The Journal of Digital Humanities has republished my Short Guide to Evaluation of Digital Work as part of an issue on Closing the Evaluation Gap (Vol. 1, No. 4). I first wrote the piece for my wiki and you can find the old version here. It is far more useful bundled with the other articles in this issue of JDH.

The JDH is a welcome experiment in peer-reviewed republication. One thing they do is select content that has been published in other forms (blogs, online essays and so on) and then edit it for recombination in a thematic issue. The JDH builds on the neat Digital Humanities Now that showcases interesting work on the web. Both are projects of the Roy Rosenzweig Center for History and New Media. The CHNM deserves credit for thinking through what we can do with the openness of the web.

Clay Shirky: Napster, Udacity, and the Academy

Clay Shirky has a good essay on Napster, Udacity, and the Academy on his blog which considers who will be affected by MOOCs. He makes a number of interesting points:

  • A number of the changes that Internet has facilitated involved unbundling services that were bundled in other media. He gives the example of individual songs being unbundled from albums, but he could also have mentioned how classifieds have been unbundled from newspapers. Likewise MOOCs (Massive Open Online Courses), like the Introduction to Artificial Intelligence run by Peter Norvig and Sebastian Thrun at Stanford, unbundle the course from the university and certification.
  • University lectures are inefficient, a poor way of teaching, and often not of the highest quality. Chances are there are better video lectures online for any established subject than what is offered locally. If universities fall into the trap of saving money by expanding class sizes until higher education is just a series of lectures and exams, then we can hardly claim to offer higher quality than MOOCs. Why would students in Alberta want to listen to me lecture when they could have someone from Harvard?
  • MOOCs are far more likely to threaten the B colleges than the elite liberal arts colleges. A MOOC is not a small seminar experience for a top student and doesn’t compete with the high end. MOOCs compete with lectures (why not have the best lecturer) and other passive learning approaches. MOOCs will threaten the University of Phoenix and other online programs that are not doing such a good job at retention anyway.
  • MOOCs are great marketing for the elite universities which is why they may thrive even if there is no financial model or evaluation.
  • The openness is important to MOOCs. Shirky gives the example of a Statistics 101 course that was improved by open criticism. By contrast most live courses aren’t open to peer evaluation. Instead they are treated like confidential instructor-patient interactions.

While I agree with much of what Shirky says, and I welcome MOOCs, I’m not convinced they will have the revolutionary effect some think they will. I remember seeing the empty television frames at Scarborough College from when they thought teleducation was going to be the next thing. When it comes to education we seem to forget over and over that there is a rich history of distance education experiments. Shirky writes, “In the academy, we lecture other people every day about learning from history. Now it’s our turn…” but I don’t see evidence that he has done any research into the history of education. Instead Shirky adopts a paradigm-shift rhetoric comparing MOOCs to Napster in their potential for disruption, as if that were history. We could just as easily compare them to the explosion of radio experiments between the wars (that disappeared by 1940.) Just how would we learn from history? What history is relevant here? Shirky is unconvincing in his choice of Napster as the relevant lesson.

Another issue I have is epistemological – I just don’t think MOOCs are that different from a how-to book or learning video when it comes to the delivery of knowledge. Anyone who wants to learn something in the West has a wealth of choices and MOOCs, if well designed, are one more welcome choice, but revolutionary they are not. The difficult issues around education don’t have to do with quality resources, but with time (for people trying to learn while holding down a job), motivation (to keep at it), interaction (to learn from mistakes) and certification (to explain what you know to others).

Now it’s my turn to learn from history. I propose these possible lessons:

  • Unbundling will have an effect on the university, especially as costs escalate faster than inflation. We cannot expect society to pay at this escalating rate, especially with the cost of health care eating into budgets. Right now what seems to be unbundled is the professoriate from teaching, as more and more teaching is done by sessionals. Do we really want to leave experiments in unbundling exclusively to others, or are we willing to take responsibility for experimenting ourselves?
  • One form of unbundling that we should experiment with more is unbundling the course from the class. Universities are stuck in the formula of a course as 12 or 13 weeks of classes on campus. Classes are the easiest way for us to run courses, as we have a massive investment in infrastructure, but they aren’t necessarily the most convenient for students or the subject matter. For graduate programs especially we should be experimenting with hybrid delivery models.
  • Universities may very well end up not being the primary way people get their post-secondary education. Universities may continue as elite institutions leaving it to professional organizations, colleges and distance education institutions to handle the majority of students.
  • Someone is going to come up with a reputable certification process for students who want to learn using a mix of books, study groups, MOOCs, college courses and so on. Imagine if a professional organization like the Chartered Accountants of Canada started offering a robust certification process that was independent of any university degree. For a fee you could take a series of tests and practicums that, if passed, would get you provincial certification to practice.
  • The audience for MOOCs is global, not local. MOOCs may be a gift we in wealthier countries can give to the educationally underserved around the world. Openly assessed MOOCs could supplement local learning and become a standard against which people could compare their own courses. On the other hand we could end up with an Amazon of education where one global entity drives out all but the elite educational institutions (which use it to stay elite.) Will we find ourselves talking about educational security (a nation needing its own educational system) and learning local (not taking courses from people who live more than 100 km away)?
  • We should strive for a wiki model for OOCs, where they are not the marketing tools of elite institutions but are maintained by the community.

In sum, we should welcome any new idea for learning, including MOOCs. We should welcome OOCs as another way of learning that may suit many. We should try developing OOCs (the M part we can leave to Stanford) and assess them. We should be open to different configurations of learning and not assume that how we do things now has any special privilege.

20 Years Of Texting

It has apparently been 20 years since the first text message was sent, according to stories like this one, 20 Years Of Texting: The Rise And Fall Of LOL from Business Insider.

 The first text message was sent on 3 December 1992, when the 22-year-old British engineer Neil Papworth used his computer to wish a “Merry Christmas” to Richard Jarvis, of Vodafone, on his Orbitel 901 mobile phone. Papworth didn’t get a reply because there was no way to send a text from a phone in those days. That had to wait for Nokia’s first mobile phone in 1993.

What is interesting is that texting is declining. The FT reports a “steep drop in festive Christmas and New Year text messaging this year…”. With smartphones that can do email, messaging apps, and plans that make it affordable to call, we have more and more choices. Soon l33t will become an endangered language.

Hype Cycle from Gartner Inc.

Gartner has an interesting Hype Cycle Research methodology that is based on a visualization.

When new technologies make bold promises, how do you discern the hype from what’s commercially viable? And when will such claims pay off, if at all? Gartner Hype Cycles provide a graphic representation of the maturity and adoption of technologies and applications, and how they are potentially relevant to solving real business problems and exploiting new opportunities.

The method assumes a cycle that new technologies pass through:

  • Technology Trigger
  • Peak of Inflated Expectations
  • Trough of Disillusionment
  • Slope of Enlightenment
  • Plateau of Productivity
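As a trivial sketch (my own illustration, not anything from Gartner), the five phases can be treated as an ordered sequence, so that for any phase you can ask where a technology goes next:

```python
# The five phases of a Gartner Hype Cycle, in order.
PHASES = [
    "Technology Trigger",
    "Peak of Inflated Expectations",
    "Trough of Disillusionment",
    "Slope of Enlightenment",
    "Plateau of Productivity",
]

def next_phase(phase):
    """Return the phase that follows, or None once the plateau is reached."""
    i = PHASES.index(phase)
    return PHASES[i + 1] if i + 1 < len(PHASES) else None

print(next_phase("Peak of Inflated Expectations"))  # Trough of Disillusionment
```

The point the visualization makes is that the path from the peak to the plateau runs through disillusionment, not around it.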

Here is an example from Wikipedia:

[Hype Cycle chart from Wikipedia]

Pundit: A novel semantic web annotation tool

Susan pointed me to Pundit: A novel semantic web annotation tool. Pundit (which has a great domain name “thepund.it”) is an annotation tool that lets people create and share annotations on web materials. The annotations are triples that can be saved and linked into DBpedia and so on. I’m not sure I understand how it works entirely, but the demo is impressive. It could be the killer-app of semantic web technologies for the digital humanities.
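Though I can’t speak to Pundit’s internals, the core idea of a semantic annotation is easy to sketch: a subject-predicate-object triple linking a fragment of a web page to an entity like a DBpedia resource. The URIs below are purely illustrative, not Pundit’s actual vocabulary:

```python
# A minimal semantic annotation: a (subject, predicate, object) triple.
# All URIs here are made up for illustration only.
annotation = (
    "http://example.org/article#paragraph-3",        # the annotated web fragment
    "http://example.org/vocab#mentions",             # the relation being asserted
    "http://dbpedia.org/resource/Marshall_McLuhan",  # a linked-data entity
)

subject, predicate, obj = annotation
print(f"{subject}\n  --{predicate}-->\n    {obj}")
```

Because the object is a linked-data URI rather than free text, annotations from many users can be merged and queried together, which is presumably what makes sharing them across the web interesting.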

Goodbye Minitel

The French have pulled the plug on Minitel, the videotex service that was introduced in 1982, 30 years ago. I remember seeing my first Minitel terminal in France, where I lived briefly in 1982-83. I wish I could say I understood it at the time for what it was, but what struck me then was that it was an awkward replacement for the phonebook. Anyway, as of June 30th, Minitel is no more and France says farewell to the Minitel.

Minitel is important because it was the first large-scale information service. It turned out not to be as scalable and flexible as the web, but for a while it provided the French with all sorts of text services from directories to chat. It is famous for the messageries roses (pink messages) or adult chat services that emerged (and helped fund the system.)

In Canada, Bell introduced a version of Minitel called Alex (after Alexander Graham Bell) in the late 1980s, first in Quebec and then in Ontario. The service was too expensive and never took off. Thanks to a letter in today’s Globe I discovered that there was some interesting research and development into videotex services in Canada at the Communications Research Centre in the late 1970s and 1980s. Telidon was a “second generation” system that had true graphics, unlike Minitel.

Despite all sorts of interest and numerous experiments, videotex was never really successful outside of France and Minitel. It needed a lot of content for people to be willing to pay the price, and the broadcast model of most trials meant that you didn’t have the community generation of content needed. Services like CompuServe that ran on PCs (instead of dedicated terminals) were successful where videotex was not, and ultimately the web wiped out even services like CompuServe.

What is interesting, however, is how much interest and investment there was around the world in such services. The telecommunications industry clearly saw large-scale interactive information services as the future, but they were wedded to centralized models for how to try and evolve such a service. Only the French got the centralized model right by making it cheap, relatively open, and easy. That it lasted 30 years is an indication of how right Minitel was, even if the internet has replaced it.


Digital Infrastructure Summit 2012

A couple of weeks ago I gave a talk at Digital Infrastructure Summit 2012 which was hosted by the Canadian University Council of Chief Information Officers (CUCCIO). This short conference was very different from any other I’ve been at. CUCCIO, by its nature, is a group of people (university CIOs) who are used to doing things. They seemed committed to defining a common research infrastructure for Canadian universities and trying to prototype it. It seemed all the right people were there to start moving in the same direction.

For this talk I prepared a set of questions for auditing whether a university has good support for digital research in the humanities. See Check IT Out!. The idea is that anyone from a researcher to an administrator can use these questions to check out the IT support for humanists.

My conference notes are here.