High Performance Visualization

Screen shot of the visualization

I’m working with the folks at our local HPC consortium, SHARCNET, on imagining how we could visualize texts with high-resolution displays, 3D displays, and cluster computing. The project, temporarily called The Big See, has generated an interesting beta version. You can see a video of the process running and images from the final visualization at Version Beta 2.

One of the unanticipated insights from this project is that the process of building the 3D model, which I will call the *animation*, is as interesting as the final visual model. From the very first version you could see the text flowing up and the high-frequency words jostling each other for position. Words would start high and then slide clockwise around, and collocations would build up as the animation ran. We don’t have the animation right yet, but I think we are on to something. You can see Version B2 as an MP4 animation here.
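To give a sense of what drives that kind of animation, here is a minimal sketch (in Python, and emphatically not the actual Big See code) of the running word frequencies behind it: as each word of the text arrives, the counts update and the ranking the words “jostle” for is recomputed.

```python
# A minimal sketch (not the Big See code) of running word frequencies:
# each incoming token updates the counts, and the top-N ranking that the
# words "jostle" for is recomputed frame by frame.
from collections import Counter
import re

def frequency_frames(text, top_n=10):
    """Yield the top-N ranking after each word, one 'frame' per token."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        counts[word] += 1
        yield counts.most_common(top_n)

sample = "the quick brown fox jumps over the lazy dog the fox"
for i, frame in enumerate(frequency_frames(sample, top_n=3)):
    print(i, frame)
```

A real version would also track which words co-occur so that collocations accumulate over time, but the frame-by-frame logic is the same.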

Now we will start playing with the parameters – colours, transparency, and weight of lines.

Next Steps for E-Science and the Textual Humanities

D-Lib Magazine has a report on next steps for high performance computing (or as they call it in the UK, “e-science”) and the humanities, Next Steps for E-Science, the Textual Humanities and VREs. The report summarizes four presentations on what is next. Some quotes and reactions,

The crucial point they made was that digital libraries are far more than simple digital surrogates of existing conventional libraries. They are, or at least have the potential to be, complex Virtual Research Environments (VREs), but researchers and e-infrastructure providers in the humanities lack the resources to realize this full potential.

I would call this the cyberinfrastructure step, but I’m not sure it will be libraries that lead. Nor am I sure about the “virtual” in research environments. Space matters and real space is so much more high-bandwidth than the virtual. In fact, subsequent papers made something like this point about the shape of the environment to come.

Loretta Auvil from the NCSA is summarized to the effect that the Software Environment for the Advancement of Scholarly Research (SEASR) is,

API-driven approach enables analyses run by text mining tools, such as NoraVis (http://www.noraproject.org/description.php) and Featurelens (http://www.cs.umd.edu/hcil/textvis/featurelens/) to be published to web services. This is critical: a VRE that is based on digital library infrastructure will have to include not just text, but software tools that allow users to analyse, retrieve (elements of) and search those texts in ever more sophisticated ways. This requires formal, documented and sharable workflows, and mirrors needs identified in the hard science communities, which are being met by initiatives such as the myExperiment project (http://www.myexperiment.org). A key priority of this project is to implement formal, yet sharable, workflows across different research domains.

While I agree, of course, on the need for tools, I’m not sure it follows that this “requires” us to be able to share workflows. Our data from TAPoR is that it is the simple environment, TAPoRware, that is being used most, not the portal, though simple tools may be a way into VREs. I’m guessing that the idea of workflows is more of a hypothesis about what will enable the rapid development of domain-specific research utilities (where a utility does a task of the domain, while a tool does something more primitive). Workflows could turn out to be perceived as domain-specific composite tools rather than flows, just as most “primitive” tools have some flow within them. What may happen is that libraries and centres hire programmers to develop workflows for particular teams, in consultation with researchers, for specific resources, and this is the promise of SEASR. When it crosses the Rubicon of reality it will give support units a powerful way to rapidly deploy sophisticated research environments. But if it is programmers who do this, will they want a flow-model application development environment, or will they default back to something familiar like Java? (What is the research on the success of visual programming environments?)
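To make the tool/utility distinction concrete, here is a hedged sketch of a composite “utility” assembled from primitive tools; the function names are illustrative and are not SEASR or TAPoR components.

```python
# An illustrative composite "utility" (high_frequency_words) assembled
# from primitive tools (tokenize, count, drop_stopwords). Not SEASR or
# TAPoR code; just a sketch of a workflow frozen into a single tool.
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def count(tokens):
    return Counter(tokens)

def drop_stopwords(counts, stopwords):
    return Counter({w: n for w, n in counts.items() if w not in stopwords})

def high_frequency_words(text, stopwords=frozenset({"the", "a", "of", "and", "to"})):
    """The domain 'utility': a fixed workflow over the primitive tools above."""
    return drop_stopwords(count(tokenize(text)), stopwords).most_common(10)

print(high_frequency_words("The people of the country think the people know."))
```

Whether researchers would rather wire such chains together in a visual flow editor or have a programmer write them like this is exactly the open question.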

Bontcheva is reported as presenting the General Architecture for Text Engineering (GATE).

A key theme of the workshop was the well documented need researchers have to be able to annotate the texts upon which they are working: this is crucial to the research process. The Semantic Annotation Factory Environment (SAFE) by GATE will help annotators, language engineers and curators to deal with the (often tedious) work of SA, as it adds information extraction tools and other means to the annotation environment that make at least parts of the annotation process work automatically. This is known as a ‘factory’, as it will not completely substitute the manual annotation process, but rather complement it with the work of robots that help with the information extraction.

The alternative to the tool model of what humanists need is the annotation environment. John Bradley has been pursuing a version of this with Pliny. It is premised on the view that humanists want to closely mark up, annotate, and manipulate smaller collections of texts as they read. Tools have a place, but within a reading environment. GATE is doing something a little different – they are trying to semi-automate linguistic annotation, but their tools could be used in a more exploratory environment.
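As a toy illustration of the “factory” idea (not GATE or SAFE itself), the sketch below has simple rules propose annotations automatically so that a human annotator only has to review them rather than mark everything up by hand. The rules and labels are made up for the example.

```python
# A toy illustration (not GATE/SAFE): rule-based annotation suggestions
# that a human annotator would accept or reject. Rules are illustrative.
import re

RULES = {
    "DATE": re.compile(r"\b\d{1,2} (January|February|March|April|May|June|"
                       r"July|August|September|October|November|December) \d{4}\b"),
    "PERSON_TITLE": re.compile(r"\b(Dr|Prof|Mr|Ms)\. [A-Z][a-z]+\b"),
}

def suggest_annotations(text):
    """Return (label, start, end, matched_text) tuples for human review."""
    suggestions = []
    for label, pattern in RULES.items():
        for m in pattern.finditer(text):
            suggestions.append((label, m.start(), m.end(), m.group(0)))
    return sorted(suggestions, key=lambda s: s[1])

print(suggest_annotations("Prof. Bradley lectured on 12 March 2007 about Pliny."))
```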

What I like about this report is that we see three complementary and achievable visions of the next steps in digital humanities:

  • The development of cyberinfrastructure building on the library, but also digital humanities centres.
  • The development of application development frameworks that can create domain-specific interfaces for research that takes advantage of large-scale resources.
  • The development of reading and annotation tools that work with and enhance electronic texts.

I think there is a fourth agenda item we need to consider, which is how we will enable reflection on and preservation of the work of the last 40 years. Willard McCarty has asked how we will write the history of humanities computing, and I don’t think he means a list of people and dates. I think he means how we will develop from a start-up and unreflective culture to one that tries to understand itself in change. That means we need to start documenting and preserving what Julia Flanders has called the craft projects of the first generations, which prepared the way for these large-scale visions.

Toy Chest (Online or Downloadable Tools for Building Projects)

Alan Liu and others have set up a Knowledge Base for the Department of English at UCSB, which includes a neat Toy Chest (Online or Downloadable Tools for Building Projects) for students. The idea is to collect free or very cheap tools that students can use, and they have done a nice job documenting them.

The idea of a departmental knowledge base is also a good one. I assume the idea is that this can be an informal place for the public knowledge that faculty, staff, and students gather.

netzspannung.org | Archive | Archive Interfaces

Image of Semantic Map

netzspannung.org is a German new media group with an archive of “media art, projects from IT research, and lectures on media theory as well as on aesthetics and art history.” They have a number of interfaces to this archive; for an explanation see Archive Interfaces. The most interesting is the Java Semantic Map (see picture above).

netzspannung.org is an Internet platform for artistic production, media projects, and intermedia research. As an interface between media art, media technology and society, it functions as an information pool for artists, designers, computer scientists and cultural scientists. Headed by Monika Fleischmann and Wolfgang Strauss, at the MARS Exploratory Media Lab, interdisciplinary teams of architects, artists, designers, computer scientists, art and media scientists are developing and producing tools and interfaces, artistic projects and events at the interface between art and research. All developments and productions are realised in the context of national and international projects.

See The Semantic Map Interface for more on their Java Web Start archive browser.


OpenSocial – Google Code

OpenSocial image

Two days ago, on the day of All Hallows (All Saints), Google announced OpenSocial, a collection of APIs for embedded social applications. Actually, much of the online documentation, like the first OpenSocial API Blog entry, didn’t go up until early in the morning on November 2nd, after the Campfire talk. On November 1st they had their rather hokey Campfire One in one of the open spaces in the Googleplex. A sort of Halloween for older boys.

Image from YouTube

Screen from YouTube video. Note the campfire monitors.

OpenSocial is, however, important to tool development in the humanities. It provides an open model for the type of energetic development we saw in the summer after the Facebook Platform was launched. If it proves rich enough, it will provide a way for digital libraries and online e-text sites to open their interfaces to research tools developed in the community. It could allow us tool developers to create tools that researchers can easily add to their sites – tools that are social and can draw on remote sources of data to mash up with the local text. This could enable an open mashup of information that is at the heart of research. It also gives libraries a way to let in tools like the TAPoR Toolbar. For that matter we might see creative tools coming from our students as they fiddle with the technology in ways we can’t imagine.
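The gadgets themselves are JavaScript, but the underlying mashup idea is simple enough to sketch in a few lines of Python (this is not the OpenSocial API, and the URL is hypothetical): fetch a remote resource and fold it into the analysis of a local text.

```python
# Not the OpenSocial API; just the mashup idea it enables: pull a remote
# resource and combine it with analysis of a local text. The URL below
# is hypothetical.
import re
from collections import Counter
from urllib.request import urlopen

def remote_word_list(url="https://example.org/political-terms.txt"):
    """Fetch a newline-separated word list from a remote (hypothetical) source."""
    with urlopen(url) as response:
        lines = response.read().decode("utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}

def mashup(local_text, remote_terms):
    """Count how often the remote terms appear in the local text."""
    counts = Counter(re.findall(r"[a-z']+", local_text.lower()))
    return {term: counts[term] for term in remote_terms if counts[term]}
```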

The key difference between OpenSocial and the Facebook Platform is that the latter, as brilliant as it is, is limited to social applications for Facebook. OpenSocial can be used by any host container or social app builder. Some of the other host sites that have committed to using it are Ning and Slide. Speaking of Ning, Marc Andreessen has the best explanations of the significance of both the Facebook Platform phenomenon and the potential of OpenSocial on his blog, blog.pmarca.com (have a gander at the other stuff on Ning and OpenSocial too).

Republican Debate: Analyzing the Details – The New York Times

Screen image

The New York Times has created another neat text visualization, this time for the Republican Debate. The visualization has two panels. One shows the video, a transcript, and sections; you can jump around the video using the transcript or the section outline. The other is a “Transcript Analyzer” where you can see a rich prospect of the debate divided by speeches and search for words. What is missing is some sort of overview of what the high-frequency words are and how they collocate.

So I have created a public text for analysis in TAPoR, and here are some results. Below is a list of high-frequency words generated using the List Words tool (a rough sketch of this kind of counting follows the lists). Some interesting words:

People (76), Think (66), Know (48), Giuliani (42), Clinton (33), Reagan (13), Democrats (16), Republicans (11)

Health (45), Government (35), Security (35), Country (25), Policy (16), Military (15), School (15)

Marriage (23), Insurance (23), Conservative (23), Private (22), Let (21), Gay (12)

Iraq (13), Iran (12), Turkey (7), Canada (2), Darn (2), Europe (5)

Immigrants (5), Citizens (2)

Man (7), Mean (7), Woman (4), Congressman (25)

Answer (10), Problem (10), Solution (5), War (12)
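For readers who want to reproduce this sort of thing outside TAPoR, here is a rough Python equivalent (not the List Words tool) of the counting above, plus the collocation overview the Times interface lacks; the transcript filename is hypothetical.

```python
# A rough equivalent (not TAPoR's List Words tool) of the counts above,
# plus simple collocation: words appearing within a few tokens of a target.
import re
from collections import Counter

def word_counts(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

def collocates(text, target, window=5, top_n=10):
    """Words occurring within `window` tokens of `target`, by frequency."""
    tokens = re.findall(r"[a-z']+", text.lower())
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            near.update(neighbours)
    return near.most_common(top_n)

text = open("republican_debate_transcript.txt").read()  # hypothetical local copy
print(word_counts(text).most_common(20))
print(collocates(text, "health"))
```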


Plagiarism and The Ecstasy of Influence

Jonathan Lethem had a wonderful essay, The Ecstasy of Influence: A Plagiarism, in the February 2007 Harper’s. The twist to the essay, which discusses the copying of words, gift economies, and public commons, was that it was mostly plagiarized – a collage text – something I didn’t realize until I got to the end. The essay challenges our ideas of academic integrity and plagiarism.

In my experience plagiarism has been getting worse with the Internet. There are now web sites like Customessay.org where you can buy customized essays for as little as $12.95 a page. Do the math – a five-page paper comes to about $65, probably less than the textbook, and it won’t get detected by services like Turnitin.

These essay writing companies actually offer to check that the essay you are buying isn’t plagiarized. Here is what Customessay.org says about their Cheat Guru software:

Custom Essay is using the specialized Plagiarism Detection software to prevent instances of plagiarism. Furthermore, we have developed the special client module and made this software accessible to our customers. Many companies claim to utilize the tools of such kind, few of them do and none of them offer their Plagiarism Detection software to their customers. We are sure about the quality of our work and provide our customers with effective tools for its objective assessment. Download and install our Cheat Guru and test the quality of the products you receive from us or elsewhere.

Newspapers have been running stories on plagiarism, like JS Online: Internet cheating clicks with students, connecting it to ideas from a book by David Callahan, The Cheating Culture (see the archived copy of the Education page that was on his site).

There is a certain amount of research on plagiarism on the web. A place to start is The Plagiarism Resource Site or the University of Maryland University College’s Center for Intellectual Property page on Plagiarism.

I personally find it easy to catch students who crib from the web by using Google. When I notice a shift in the professionalism of the writing, I take a sequence of five or so words and Google the phrase in quotation marks. Google will show me the web page the sequence came from. The trick is finding a sequence short enough not to be affected by paraphrasing while long and unique enough to find the web site the student used. This Salon article, “The Web’s plagiarism police” by Andy Dehnart, talks about services and tools that do similar things.
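Here is a sketch of that by-hand method: pull candidate five-word phrases out of a suspect passage so they can be pasted into a search engine in quotation marks. It is purely illustrative (it does not call Google), and the sample passage is invented.

```python
# A sketch of the search-phrase method described above: extract quoted
# five-word sequences from a suspect passage, skipping very common
# openers, so they can be pasted into a search engine. Illustrative only.
import re

def candidate_phrases(text, n=5, max_phrases=10):
    """Return quoted n-word sequences suitable for an exact-phrase search."""
    common = {"the", "a", "an", "in", "of", "and", "to", "it", "is", "that"}
    tokens = re.findall(r"[A-Za-z']+", text)
    phrases = []
    for i in range(0, len(tokens) - n + 1, n):
        window = tokens[i:i + n]
        if window[0].lower() not in common:
            phrases.append('"' + " ".join(window) + '"')
        if len(phrases) >= max_phrases:
            break
    return phrases

suspect = ("Hermeneutics designates the reflexive practice of interpretation "
           "whereby textual meaning is constituted through the historically "
           "situated encounter of reader and work.")
print(candidate_phrases(suspect))
```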

Perhaps the greatest use of these plagiarism-catching tools is that they might show us how anything we write is woven out of the words of others. It’s possible they could be adapted to show us the web of connections radiating out from anything written.

Note: This entry was edited in Feb. 2018 to fix broken links. Thanks to Alisa from Plagiarism Check for alerting me to the broken links.

Kirschenbaum: Hamlet.doc?

Matt Kirschenbaum has published an article in The Chronicle of Higher Education titled, Hamlet.doc? Literature in a Digital Age (from the issue of August 17, 2007). The article teases us with the question of what we scholars could learn about the writing of Hamlet if Shakespeare had left us his hard drive. Kirschenbaum has nicely described and theorized the digital archival work humanists will need to learn to do in his forthcoming book from MIT Press, Mechanisms. Here is the conclusion of the Chronicle article,

Literary scholars are going to need to play a role in decisions about what kind of data survive and in what form, much as bibliographers and editors have long been advocates in traditional library settings, where they have opposed policies that tamper with bindings, dust jackets, and other important kinds of material evidence. To this end, the Electronic Literature Organization, based at the Maryland Institute for Technology in the Humanities, is beginning work on a preservation standard known as X-Lit, where the “X-” prefix serves to mark a tripartite relationship among electronic literature’s risk of extinction or obsolescence, the experimental or extreme nature of the material, and the family of Extensible Markup Language technologies that are the technical underpinning of the project. While our focus is on avant-garde literary productions, such literature has essentially been a test bed for a future in which an increasing proportion of documents will be born digital and will take fuller advantage of networked, digital environments. We may no longer have the equivalent of Shakespeare’s hard drive, but we do know that we wish we did, and it is therefore not too late – or too early – to begin taking steps to make sure we save the born-digital records of the literature of today.