For those wondering why I haven’t been blogging and why Theoreti.ca seems to be unavailable, the answer is that the blog has been hacked and I’m trying to solve the problem. My ISP rightly freezes things when the blog seems to send spam. Sorry about all this!
Google Starts Grant Program for Studies of Its Digitized Books – Technology – The Chronicle of Higher Education
The Chronicle of Higher Education has a story about a Google Grant Program for Studies of Its Digitized Books. Many of us have been encouraging Google to open Google Books to research projects, including crowdsourcing projects that could improve the content. Google should be congratulated on creating this program and actually providing support for experiments.
Understandably some worry about humanists becoming dependent on Google Books – the worry is that we will “lock-in” our new research practices to one data-set, that of Google. I doubt this is really going to be a problem. It is still too early in the development of new analytical practices for lock-in to set in. Further, the quality of the Google texts is poor (the price of digitizing at such scale was that there was no correction and no markup), so there is room for other data-sets, from commercial collections to open projects.
Tagging Full Text Searchable Articles: An Overview of Social Tagging Activity in Historic Australian Newspapers August 2008 – August 2009
D-Lib has an article by Rose Holley of the Australian Newspapers Digitisation Program (ANDP), on Tagging Full Text Searchable Articles: An Overview of Social Tagging Activity in Historic Australian Newspapers August 2008 – August 2009 (January/February 2010, Volume 16, Number 1/2.)
The Australian Newspapers project is a leader in crowdsourcing. They encourage users to correct the full text of articles and to tag them. This D-Lib article focuses on the tagging and mentions other projects that have researched the effectiveness of user tagging (and found it wanting compared to professional subject tagging). The conclusion nonetheless endorses user tagging,
The observations show that there were both similarities and differences in tagging activity and behaviours across a full text collection as compared to the research done on tagging in image collections. Similarities included that registered users tag more than anonymous users, that distinct tags form 21-37% of the tag pool, that 40% or more of the tag pool is created by ‘super-taggers’ (top 10 tag creators), that abuse of tags occurs rarely if at all, and that spelling mistakes occur fairly frequently if spell-check or other mechanisms are not implemented at the tag creation point. Notable differences were the higher percentage of distinct tags used only once (74% at NLA) and the predominant use of personal names in these tags. This is perhaps related to the type of resource (historic newspaper) rather than its format (full-text). It is likely that this difference may be duplicated if tagging were enabled across archive and manuscript collections. There was an expectation from users that since this was a library service offering tagging, there would be some ‘strict library rules’ for creating tags, and users were surprised there were none. The users quickly developed their own unwritten guidelines. Clay Shirky suggests “Tagging gets better with scale” and libraries have lots of scale – both in content and users. We shouldn’t get too hung up on guidelines and quality. I agree with Shirky that “If there is no shelf, then even imagining that there is one right way to organise things is an error”.
The experience of the National Library of Australia shows that tagging is a good thing, users want it, and it adds more information to data. It costs little to nothing and is relatively easy to implement; therefore, more libraries and archives should just implement it across their entire collections. This is what the National Library of Australia will have done by the end of 2009.
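To make the quoted metrics concrete, here is a minimal sketch of how the two figures Holley reports – distinct tags as a share of the tag pool, and the share contributed by the top ten “super-taggers” – could be computed from a tag log. The log and its field layout are invented for illustration, not taken from the ANDP system.

```python
from collections import Counter

# Hypothetical tag log: (user, tag) pairs, one row per tagging action.
# The users and tags are invented, not from the ANDP data.
tag_log = [
    ("kmr", "gold rush"), ("kmr", "Smith, John"), ("kmr", "shipping"),
    ("anon", "gold rush"), ("ruth", "Smith, John"), ("anon", "floods"),
]

tag_pool = [tag for _, tag in tag_log]            # every tagging action
distinct_share = len(set(tag_pool)) / len(tag_pool)

tags_per_user = Counter(user for user, _ in tag_log)
top_ten_total = sum(n for _, n in tags_per_user.most_common(10))
super_tagger_share = top_ten_total / len(tag_log)

print(f"Distinct tags: {distinct_share:.0%} of the tag pool")
print(f"Top 10 taggers contributed {super_tagger_share:.0%} of all tags")
```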
Open-Source, Multitouch Display
Now this is something I want to build – an Open-Source, Multitouch Display, but with a fine wood cabinet.
Research in Support of Digital Libraries at Xerox PARC Part II: Technology
I came across an interesting article in D-Lib that summarizes some of the work at Xerox PARC, Research in Support of Digital Libraries at Xerox PARC Part II: Technology. This is, as the title suggests, the second part of an extended survey. The article covers some projects on subjects like “visualization of large text collections, summarization, and automatic detection of thematic structure.” There are some interesting examples of citation browsing tools, like the Butterfly Citation Browser.
Another humanities computing centre is dissolved
On Humanist there was an announcement that John Dawson, the Manager of the Literary and Linguistic Computing Centre of Cambridge (LLCC), was retiring and they were having a 45th year celebration conference and retirement party. What the announcement doesn’t say is that with the retirement of Dawson the Cambridge Computing Service is decommissioning the LLCC. I found this on a Computing Service page dedicated to the LLCC:
John Dawson, Manager of the centre will be retiring in October 2009. The LLCC will then cease to exist as a distinct unit, but Rosemary Rodd, current Deputy Manager, will continue to provide support for Humanities computing as a member of the Computing Service’s Technical User Services. She will be based on the New Museums Site.
It seems symptomatic of some shift.
BBC links to other news sites: Moreover Technology
The BBC News has an interesting feature where their stories link to other stories on the same subject from other news sources. See, for example, the story Chavez backer held over TV attack – on the right there are links to stories on the same subject from other news venues like the Philadelphia Inquirer. They even explain why the BBC links to other news sites.
How does it work?
The Newstracker system uses web search technology to identify content from other news websites that relates to a particular BBC story. A news aggregator like Google News or Yahoo News uses this type of technique to compare the text of stories and group similar ones together.
BBC News gets a constantly updating feed of stories from around 4000 different news websites. The feed is provided to us by Moreover Technologies. The company provides a similar service for other clients.
Our system takes the stories and compares their text with the text of our own stories. Where it finds a match, we can provide a link directly from our story to the story on the external site.
Because we do this comparison very regularly, our stories contain links to the most relevant and latest articles appearing on other sites.
Sounds like an interesting use of “real time” text analysis and an alternative to Google News. Could we implement something like that for blogs?
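As a thought experiment, here is a minimal sketch of the matching step – comparing the text of posts and surfacing the most similar ones – using TF-IDF cosine similarity with scikit-learn. The posts are invented, and this is only my guess at the general approach, not a description of Moreover’s actual system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example posts standing in for a feed of blog entries.
our_post = "Chavez supporter detained after attack on TV station"
external_posts = [
    "Venezuelan TV station attacked; Chavez backer held by police",
    "New open-source multitouch display announced",
    "Google opens its digitized books to researchers",
]

# Vectorize all texts together so they share one vocabulary.
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform([our_post] + external_posts)

# Compare our post (row 0) against the external posts (rows 1..n)
# and list them from most to least similar.
scores = cosine_similarity(vectors[0], vectors[1:]).flatten()
for post, score in sorted(zip(external_posts, scores),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{score:.2f}  {post}")
```

A real service would also need crawling, deduplication, and recency weighting, but the core text comparison can be this simple.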
Extracts from original TEI planning proposal
I recently discovered (thanks to a note from Lou Burnard to the TEI list) a document online with extracts from the Funding Proposal for Phase 1 (Planning Conference) for the Text Encoding Initiative which led to the Poughkeepsie conference of 1987 that laid out the plan for the TEI.
The document is an appendix to the 1988 full Proposal for Funding for An Initiative to Formulate Guidelines for the Encoding and Interchange of Machine-Readable Texts. The planning proposal led to the Poughkeepsie conference where consensus was developed that led to the full proposal that funded the initial development of the TEI Guidelines. (Get that?)
The doubled document (the Extracts of the first proposal is an appendix to the 1988 proposal) is fascinating to read 20 years later. In section “3.4 Concrete Results” of the full proposal they describe the outcomes of the full grant thus:
Ultimately, this project will produce a single potentially large document which will:
- define a format for encoded texts, into which texts prepared using other schemes can be translated,
- define a formal metalanguage for the description of encoding schemes,
- describe existing schemes (and the new scheme) formally in that metalanguage and informally in prose,
- recommend the encoding of certain textual features as minimal practice in the encoding of new texts,
- provide specific methods for encoding specific textual features known empirically to be commonly used, and
- provide methods for users to encode features not already provided for, and for the formal definition of these extensions.
I am struck by how the TEI has achieved most of these goals (and others, like a consortial structure for sustainable evolution.) It is also interesting to note what seems to have been done differently, like the second and third bullet points – the development of a “metalanguage for the description of encoding schemes” and “describing existing schemes” with it. I hadn’t thought of the TEI Guidelines as a project to document the variety of encoding schemes. Have they done that?
Another interesting wrinkle is in the first proposal extracts where the document talks about “What Text ‘Encoding’ Is”. First of all, why the single quotation marks around “encoding” – was this a new use of the term then? Second, they mention that “typically, a scheme for encoding texts must include:”
Conventions for reducing texts to a single linear sequence wherever footnotes, text-critical apparatus, parallel columns of text (as in polyglot texts), or other complications make the linear sequence problematic.
It is interesting to see linearity creep into what encoding schemes “must” do, including one that is ultimately hierarchical and non-linear. I wonder how to interpret this – is it simply a pragmatic matter of how you organize the linear sequence of text and code in the TEI document, especially when what you are trying to represent is not linear? Could it be the need for encoded text to be a “string” for the computer to parse? Time to ask someone.
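To illustrate the “string” point: an encoded text is stored as one linear sequence of characters, and it is the parser that recovers the hierarchy from it. A minimal sketch in Python – the markup is generic XML in the spirit of the TEI, not a conformant TEI document:

```python
import xml.etree.ElementTree as ET

# The encoded text is one linear string: the footnote is forced inline
# into the character sequence even though it is not part of the running
# text's linear order.
encoded = (
    "<p>Call me Ishmael<note place='foot'>On the name, see ch. 2.</note>"
    " – some years ago...</p>"
)

root = ET.fromstring(encoded)  # the parser rebuilds the hierarchy

# Walk the tree: the non-linear apparatus is now a child element that
# software can render as a footnote, sidebar, or popup.
for element in root.iter():
    print(element.tag, "->", (element.text or "").strip())
```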
Drawing attention to the things that seem strange obscures the fact that these two proposals were immensely important for digital humanities. They describe how the proposers imagined problems of text representation could be solved by an international project. We can look back and admire the clarity of vision that led to the achievements of the TEI – achievements of not just a few people, but of many organized as per the proposal. These are beautiful (and influential) administrative documents, if we dare say there is such a thing. I would say that they and the Guidelines themselves are some of the most important scholarship in our field.
Sperberg-McQueen: Making a Synthesizer Sound like an Oboe
Michael Sperberg-McQueen has an interesting colloquium paper that I just came across, The State of Computing in the Humanities: Making a Synthesizer Sound like an Oboe. There is an interesting section on “Document Geometries” where he describes different ways we represent texts on a computer, from linear representations to typed hierarchies (like the TEI’s.)
The entire TEI Guidelines can be summed up in one phrase, which we can imagine directed at producers of commercial text processing software: “Text is not simple.”
The TEI attempts to make text complex — or, more positively, the TEI enables the electronic representation of text to capture more complexity than is otherwise possible.
The TEI makes a few claims of a less vague nature, too.
- Many levels of text, many types of analysis or interpretation, may coexist in scholarship, and thus must be able to coexist in markup.
- Text varies with its type or genre; for major types the TEI provides distinct base tag sets.
- Text varies with the reader, the use to which it is put, the application software which must process it; the TEI provides a variety of additional tag sets for these.
- Text is linear, but not completely.
- Text is not always in English. It is appalling how many software developers forget this.
None of these claims will surprise any humanist, but some of them may come as a shock to many software developers.
This paper also got me thinking about the obviousness of structure. Sperberg-McQueen criticizes the “tagged linear” geometry (as in COCOA tagged text) thus,
The linear model captures the basic linearity of text; the tagged linear model adds the ability to model, within limits, some non-linear aspects of the text. But it misses another critical characteristic of text. Text has structure, and textual structures can contain other textual structures, which can contain still other structures within themselves. Since as readers we use textual structures to organize text and reduce its apparent complexity, it is a real loss if software is incapable of recognizing structural elements like chapters, sections, and paragraphs and insists instead on presenting text as an undifferentiated mass.
I can’t help asking if text really does have structure or if it is in the eye of the reader. Or perhaps, to be more accurate, if text has structure in the way we mean when we tag text using XML. If I were to naively talk about text structure I would actually be more likely to think of material things like pages, cover, tabs (in files), and so on. I might think of things that visually stand out like sidebars, paragraphs, indentations, coloured text, headings, or page numbers. None of these are really what gets encoded in “structural markup.” Rather what gets encoded is a logic or a structure in the structuralist sense of some underlying “real” structure.
Nonetheless, I think Sperberg-McQueen is onto something about how readers use textual structures, and about the need therefore to give them similar affordances. I would rephrase the issue as a matter of giving readers affordances with which to manage the complexity and amount of text. A book gives you things like a Table of Contents and Index. An electronic text (or electronic book) doesn’t have to give you exactly the same affordances, but we do need some ways of managing the excess complexity of text. In fact, we should be experimenting with what the computer can do well rather than reimplementing what paper does well. You can’t flip pages on the computer or find a coffee stain near something important, but you can scroll or search for a pattern. The TEI and logical encoding are about introducing computationally useful structure, not reproducing print structures. That’s why pages are so awkward in the TEI.
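One example of such an affordance: once chapters and sections are explicit nested elements rather than typographic conventions, a table of contents can be generated on the fly instead of being typeset once. A minimal sketch, again with generic XML and invented element names:

```python
import xml.etree.ElementTree as ET

# Invented nested document: divisions contain divisions.
doc = ET.fromstring("""
<text>
  <div type="chapter"><head>Of the Monstrous Pictures of Whales</head>
    <div type="section"><head>Erroneous engravings</head></div>
    <div type="section"><head>The true form</head></div>
  </div>
</text>
""")

def toc(element, depth=0):
    """Recursively print the heads of nested divisions as a table of contents."""
    for div in element.findall("div"):
        head = div.find("head")
        print("  " * depth + (head.text if head is not None else "(untitled)"))
        toc(div, depth + 1)

toc(doc)
```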
Update: The original link to the paper doesn’t work now, try this SSOAR Link – they have a PDF. (Thanks to Michael for pointing out the link rot.)
What would Dante think? EA puts sexual bounty on booth babes
Ars Technica has a story about how EA puts sexual bounty on the heads of its own booth babes.
EA has a new way to annoy its own models: give out prizes for Comic Con attendees who commit acts of lust with their booth babes. Also, if you win, you get to take the lady out to dinner! This is going to end well for everyone involved.
All this to promote Dante’s Inferno, their new game. I wish I had the time to identify to which circle of hell Dante would have consigned the marketing idiot who came up with this embarrassment.