A flow chart for Busa’s “Mechanized Linguistic Analysis”

Steven Jones has just put up a historic flow chart from the Busa Archive at the Università Cattolica del Sacro Cuore, Milan, Italy. See A flow chart for Busa’s “Mechanized Linguistic Analysis”. Jones has been posting important historical images associated with his book Roberto Busa, S.J., and the Emergence of Humanities Computing. This flow chart shows the logic of the punched-card and tape processing that Busa and Paul Tasman developed (Tasman was probably one of the designers of the chart). The folks at the Busa Archive had shared this flow chart with me for a paper I gave at the Instant History conference in Chicago on Busa’s methods. Now Steven has shared it openly with permission.

For more on the Busa Archive and what it shows us about the Index Thomisticus as a project, see here.

Instant History conference

This weekend I gave a talk at a lovely one-day conference, Instant History: The Postwar Digital Humanities and Their Legacies. My conference notes are here. The conference was organized by Paul Eggert, among others. Steven Jones, Ted Underwood, and Laura Mandell also spoke.

I gave the first talk, on “Tremendous Labour: Busa’s Methods”, a paper coming out of the work Stéfan Sinclair and I are doing on the reconstruction of Busa’s Index Thomisticus project. I claimed that Busa and Tasman made two crucial innovations. The first was figuring out how to represent data on punched cards so that it could be processed (the data structures). The second was figuring out how to use the punched card machines at hand to tokenize unstructured text. I walked through what we know about their actual methods and talked about our attempts to replicate them.
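Part of that replication work means re-expressing the methods in modern code. To give a sense of what the second innovation involves, here is a minimal sketch in Python of tokenizing unstructured text into one word per card with a location reference back into the text. The 80-column layout, field widths, and location code are my assumptions for illustration, not Busa and Tasman’s actual card design.

```python
# Toy reconstruction of tokenizing unstructured text onto punched
# cards, one word per card. The 80-column layout, field widths, and
# location code below are illustrative assumptions, not Busa and
# Tasman's actual card specification.

def tokenize_to_cards(text, work="IT", width=80):
    """Emit one fixed-width 'card' per word, carrying the token plus
    a location reference back into the source text."""
    cards = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word_no, word in enumerate(line.split(), start=1):
            token = word.strip('.,;:?!"').upper()   # card codes were upper case
            loc = f"{work}{line_no:04d}{word_no:02d}"  # hypothetical locator
            cards.append(f"{token:<60}{loc}".ljust(width)[:width])
    return cards

sample = "In principio erat Verbum,\net Verbum erat apud Deum."
for card in tokenize_to_cards(sample):
    print(repr(card))
```

Once every word sits on its own card with a locator, the sorting and collating machines of the day could do the rest: sorting the cards alphabetically gives you the skeleton of a concordance.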

I was lucky to have two great respondents (Kyle Roberts and Shlomo Argamon) who both pointed out important contextual issues to consider, such as:

  • We need to pay attention to the Jesuit and spiritual dimensions of Busa’s work.
  • We need to think about the dialectic between those critical of computing and those optimistic about it.

CWRC/CSEC: The Canadian Writing Research Collaboratory

The Canadian Writing Research Collaboratory (CWRC) today launched its Collaboratory. The Collaboratory is a distributed editing environment that allows projects to edit scholarly electronic texts (using CWRC-Writer), manage editorial workflows, and publish collections. There are also links to other tools like the CWRC Catalogue and Voyant (which I am involved in). There is an impressive set of projects already featured in CWRC, but it is open to new projects and designed to help them.

Susan Brown deserves a lot of credit for imagining this, writing the CFI (and other) proposals, leading the development, and now managing the release. I hope it gets used, as it is a fabulous layer of infrastructure designed by scholars for scholars.

One important component in CWRC is CWRC-Writer, an in-browser XML editor that can be hooked into content management systems like the CWRC back-end. It allows for stand-off markup and connects to entity databases for tagging entities in standardized ways.
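To make the stand-off idea concrete: the annotations live apart from the text and point into it by character offsets, leaving the text itself untouched. Here is a minimal sketch in Python; the annotation format and entity URIs are invented for illustration and are not CWRC-Writer’s actual schema.

```python
# Minimal stand-off markup: the text is never modified; annotations
# reference it by character offsets and link spans to entity URIs.
# (Format and URIs are hypothetical, not CWRC-Writer's schema.)

text = "Susan Brown leads the Canadian Writing Research Collaboratory."

annotations = [
    {"start": 0, "end": 11, "type": "person",
     "uri": "http://example.org/entity/susan-brown"},   # hypothetical URI
    {"start": 22, "end": 61, "type": "organization",
     "uri": "http://example.org/entity/cwrc"},          # hypothetical URI
]

# Tools resolve the annotations against the text on demand.
for a in annotations:
    print(f'{a["type"]:>12}: "{text[a["start"]:a["end"]]}" -> {a["uri"]}')
```

The appeal of this design is that overlapping hierarchies and multiple layers of annotation become possible, which inline XML markup struggles with.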

Text Mining The Novel 2015

On Thursday and Friday (Oct. 22nd and 23rd) I was at the second workshop of the Text Mining the Novel project. My conference notes are here: Text Mining The Novel 2015. We had a number of great papers on the issue of genre (this year’s topic). Here are some general reflections:

  • The obvious weakness of text mining is that it operates on the novel as text, specifically as digital text (or string). We need to find ways to also study the novel as a material object (thing), as a social object, as a performance (of the reader), and as an economic object in a marketplace. Then we also have to find ways to connect these.
  • So many analytical and mining processes depend on bags of words, from dictionaries to topics (see the sketch after this list). Is this a problem or a limitation? Can we try to abstract characters, plot, or argument instead?
  • I was interested in the philosophical discussions around the epistemological dimensions of novels and around philosophical claims about language and literature.
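To show just how reductive the bag-of-words move mentioned above is, here is a minimal sketch using only the Python standard library; the passage is merely an example:

```python
# A bag of words keeps counts and discards order, character, and plot.
from collections import Counter

passage = "It was the best of times, it was the worst of times."
tokens = [w.strip(".,").lower() for w in passage.split()]
bag = Counter(tokens)
print(bag.most_common(4))  # e.g. [('it', 2), ('was', 2), ('the', 2), ('of', 2)]
```

The counts survive; the sentence, and anything like plot, does not. That gap is what the workshop discussions about abstracting character and argument were circling around.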

Alain Resnais: Toute la mémoire du monde

Thanks to 3quarksdaily.com I came across the wonderful short film by Alain Resnais, Toute la mémoire du monde (1956). The short is about memory and the Bibliothèque nationale (of France). It starts at the roof of this fortress of knowledge and travels down through the architecture. It follows a book from when it arrives from a publisher to when it is shelved. It shows another book summoned by pneumatique to the reading room, where it crosses a boundary to be read. All of this is accompanied by a philosophical narration on information and memory.

The short shows big analogue information infrastructure at its technological and rational best, before digital informatics disrupted the library.

The future of the book: An essay from The Economist

The Economist has a nice essay on The future of the book. (Thanks to Lynne for sending this along.) The essay has three interfaces:

  • A listening interface
  • A remediated book interface where you can flip pages
  • A scrolling interface

Much as we have moved beyond skeuomorphic interfaces that carry over design cues from older objects, the book interface here is actually attractive. It suits the topic, which is captured in the title of the essay, “From Papyrus to Pixels: The Digital Transformation Has Only Just Begun.”

The content of the essay looks at how books have been remediated over time (from scroll to print) and then discusses the current shift to ebooks. It points out that the ebook market is not like the digital music market: people still like print books, and they don’t pick them apart the way they pick albums apart into songs. The essay is particularly interesting on the self-publishing phenomenon and how authors are bypassing publishers and stores by publishing through Amazon.

The last chapter talks about audio books, one of the formats of the essay itself, and other formats (like treadmill forms that flash words at speed). This is where they get to the “transformation that has only just begun.”

The Material in Digital Books

Elika Ortega, in a talk at Experimental Interfaces for Reading 2.0, mentioned two web sites that gather interesting material traces in digital books. One is The Art of Google Books, which gathers interesting scans from Google Books.

The other is the site Book Traces where people upload interesting examples of marginal marks. Here is their call for examples:

Readers wrote in their books, and left notes, pictures, letters, flowers, locks of hair, and other things between their pages. We need your help identifying them because many are in danger of being discarded as libraries go digital. Books printed between 1820 and 1923 are at particular risk. Help us prove the value of maintaining rich print collections in our libraries.

Book Traces also has a Tumblr blog.

Why are these traces important? One reason is that they help us understand what readers were doing and thinking while reading.

A World Digital Library Is Coming True!

Robert Darnton has a great essay in The New York Review of Books titled, A World Digital Library Is Coming True! The essay asks about publication and the public interest. He mentions how expensive some journals are getting and how knowledge paid for by the public (through support for research) thereby becomes inaccessible to the very public that might benefit from it.

In the US this trend has been counteracted by initiatives to legislate that publicly funded research be made available through an open access venue like PubMed Central. Needless to say, lobbyists are fighting mandates like the Fair Access to Science and Technology Research Act (FASTR).

Darnton concludes that “In the long run, journals can be sustained only through a transformation of the economic basis of academic publishing.” He argues for “flipping” the costs and charging processing fees to those who want to publish.

By creating open-access journals, a flipped system directly benefits the public. Anyone can consult the research free of charge online, and libraries are liberated from the spiraling costs of subscriptions. Of course, the publication expenses do not evaporate miraculously, but they are greatly reduced, especially for nonprofit journals, which do not need to satisfy shareholders. The processing fees, which can run to a thousand dollars or more, depending on the complexities of the text and the process of peer review, can be covered in various ways. They are often included in research grants to scientists, and they are increasingly financed by the author’s university or a group of universities.

While I agree on the need to focus on the public good, I worry that “flipping” will limit who gets published. In STEM fields, where most research is funded, one can build the cost of processing fees into the funding, but in the humanities, where much research is not funded, many colleagues would have to pay out of pocket to get published. Darnton mentions how at Harvard (his institution) they have a program that subsidizes processing fees … they would, and therein lies the problem. Those at wealthy institutions will have an advantage in an environment where publishers need processing fees, while those without subsidies (whether private scholars, alternative academics, or instructors) will have to decide whether they can really afford to publish. Creating an economy where it is not the best ideas that get published but those of an elite caste is not a recipe for the public good.

I imagine Darnton recognizes the need for solutions other than processing fees and, in fact, he goes on to talk about the Digital Public Library of America and OpenEdition Books as initiatives that are making monographs available online for free.

I suspect that what will work in the humanities is finding funding for the editorial and publishing functions of journals as a whole rather than for individual articles. We have a number of journals in the digital humanities, like Digital Humanities Quarterly, where the costs of editing and publishing are borne by individuals like Julia Flanders, who have made it a labor of love, by their universities that support them, and by our scholarly association, which provides technical support and some funding. DHQ doesn’t charge processing fees, which means that all sorts of people who don’t have access to subsidies can be heard. It would be interesting to poll the authors published and see how many have access to processing fee subsidies. It is bad enough that our conferences are expensive to attend; let’s not skew the published record too.

Which brings me back to the public good. Darnton ends his essay writing about how the DPLA is networking all sorts of collections together. It is not just providing information as a good, but bringing together smaller collections from public libraries and universities. This is one of the possibilities of the internet: distributed resources can be networked into greater goods rather than having to be centralized. The DPLA doesn’t need to be THE PUBLIC LIBRARY that replaces all libraries the way Amazon is pushing out bookstores. The OpenEdition project goes further and offers infrastructure for publishing knowledge, keeping costs down for everyone. A combination of centrally supported infrastructure used by editors who get local support (and credit) will make more of a difference than processing fees, be more equitable, and do more for public participation, which is a good too.

Research Data Management Week

This week the University of Alberta is running a Research Data Management Week. They have sessions throughout the week. I will be presenting on “Weaving Data Management into Your Research.” The need for discussions around research data management is described on the web page:

New norms and practices are developing around the management of research data. Canada’s research councils are discussing the introduction of data management plans within their application processes. The University of Alberta’s Research Policy now addresses the stewardship of research records, with an emphasis on the long-term preservation of data. An increasing number of scholarly journals are requiring authors to provide access to the data behind their submissions for publication. Data repositories are being established in domains and institutions to support the sharing and preservation of data. The series of talks and workshops that have been organized will help you better prepare for this emerging global research data ecosystem.

The University now has language in the Research Policy stating that the University will:

Ensure that principles of stewardship are applied to research records, protecting the integrity of the assets.

The Research Records Stewardship Guidance Procedure then identifies concrete responsibilities of researchers.

These policies, and the larger issue of information stewardship, have become important to research infrastructure. See my blog entry about the TC3+ document Capitalizing on Big Data.