A World Digital Library Is Coming True!

Robert Darnton has a great essay in The New York Review of Books titled A World Digital Library Is Coming True! The essay asks what publication in the public interest should look like. He notes how expensive some journals are getting, which means that knowledge paid for by the public (through support for research) becomes inaccessible to the very public that might benefit from it.

In the US this trend has been counteracted by initiatives to legislate that publicly funded research be made available through an open access venue like PubMed Central. Needless to say, lobbyists are fighting mandates like the Fair Access to Science and Technology Research Act (FASTR).

Darnton concludes that “In the long run, journals can be sustained only through a transformation of the economic basis of academic publishing.” He argues for “flipping” the costs and charging processing fees to those who want to publish.

By creating open-access journals, a flipped system directly benefits the public. Anyone can consult the research free of charge online, and libraries are liberated from the spiraling costs of subscriptions. Of course, the publication expenses do not evaporate miraculously, but they are greatly reduced, especially for nonprofit journals, which do not need to satisfy shareholders. The processing fees, which can run to a thousand dollars or more, depending on the complexities of the text and the process of peer review, can be covered in various ways. They are often included in research grants to scientists, and they are increasingly financed by the author’s university or a group of universities.

While I agree on the need to focus on the public good, I worry that “flipping” will limit who gets published. In STEM fields, where most research is funded, one can build processing fees into the funding, but in the humanities, where much research is not funded, many colleagues would have to pay out of pocket to get published. Darnton mentions that Harvard (his institution) has a program that subsidizes processing fees … of course it would, and therein lies the problem. Those at wealthy institutions will have an advantage in an environment where publishers need processing fees, while those without subsidies (whether private scholars, alternative academics, or instructors) will have to decide whether they can really afford to publish. Creating an economy where it is not the best ideas that get published but those of an elite caste is not a recipe for the public good.

I imagine Darnton recognizes the need for solutions other than processing fees and, in fact, he goes on to talk about the Digital Public Library of America and OpenEdition Books as initiatives that are making monographs available online for free.

I suspect that what will work in the humanities is finding funding for the editorial and publishing functions of journals as a whole rather than for individual articles. We have a number of journals in the digital humanities, like Digital Humanities Quarterly, where the costs of editing and publishing are borne by individuals like Julia Flanders who have made it a labor of love, by the universities that support them, and by our scholarly association, which provides technical support and some funding. DHQ doesn’t charge processing fees, which means that all sorts of people who don’t have access to subsidies can be heard. It would be interesting to poll the authors published and see how many have access to processing fee subsidies. It is bad enough that our conferences are expensive to attend; let’s not skew the published record too.

Which brings me back to the public good. Darnton ends his essay writing about how the DPLA is networking all sorts of collections together. It is not just providing information as a good, but bringing together smaller collections from public libraries and universities. This is one of the possibilities of the internet – that distributed resources can be networked into greater goods rather than having to be centralized. The DPLA doesn’t need to be THE PUBLIC LIBRARY that replaces all libraries the way Amazon is pushing out bookstores. The OpenEdition project goes further and offers infrastructure for publishing knowledge to keep costs down for everyone. A combination of centrally supported infrastructure used by editors who get local support (and credit) will make more of a difference than processing fees, be more equitable, and do more for public participation, which is a good too.

Research Data Management Week

This week the University of Alberta is running a Research Data Management Week. They have sessions throughout the week. I will be presenting on “Weaving Data Management into Your Research.” The need for discussions around research data management is described on the web page:

New norms and practices are developing around the management of research data. Canada’s research councils are discussing the introduction of data management plans within their application processes. The University of Alberta’s Research Policy now addresses the stewardship of research records, with an emphasis on the long-term preservation of data. An increasing number of scholarly journals are requiring authors to provide access to the data behind their submissions for publication. Data repositories are being established in domains and institutions to support the sharing and preservation of data. The series of talks and workshops that have been organized will help you better prepare for this emerging global research data ecosystem.

The University now has language in the Research Policy that the University will:

Ensure that principles of stewardship are applied to research records, protecting the integrity of the assets.

The Research Records Stewardship Guidance Procedure then identifies concrete responsibilities of researchers.

These policies and the larger issue of information stewardship have become important to infrastructure. See my blog entry about the TC3+ document on Capitalizing on Big Data.

Research Records Stewardship Guidance Procedure

The University of Alberta has just passed a Research Records Stewardship Guidance Procedure which says that we “are responsible for the stewardship of the research records created, acquired, managed or preserved.” The procedure specifically says,

The Principal Investigator (PI) is responsible for the collection, maintenance, confidentiality, and secure retention of research records until such time as the University may assume responsibility for their management and preservation.

The good news is that we have excellent support in the Library for dealing with research records. We have the Education and Research Archive where we can deposit data. We also have staff in the Digital Initiatives unit of the Library who can help us develop research management plans.

I joined forces with Geoff Harder and Chuck Humphrey to give a presentation on Data Management Plans (my slides).

Pentametron: With algorithms subtle and discrete

Scott sent me a link to the Pentametron: With algorithms subtle and discrete / I seek iambic writings to retweet. This site creates iambic pentameter poems from tweets by looking at the rhythm of words. It then tries to find rhyming last words to create an AABB rhyme scheme. You can see an article about it on Gawker titled, Weird Internets: The Amazing Found-on-Twitter Sonnets of Pentametron.
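The meter check can be sketched in a few lines. This is a toy version: the STRESS dictionary below is a hand-coded stand-in for the full pronouncing dictionary a real system would need, and Pentametron’s actual implementation is not described here.

```python
# Toy Pentametron-style meter filter. In the STRESS dictionary,
# "1" marks a stressed syllable, "0" an unstressed one, and "x" a
# monosyllable that can fill either metrical slot.
STRESS = {
    "i": "x", "seek": "1", "iambic": "010", "writings": "10",
    "to": "x", "retweet": "01", "with": "x", "algorithms": "1010",
    "subtle": "10", "and": "x", "discrete": "01",
}

def is_iambic_pentameter(line):
    """True if the line scans as five iambs (da-DUM x 5)."""
    try:
        pattern = "".join(STRESS[w] for w in line.lower().split())
    except KeyError:
        return False  # unknown word: reject the tweet
    target = "01" * 5
    return len(pattern) == 10 and all(
        p == t or p == "x" for p, t in zip(pattern, target))

print(is_iambic_pentameter("with algorithms subtle and discrete"))  # True
print(is_iambic_pentameter("i seek iambic writings to retweet"))    # True
```

A real filter would also need a rhyme check on final words (to build the AABB couplets), which a pronouncing dictionary makes possible by comparing sounds rather than spellings.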

Vicar – Access to Abbot TEI-A Conversion!

The brilliant folk at Nebraska and at Northwestern have teamed up to use Abbot and EEBO-MorphAdorner on a collection of TCP-ECCO texts. The Abbot tool is available here: Vicar – Access to Abbot TEI-A Conversion! Abbot tries to convert texts with different forms of markup into a common form. MorphAdorner does part-of-speech tagging. Together they have made available 2,000 ECCO texts that can be studied together.

I’m still not sure I understand the collaboration completely, but I know from experience that analyzing XML documents can be difficult if each document uses XML differently. Abbot tries to convert XML texts into a common form that preserves as much of the local tagging as possible.
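The core idea can be sketched as walking each document’s tree and renaming divergent elements to a shared vocabulary. The TAG_MAP below is entirely hypothetical; Abbot’s actual conversion rules are far richer and schema-driven.

```python
# Minimal sketch of markup normalization: map each project's local
# element names onto a common (TEI-like) vocabulary, keeping any
# unknown tags as-is so local information is preserved.
import xml.etree.ElementTree as ET

TAG_MAP = {"title2": "head", "para": "p"}  # hypothetical mapping

def normalize(xml_text):
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = TAG_MAP.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")

print(normalize("<text><title2>Ch. 1</title2><para>Once</para></text>"))
# <text><head>Ch. 1</head><p>Once</p></text>
```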

Social Digital Scholarly Editing

On July 11th and 12th I was at a conference in Saskatoon on Social Digital Scholarly Editing. This conference was organized by Peter Robinson and colleagues at the University of Saskatchewan. I kept conference notes here.

I gave a paper on “Social Texts and Social Tools.” My paper argued for treating text analysis tools as a “reader” of editions. I took the extreme case of big data text mining and asked what scraping and mining tools want, and don’t want, in a text. I took this extreme view to challenge the scholarly editing assumption that the more interpretation you put into an edition the better. Big data wants to automate the process of gathering and mining texts – it wants “clean” texts that don’t have markup, annotations, metadata, and other interventions that can’t be easily removed. The variety of markup in digital humanities projects makes them very hard to clean.
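To make the point concrete, here is a sketch of what “cleaning” looks like in practice: stripping markup and dropping editorial apparatus (a hypothetical &lt;note&gt; element here) so that only the running text reaches the mining pipeline. When interventions are tangled into the text in project-specific ways, this step becomes much harder.

```python
# Sketch of "cleaning" a marked-up text for mining: remove editorial
# elements (notes, apparatus), then flatten what remains to plain text.
import xml.etree.ElementTree as ET

def strip_element(parent, child):
    """Remove child but keep the text that follows it (its tail)."""
    if child.tail:
        siblings = list(parent)
        i = siblings.index(child)
        if i > 0:
            siblings[i - 1].tail = (siblings[i - 1].tail or "") + child.tail
        else:
            parent.text = (parent.text or "") + child.tail
    parent.remove(child)

def plain_text(xml_text, drop=("note",)):
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in drop:
                strip_element(parent, child)
    # Flatten remaining markup and normalize whitespace.
    return " ".join("".join(root.itertext()).split())

print(plain_text("<p>Call me <note>editorial aside</note>Ishmael.</p>"))
# Call me Ishmael.
```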

The response was appreciative of the provocation, but (thankfully) not convinced that big data was the audience of scholarly editors.

Virtual Research Worlds: New Technology in the Humanities – YouTube

The folk at TextGrid have created a neat video about new technology in the humanities, Virtual Research Worlds: New Technology in the Humanities. The video shows the connection between archives and supercomputers (grid computing). At around 2:20 you will see a number of visualizations from Voyant that they have connected into TextGrid. I love the Links tool spawning words before a bronze statue. Who is represented by the statue?

Tasman: Literary Data Processing

I came across a 1957 article by an IBM scientist, P. Tasman, on the methods used in Roberto Busa’s Index Thomisticus project, titled Literary Data Processing (IBM Journal of Research and Development 1(3): 249–256). The article, which appeared in the journal’s third issue, has an illustration of how they used punch cards for this project.

Image of Punch Card

While the reproduction is poor, you can read the things encoded on the card for each word:

  • Location in text
  • Special reference mark
  • Word
  • Number of word in text
  • First letter of preceding word
  • First letter of following word
  • Form card number
  • Entry card number

At the end Tasman speculates on how these methods developed on the project could be used in other areas:

Apart from literary analysis, it appears that other areas of documentation such as legal, chemical, medical, scientific, and engineering information are now susceptible to the methods evolved. It is evident, of course, that the transcription of the documents in these other fields necessitates special sets of ground rules and codes in order to provide for information retrieval, and the results will depend entirely upon the degree and refinement of coding and the variety of cross referencing desired.

The indexing and coding techniques developed by this method offer a comparatively fast method of literature searching, and it appears that the machine-searching application may initiate a new era of language engineering. It should certainly lead to improved and more sophisticated techniques for use in libraries, chemical documentation, and abstract preparation, as well as in literary analysis.

Busa’s project may have been more than just the first humanities computing project. It seems to be one of the first projects to use computers in handling textual information, and a project that showed the possibilities for searching any sort of literature. I should note that in the issue after the one in which Tasman’s article appears there is an article by H. P. Luhn (developer of the KWIC index) on A Statistical Approach to Mechanized Encoding and Searching of Literary Information (IBM Journal of Research and Development 1(4): 309–317). Luhn specifically mentions the Tasman article and the concording methods developed on the project as being useful to the larger statistical text mining that he proposes. For IBM researchers, Busa’s project was an important first experiment in handling unstructured text.

I learned about the Tasman article in a journal paper deposited by Thomas Nelson Winter on Roberto Busa, S.J., and the Invention of the Machine-Generated Concordance. The paper gives an excellent account of Busa’s project and its significance to concording. Well worth the read!