Text Technology and TAPoR – Page 2

Supporting Digital Scholarship

The Tri-Council Agencies (Research councils of Canada) and selected other institutions (going under the rubric TC3+) have released an important Consultation Document titled Capitalizing on Big Data: Toward a Policy Framework for Advancing Digital Scholarship in Canada. You can see a summary blog entry from the CommerceLab, How big data is reshaping the future of digital scholarship in Canada. The document suggests that we have many of the components of a “well-functioning digital infrastructure ecosystem for research and innovation”, but that these are not coordinated and Canada is not keeping up. They propose three initiatives:

Establishing a Culture of Stewardship
Coordination of Stakeholder Engagement
Developing Capacity and Future Funding Parameters

The first initiative is about research data management, something we have been working on in the digital humanities for some time. It is great to see a call from our funding agencies.

Tool Discourse

We are finally getting results in a long slow process of trying to study tool discourse in the digital humanities. Amy Dyrbe and Ryan Chartier are building a corpus of discourse around tools that includes tool reviews, articles about what people are doing with tools, web pages about tools and so on. We took the first coherent chunk and Ryan has been analyzing it with R. The graph above shows which years have the most characters. My hypothesis was that tool reviews and discourse dropped off in the 1990s as the web became more important. This seems to be wrong.

Here are the high-frequency words (with stop words removed). Note the modal verbs “can”, “will”, and “may.” They indicate the potentiality of tools.

“can” 2305
“one” 1996
“text” 1940
“word” 1931
“words” 1859
“program” 1606
“ii” 1514 (Not sure why)
“will” 1361
“language” 1307
“data” 1285
“two” 1188
“system” 1183
“computer” 1116
“used” 1115
“use” 942
“user” 939
“file” 890
“first” 870
“may” 853
“also” 837
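
For anyone who wants to try this kind of count on their own texts, here is a minimal sketch in Python. It is not the R code Ryan used; the tiny stop list, the tokenizing rule, and the names `STOP_WORDS`, `top_words`, and `sample` are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list (an assumption for this sketch);
# a real study would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}


def top_words(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent lowercase word tokens, skipping stop words."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


# Made-up sample sentence, not taken from the tool discourse corpus.
sample = "The program can read the text and the program can count words in the text."
for word, count in top_words(sample, n=3):
    print(word, count)
```

The choice of stop list and tokenizer changes the ranking, so counts produced this way are only comparable across texts processed the same way.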

Crime’s Digital Past – Science News

Tim sent me a link to another news story on the Criminal Intent project that I am part of. This one is in Science News and is titled Crime’s Digital Past. The article is by Bruce Bower and dated July 30th, 2011 (which, I know, is in the future). One of the better stories.

Google Ngram Viewer

Google has released a neat new tool that uses their Google Books database. The Google Ngram Viewer lets you plot the relative frequencies of words and phrases over time.

Information about the tool can be found at http://ngrams.googlelabs.com/info.

The graph above shows truth (blue) graphed against false (red).
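
The basic idea of a relative frequency over time can be sketched in a few lines of Python. This is only an illustration of the concept, not how Google computes its numbers; the function name `relative_frequency_by_year` and the toy corpus are hypothetical.

```python
import re


def relative_frequency_by_year(texts_by_year, word):
    """Map each year to count(word) / total word tokens in that year's text."""
    word = word.lower()
    result = {}
    for year, text in sorted(texts_by_year.items()):
        # Same simple tokenizing assumption: runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        # Guard against empty texts so we never divide by zero.
        result[year] = tokens.count(word) / len(tokens) if tokens else 0.0
    return result


# Hypothetical toy corpus; a real one would hold all the text for each year.
toy_corpus = {
    1900: "truth and truth alone",
    1950: "false hopes and true words",
}
print(relative_frequency_by_year(toy_corpus, "truth"))
print(relative_frequency_by_year(toy_corpus, "false"))
```

Dividing by the number of tokens in each year is what makes the years comparable, since far more text survives from some years than from others.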

Digitization Day

The CIRCA Histories and Archives group I am part of is organizing the University of Alberta’s first Digitization Day.

This one-day event is a chance for research projects that are digitizing evidence to meet up with each other and with units on campus that provide relevant research services. Projects that are creating digital archives of different sorts will give short presentations as will units on campus that support research.

The idea is to bring a lot of digitization projects together to learn about each other and what is happening on campus. My sense is that we have hit a critical mass on campus, and now that we have a trusted digital repository, ERA (Education and Research Archive), it is time to start talking and sharing knowledge. Each project should not have to reinvent itself.

TAPoR portal has moved

The TAPoR Portal has moved to a new server at the University of Alberta. The new location will allow us here to start redesigning it and developing version 2.0. (Or is it now version 3.0?) I underestimated how much work it is to move something so complex. We had to work on bugs, we had to warn users, we had to set up hardware here. Kamal Ranaweera worked very hard to do this – Bravo!

Some links related to the move:

If you have trouble with the portal go to http://tada.mcmaster.ca/Main/TAPoRPortalMove for information
If you are interested in the redesign go to http://tada.mcmaster.ca/Main/TAPoRRedesign

Towards a Methods Commons

Well my vacation is over and I’m facilitating a retreat on text methods across disciplines. (See Towards a Methods Commons.) With support from the ITST program at SSHRC we brought together 15 linguists, philosophers, historians, and literary scholars to discuss methods in a structured way. The goal is to sketch a commons that gathers “recipes” that show people how to do research things with electronic texts. Stay tuned for a draft web site in about 6 months.

Chronologie des supports, des dispositifs spatiaux, des outils de repérage de l’information

Christian directed me to a fascinating chronology of information technology (in French) by Sylvie Fayet-Scribe. It is called Chronologie des supports, des dispositifs spatiaux, des outils de repérage de l’information (roughly, a chronology of media, spatial arrangements, and tools for locating information). The web design isn’t the best, but it seems detailed and annotated. It seems like a good place to start if you want to understand the types of information aids, from encyclopedias to indexes and so on. Her division of time into epochs is also interesting. The bibliography is also good.

Adonis Meeting

I was at a meeting organized by the Adonis project (See TGE Adonis | Très grand équipement du CNRS pour les sciences humaines et sociales) to look at international collaboration. Adonis is running a number of interesting projects:

Revues.org is a platform for e-journals in France.
Calenda is a shared calendar of events for French academics.
Hypotheses is a shared blog environment for news about projects.
Lodel is their content management system for publications.

Some other projects mentioned were:

Plume hosts and lets people discover open source software from university research projects.
SourceSup is a project management and code versioning environment for academic projects.

We are struggling with issues of international collaboration, archiving data, interoperation and so on. We all see the value of large national (or international) digital archives, but the funding is oriented to projects and not long-term archiving. Some of the issues that came up:

Lou Burnard made an important distinction between archiving and backup. A lot of people want backup for their work or their project and think that archiving services will provide this; they don’t really understand that backup is not archiving. That doesn’t mean that backup isn’t important. Apparently in the student riots in Paris last year a number of computers with irreplaceable data were destroyed.
The limitations of centralized solutions. We are all tempted by the thought of long-term central funding to run services, but there are dangers to such centralization. If central funding is cut or shifted (as happened with the AHDS) then everything disappears. Can we imagine decentralized solutions? Would they work? I’d like to see more social research initiatives that support decentralized solutions. I think in the current economic climate we have to explore these.
David Robey made the point that we have to do a better job of explaining the value of digital resources and services. We need to educate ourselves to gather evidence of value and that includes the opportunity costs.
Paolo D’Iorio argued that there are certain primitive functions that scholarly systems need, including Citation (reliable ways to point to other works), Consensus (agreement in a field as to what is of value and how to assess that), and Discovery/Dissemination (ways of finding and getting at scholarship).

You can follow some of the meeting if you search Twitter for #ADONIS.