centerNet and Google Book Search

centerNet met with a representative from Google Book Search, Jon Orwant, about how Google could support the humanities. I believe there are four levels of collaboration.

Content Curation Interface: We could partner to make possible the careful cleaning and encoding of the scanned books. In most cases the quality of the OCRed text is still poor. It would be nice to have a social layer that allowed volunteers to sign out texts and clean them up. We could also help with the selection of editions that are scanned.
Collections Research Interface: Google could make it possible to build tools that let users create study collections, that is, subsets of Google Books gathered for research. For this we need access to an API so that research portals can access collections, not just individual texts. Google will want assurance that those who have access don’t abuse it.
Social Research Tools Interface: We need a way to run tools against texts and collections. We need an API so that tools can be plugged in and then access texts and collections. Again there is an issue of access. Perhaps OpenSocial could become a standard for tool plug-ins.
Republication Interface: We need a way to be able to create study sites for research groups or courses that make some subset of texts and tools available for a specific purpose.

In all these cases it is clear that Google doesn’t want to read applications, correct lots of texts, or build tools. For that matter, none of us knows what tools should be written. They see themselves doing smart engineering that creates a platform on which others can build layers (research tools, collections portals, and so on) that might in turn be used by others.

Jon spoke to the centerNet meeting at DH 2009. The motto of Google is to organize the world’s information and make it accessible and useful. They crawl, index, and search the web. One can index and search the world’s books, but it is hard to crawl books (or newspapers or movies).

There are about 120 million works in the world and 165 million manifestations. Google has an agreement in principle with the publishers that has still not been ruled on. (I think I have that right.) If it is approved in court then Google will be able to do some cool things:

Authors/publishers will be able to opt in or out.
If authors/publishers opt in then Google could sell their books if they are still under copyright. They have algorithmic pricing to figure out what to charge.
They could give universities access to the full text of collections of out of date works for a license.
They could create a terminal at every library that has every book that is out of copyright.
They could create a “research corpus” that could be released for experimentation under a Creative Commons license. This could be used in contests like T-REX.

Jon gave some fascinating examples of things his intern has been doing from within the firewall.