The Google Book Search Settlement, if approved by Judge Chin, may be a turning point in textual research. In principle, if the settlement goes through, Google will release the full 7-10 million books for research ("non-consumptive") use. Even if we get only the 500,000 public domain books for research, we will have a historic corpus far larger than anything else available. To quote the title of Greg Crane's D-Lib article, "What Do You Do With a Million Books?" And what effect will millions of books have on the textual disciplines?
There are, understandably, a lot of concerns about the settlement, especially about the ownership of orphan works. The American Library Association has a web site on the settlement, as do others. I think we also need to start talking about how to develop a research infrastructure that would allow the millions of books to be used effectively. What would it look like? What could we do? Some ideas:
- For the collection to be usable only by researchers, there would have to be some sort of reasonable firewall.
- It would be nice if it were truly multilingual/multicultural from the start. The books are, after all.
- It would be nice if there were a mechanism for researchers to correct the OCRed text where they see typos. Why couldn't we clean up the plain text together?
- It would be nice if there were an open-architecture search engine scaled to handle the collection and usable by research tools (a rough sketch of what such an interface might look like follows this list).
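To make the last two points a bit more concrete, here is a minimal sketch of what a "non-consumptive" research API might look like from a researcher's tool. Everything here is hypothetical: the endpoint, parameters, and response fields are invented for illustration, and nothing like this exists in the settlement yet.

```python
# Hypothetical sketch of a research API client: the URL, routes, and fields
# below are invented for illustration, not part of any existing service.
import requests

API = "https://books-research.example.org/v1"  # placeholder endpoint

def search(query, lang=None, page=1):
    """Full-text search over the corpus, returning derived data
    (counts, snippets) rather than full page images."""
    params = {"q": query, "page": page}
    if lang:
        params["lang"] = lang
    resp = requests.get(f"{API}/search", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

def submit_correction(volume_id, page, original, corrected):
    """Propose a fix to the OCRed plain text; corrections would be
    reviewed or voted on rather than applied directly."""
    payload = {
        "volume": volume_id,
        "page": page,
        "original": original,
        "corrected": corrected,
    }
    resp = requests.post(f"{API}/corrections", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    hits = search("humanities computing", lang="en")
    print(hits.get("total", 0), "volumes matched")
```

The point of the sketch is simply that an open, documented interface like this would let research tools query the corpus and feed corrections back without anyone needing bulk access to the books themselves.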
Update: Matt pointed me to a Wall Street Journal article, "Tech's Bigs Put Google's Books Deal In Crosshairs."