The Google Book Search Settlement, if approved by Judge Chin, may be a turning point in textual research. In principle, if the settlement goes through, Google will release the full 7-10 million books for research ("non-consumptive") use. Even if we get only the 500,000 public domain books for research, we will have a historic corpus far larger than anything else available. To quote the title of Greg Crane's D-Lib article, "What Do You Do With a Million Books?" And what effect will millions of books have on the textual disciplines?
There are, understandably, a lot of concerns about the settlement, especially about the ownership of orphan works. The American Library Association has a web site on the settlement, as do others. I think we also need to start talking about how to develop a research infrastructure that would allow the millions of books to be used effectively. What would it look like? What could we do? Some ideas:
- For the collection to be usable only by researchers, there would have to be some sort of reasonable firewall.
- It would be nice if it were truly multilingual/multicultural from the start. The books are, after all.
- It would be nice if there were a mechanism for researchers to correct the OCRed text where they see typos. Why couldn't we clean up the plain text together?
- It would be nice if there were an open-architecture search engine scaled to handle the collection and usable by research tools (a rough sketch of what such an interface might look like follows this list).
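To make the last two points a bit more concrete, here is a minimal sketch of what a "non-consumptive" research API might look like from a researcher's tool. Everything here is hypothetical: the endpoint, parameters, and response fields are invented for illustration, and nothing like this exists in the settlement yet.

```python
# Hypothetical sketch of a research API client: the URL, routes, and fields
# below are invented for illustration, not part of any existing service.
import requests

API = "https://books-research.example.org/v1"  # placeholder endpoint

def search(query, lang=None, page=1):
    """Full-text search over the corpus, returning derived data
    (counts, snippets) rather than full page images."""
    params = {"q": query, "page": page}
    if lang:
        params["lang"] = lang
    resp = requests.get(f"{API}/search", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

def submit_correction(volume_id, page, original, corrected):
    """Propose a fix to the OCRed plain text; corrections would be
    reviewed or voted on rather than applied directly."""
    payload = {
        "volume": volume_id,
        "page": page,
        "original": original,
        "corrected": corrected,
    }
    resp = requests.post(f"{API}/corrections", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    hits = search("humanities computing", lang="en")
    print(hits.get("total", 0), "volumes matched")
```

The point of the sketch is simply that an open, documented interface like this would let research tools query the corpus and feed corrections back without anyone needing bulk access to the books themselves.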
Update: Matt pointed me to a Wall Street Journal article, "Tech's Bigs Put Google's Books Deal In Crosshairs."