Cornell Web Lab: Large-scale web research

The Cornell Web Lab is an interesting example of a high-performance computing (HPC) project in the humanities and social sciences. As the project describes itself:

The Web Laboratory is a joint project of Cornell University and the Internet Archive to provide data and computing tools for research about the Web and the information on the Web.

In a paper on the project, "A Research Library Based on the Historical Collections of the Internet Archive", William Arms and colleagues point out that the data challenge of the social sciences (and humanities) is that the data is poorly structured and there is a lot of it. The Internet Archive is a case in point; as of 2006 it held 5 to 6 petabytes of web page data. While it is amazing that we have such archives in computer (and human) readable form, a collection of that size is too large for most researchers to work with directly. The Web Lab approach is to provide basic HPC services for extracting subsets of the whole, which can then be used by other tools.
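To make the idea of "extracting subsets" concrete, here is a minimal sketch in Python. It is not the Web Lab's actual interface; the `PageRecord` type, its fields, and the `extract_subset` function are all hypothetical names invented for illustration. It assumes each archived page carries a URL and a crawl date, and it filters a stream of such records down to one domain and one date range, which is the kind of smaller slice another tool could then handle.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Iterator
from urllib.parse import urlsplit


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical metadata for one archived page (not the Web Lab schema)."""
    url: str
    crawl_date: date
    size_bytes: int


def extract_subset(
    records: Iterable[PageRecord],
    domain: str,
    start: date,
    end: date,
) -> Iterator[PageRecord]:
    """Yield records whose host is `domain` (or a subdomain of it)
    and whose crawl date falls within [start, end] inclusive."""
    for record in records:
        host = urlsplit(record.url).hostname or ""
        in_domain = host == domain or host.endswith("." + domain)
        if in_domain and start <= record.crawl_date <= end:
            yield record
```

Because the function is a generator, it reads records one at a time and never holds the whole collection in memory. At petabyte scale the same filter would have to run in parallel across many machines, which is where the HPC services come in.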