The Common Crawl is a project that has been crawling the web and making an open corpus of web data from the last 7 years available for research. Their crawl corpus amounts to petabytes of data, distributed as WARCs (Web ARChive files). For example, their 2013 dataset is 102 TB and contains around 2 billion web pages. Their collection is not as complete as the Internet Archive's, which goes back much further, but it is published as large datasets that are convenient for research.
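To give a feel for working with this data, here is a minimal sketch of iterating over the records in one WARC file. It assumes the third-party warcio package and a hypothetical locally downloaded file name; it is one common way to read WARCs, not something prescribed by Common Crawl itself.

```python
from warcio.archiveiterator import ArchiveIterator

# 'example.warc.gz' is a placeholder for any downloaded Common Crawl WARC file.
with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Each 'response' record holds one fetched page plus its headers.
        if record.rec_type == 'response':
            url = record.rec_headers.get_header('WARC-Target-URI')
            body = record.content_stream().read()
            print(url, len(body), 'bytes')
```

Because a single crawl spans many such files, real processing jobs typically stream them one at a time rather than loading anything into memory at once.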