{"id":6651,"date":"2017-11-28T17:24:01","date_gmt":"2017-11-28T17:24:01","guid":{"rendered":"http:\/\/theoreti.ca\/?p=6651"},"modified":"2017-11-28T17:24:01","modified_gmt":"2017-11-28T17:24:01","slug":"common-crawl","status":"publish","type":"post","link":"https:\/\/theoreti.ca\/?p=6651","title":{"rendered":"Common Crawl"},"content":{"rendered":"<p>The <a href=\"http:\/\/commoncrawl.org\/\">Common Crawl<\/a> is a project that has been crawling the web and making an open corpus of web data from the last 7 years available for research. Their crawl corpus is petabytes of data and is available as WARC (Web ARChive) files. For example, their <a href=\"http:\/\/commoncrawl.org\/2013\/11\/new-crawl-data-available\/\">2013 dataset<\/a> is 102TB and has around 2 billion web pages. Their collection is not as complete as the <a href=\"https:\/\/archive.org\/\">Internet Archive<\/a>, which goes back much further, but it is available in large datasets for research.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Common Crawl is a project that has been crawling the web and making an open corpus of web data from the last 7 years available for research. Their crawl corpus is petabytes of data and is available as WARC (Web ARChive) files. For example, their 2013 dataset is 102TB and has around 2 billion web pages. 
&hellip; <a href=\"https:\/\/theoreti.ca\/?p=6651\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Common Crawl<\/span><\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[54,31,60,16],"tags":[75],"class_list":["post-6651","post","type-post","status-publish","format-standard","hentry","category-big-data","category-social-networking","category-surveillance","category-text-analysis","tag-web-archives"],"_links":{"self":[{"href":"https:\/\/theoreti.ca\/index.php?rest_route=\/wp\/v2\/posts\/6651","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/theoreti.ca\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/theoreti.ca\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/theoreti.ca\/index.php?rest_route=\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/theoreti.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6651"}],"version-history":[{"count":1,"href":"https:\/\/theoreti.ca\/index.php?rest_route=\/wp\/v2\/posts\/6651\/revisions"}],"predecessor-version":[{"id":6652,"href":"https:\/\/theoreti.ca\/index.php?rest_route=\/wp\/v2\/posts\/6651\/revisions\/6652"}],"wp:attachment":[{"href":"https:\/\/theoreti.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6651"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/theoreti.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6651"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/theoreti.ca\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6651"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}