{"id":1064,"date":"2005-11-27T20:01:43","date_gmt":"2005-11-28T00:01:43","guid":{"rendered":"http:\/\/www.theoreti.ca\/?p=1064"},"modified":"2005-11-27T20:01:43","modified_gmt":"2005-11-28T00:01:43","slug":"web-crawler-nutch","status":"publish","type":"post","link":"https:\/\/theoreti.ca\/?p=1064","title":{"rendered":"Web Crawler: Nutch"},"content":{"rendered":"<p><a title=\"Welcome to Nutch!\" href=\"http:\/\/lucene.apache.org\/nutch\/\">Nutch<\/a> is &#8220;open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.&#8221; There is a <a title=\"FrontPage - Nutch Wiki\" href=\"http:\/\/wiki.apache.org\/nutch\/\">Nutch Wiki<\/a> with links to news, presentations and articles on it. <\/p>\n<p>Nutch is basically an open Google-like engine that indexes an intranet (or the web) and gives you search capability. This sort of tool could be useful if there were ways to adapt it to discipline-specific crawling.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Nutch is &#8220;open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.&#8221; There is a Nutch Wiki with links to news, presentations and articles on it. 
Nutch is basically an open Google-like engine that indexes an intranet (or the &hellip; <a href=\"https:\/\/theoreti.ca\/?p=1064\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Web Crawler: Nutch<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16],"tags":[],"class_list":["post-1064","post","type-post","status-publish","format-standard","hentry","category-text-analysis"],"_links":{"self":[{"href":"https:\/\/theoreti.ca\/index.php?rest_route=\/wp\/v2\/posts\/1064","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/theoreti.ca\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/theoreti.ca\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/theoreti.ca\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/theoreti.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1064"}],"version-history":[{"count":0,"href":"https:\/\/theoreti.ca\/index.php?rest_route=\/wp\/v2\/posts\/1064\/revisions"}],"wp:attachment":[{"href":"https:\/\/theoreti.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1064"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/theoreti.ca\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1064"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/theoreti.ca\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1064"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}