Web Crawler: Nutch

Nutch is “open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.” There is also a Nutch Wiki with links to news, presentations and articles about the project.

Nutch is basically an open, Google-like engine that crawls and indexes an intranet (or the web) and gives you search capability over what it finds. This sort of tool could be useful if there were ways to adapt it to discipline-specific crawling.
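
To make "discipline-specific crawling" a little more concrete, here is a minimal sketch of the kind of filter a focused crawler might apply when deciding whether to follow a link. It is not Nutch's API or its plugin mechanism. The host list, the vocabulary, the threshold, and the function names (host_allowed, topic_score, should_follow) are all hypothetical and chosen only for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical configuration for one discipline (chemistry is just an example).
ALLOWED_HOST_SUFFIXES = ("example-chem.org", "chem.example.edu")
TOPIC_TERMS = {"chemistry", "molecule", "spectroscopy", "reaction"}


def host_allowed(url: str) -> bool:
    """Return True if the URL's host is a listed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == s or host.endswith("." + s) for s in ALLOWED_HOST_SUFFIXES)


def topic_score(text: str) -> float:
    """Fraction of words in the text that belong to the discipline vocabulary."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in TOPIC_TERMS)
    return hits / len(words)


def should_follow(url: str, page_text: str, threshold: float = 0.05) -> bool:
    """Follow a link if its host is trusted or the linking page looks on-topic."""
    return host_allowed(url) or topic_score(page_text) >= threshold
```

A crawler would call should_follow for each outgoing link and skip those that return False. A real deployment would need a much richer relevance test than a word list, but the shape of the decision is the same: restrict by site, by content, or by both.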