Clusty the Clustering Engine is a meta-search engine that uses Vivísimo, which is based on technology from Carnegie Mellon. Clusty does a nice job of clustering results from multiple search engines into folders that actually make sense. There are some other neat interface ideas that Google could learn from.
They do the clustering by crawling and running some sort of cluster processing on the information. I'm not sure how this works across multiple engines, though it makes sense over a single domain. Vivísimo also offers enterprise solutions – I wonder if they could be adapted to crawl and cluster humanities texts?
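Vivísimo's actual clustering algorithm is proprietary, but the basic idea of grouping result snippets into labelled folders can be sketched in a few lines. This is only an illustration under assumed inputs: the snippets are invented, and simple term overlap (Jaccard similarity) with a greedy grouping pass stands in for whatever Vivísimo really does.

```python
# A minimal, pure-Python sketch of clustering search-result snippets
# into "folders". Term overlap is a stand-in for Vivisimo's
# proprietary method; the snippets and threshold are made up.
def terms(snippet):
    return set(snippet.lower().split())

def overlap(a, b):
    # Jaccard similarity between two term sets
    return len(a & b) / len(a | b)

def cluster(snippets, threshold=0.2):
    folders = []  # each folder: [set of terms, list of member snippets]
    for s in snippets:
        t = terms(s)
        for folder in folders:
            if overlap(t, folder[0]) >= threshold:
                folder[0] |= t       # grow the folder's term set
                folder[1].append(s)
                break
        else:
            folders.append([t, [s]])  # start a new folder
    return folders

results = [
    "jaguar speed car engine",
    "new jaguar car models",
    "jaguar habitat rainforest cat",
    "rainforest cat conservation",
]
for folder_terms, members in cluster(results):
    print(sorted(folder_terms)[:3], len(members))
```

With these invented snippets the two car results land in one folder and the two wildlife results in another, which is the kind of disambiguation Clusty's folders give you for an ambiguous query.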
ACH/ALLC Text Analysis Texts
The ACH/ALLC Conference 2005 program is now up. Martin Holmes has set up a neat page with access to raw XML and plain text versions of the abstracts for text analysis. I have been playing around with the text using the TAPoRware Tools. I find the plain text tools work well on the plain text of the prose. Very cool, and reflexive in a way that suits our community.
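The simplest of the plain-text operations is a word-frequency listing. As a hedged sketch of that kind of tool (the sample text is invented, not from the abstracts, and TAPoRware's own implementation may differ):

```python
# A small sketch of a word-frequency listing over plain text,
# the sort of operation the TAPoRware plain-text tools perform.
from collections import Counter
import re

text = "Text analysis of text is analysis of the text itself."
words = re.findall(r"[a-z]+", text.lower())  # crude tokenizer
freq = Counter(words)
print(freq.most_common(3))
```

Run over a real abstract, the top of the list is dominated by function words, which is why stop-word filtering usually comes next.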
DARPA Global Autonomous Language Exploitation
DARPA seeks strong, responsive proposals from well-qualified sources for a new research and development program called GALE (Global Autonomous Language Exploitation) with the goal of eliminating the need for linguists and analysts and automatically providing relevant, distilled actionable information to military command and personnel in a timely fashion.
Global Autonomous Language Exploitation (GALE) is an unbelievably ambitious DARPA project from the same office that brought us the ARPANET (the Information Processing Technology Office). Imagine if they succeed. Thanks to Greg Crane for pointing this out.
Update – the DARPA Information Processing Technology Office page on GALE is here. The GALE Proposer Pamphlet (BAA 05-28) describes the types of discourse to be processed and the desired results.
Engines must be able to process naturally-occurring speech and text of all the following types:
- Broadcast news (radio, television)
- Talk shows (studio, call-in)
- Newswire
- Newsgroups
- Weblogs
- Telephone conversations
. . .
DARPA’s desired end results include:
- A transcription engine that produces English transcripts with 95% accuracy
- A translation engine producing English text with 95% accuracy
- A distillation engine able to fill knowledge bases with key facts and to deliver useful information as proficiently as humans can.
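Figures like "95% accuracy" for transcription are conventionally computed as word accuracy, i.e. one minus the word error rate (WER), which counts insertions, deletions, and substitutions via an edit-distance alignment. A minimal sketch of that calculation, with invented example sentences:

```python
# A sketch of word-accuracy scoring (1 - WER), the usual metric
# behind transcription figures like "95% accuracy". Computes
# Levenshtein distance over words; the sentences are invented.
def word_errors(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)]

ref = "the quick brown fox jumps over the lazy dog near the old barn"
hyp = "the quick brown fox jumps over a lazy dog near the old barn"
errors = word_errors(ref, hyp)
accuracy = 1 - errors / len(ref.split())
print(f"{accuracy:.0%}")
```

One substituted word in a thirteen-word sentence already drops accuracy to about 92%, which gives a sense of how demanding a sustained 95% target is across talk shows, weblogs, and telephone speech.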
TADA talk
Here is a blog entry on a short talk I gave about text analysis and collaboration. Stéfan Sinclair had the neat idea of having students enter notes about the conference into a blog on the Text Analysis Developers Alliance as the conference went along.
My talk began by offering a model for how computing practices change interpretation and the role of text analysis. I then went on to talk about different types of interpretation: between developers, between developers and researchers, and between researchers.
TADA: Text Analysis Summit Blog
There is a blog on the discussion at the Text Analysis Developers Alliance site. It is being updated by participants.
Text Analysis Summit
For today and the next two days I am at the Text Analysis Summit that I blogged about earlier.
I am typing my notes into a wiki page on a new wiki about text analysis; see wikiTA.
TADA: Text Analysis Summit
My colleague Stéfan Sinclair is organizing a Text Analysis Summit, which promises to be a great retreat from busyness.
Software, Tools and Lists for Text Analysis
Software, Tools, Lists, Resources is a good list of resources for computational linguistics. It has a nice list of lists like stop words/function words.
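Stop-word (function-word) lists like the ones collected there are typically applied as a simple filter before counting or clustering. A small illustration, where both the list and the sentence are made up for the example:

```python
# Applying a stop-word / function-word list of the kind these
# resources collect. Both the list and the sentence are invented.
stop_words = {"the", "of", "and", "a", "to", "is", "in"}

tokens = "the analysis of a text is the start of interpretation".split()
content = [t for t in tokens if t not in stop_words]
print(content)  # only the content words survive the filter
```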
I should check the functionality of these tools against TAPoR.
This came from Stéfan Sinclair.
Using TACT with Electronic Texts (for free)
Using TACT with Electronic Texts, a classic introduction and manual, is now available for free as a PDF from the MLA! The MLA and the authors (Ian Lancashire et al.) should be congratulated for putting this up. Even if you don’t use TACT, the opening chapters are relevant to anyone interested in text analysis. Bravo! This is thanks to Judith Altreuter.
EPIC: Carnivore Documents
Omnivore Source Code FOIA Document
Did the FBI use text analysis for network tapping? I found an interesting page on the Electronic Privacy Information Centre about Carnivore and Omnivore (its predecessor), two Internet monitoring systems created by the FBI. EPIC has a Carnivore Page with a summary and scans of documents received through Freedom of Information requests. See also EPIC Carnivore FOIA Documents. The documents are fascinating given all the blacked-out lines that you can try to guess at. There is a beauty to these documents, with their heavy black regions and “Secret” crossed out all over. Note how EPIC uses this aesthetic in their annual report.