Software, Tools, Lists, Resources is a good list of resources for computational linguistics. It has a nice list of lists like stop words/function words.
I should check the functionality of these tools against TAPoR.
This came from Stéfan Sinclair.
Category: Text Analysis
Using TACT with Electronic Texts (for free)
Using TACT with Electronic Texts, a classic introduction and manual, is now available for free as a PDF from the MLA! The MLA and the authors (Ian Lancashire et al.) should be congratulated for putting this up. Even if you don’t use TACT, the opening chapters are relevant to anyone interested in text analysis. Bravo! This is thanks to Judith Altreuter.
EPIC: Carnivore Documents

Omnivore Source Code FOIA Document
Did the FBI use text analysis for network-tapping? I found an interesting page at the Electronic Privacy Information Center about Carnivore and Omnivore (its predecessor), two Internet monitoring systems created by the FBI. EPIC has an EPIC Carnivore Page with a summary and scans of documents received through Freedom of Information requests. See also EPIC Carnivore FOIA Documents. The documents are fascinating given all the blacked-out lines whose contents you can try to guess. There is a beauty to these documents, with their heavy black regions and “Secret” crossed out all over. Note how EPIC uses this aesthetic in their annual report.
Comparing: Jack Lynch
Text Analysis with Compare is an essay by Jack Lynch on approaches to comparing texts to find allusions from one to another. It lays out some simple methods and their advantages/disadvantages. I think we are going to try to implement some of these in TAPoRware.
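One simple way to look for possible allusions is to list the word sequences two texts have in common. The sketch below is my own illustration and not necessarily one of the methods in Lynch's essay; the function names and the default phrase length of three words are assumptions.

```python
import re

def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_phrases(text_a, text_b, n=3):
    """Return the n-word phrases that occur in both texts, sorted alphabetically."""
    common = word_ngrams(text_a, n) & word_ngrams(text_b, n)
    return sorted(" ".join(gram) for gram in common)

if __name__ == "__main__":
    source = "To be, or not to be, that is the question."
    later = "He asked whether to be or not to be was the question."
    for phrase in shared_phrases(source, later):
        print(phrase)
```

Exact matching like this is cheap but misses paraphrase and changed word order, which is the kind of trade-off any comparison method has to weigh.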
Google vs. Microsoft
What’s Next for Google is an in-depth article by Charles H. Ferguson from the January 2005 issue of Technology Review (from MIT). The article looks at Google and how it might respond if Microsoft seriously decides to dominate the search engine business.
Text Analysis and Alzheimer’s
Both The Globe and Mail and CBC ran stories about researchers who compared word lists from Iris Murdoch’s books looking at word variety. See CBC News: Iris Murdoch novel may be evidence of Alzheimer’s. Now that computers index our files (a feature in Tiger, for example), could we get them to warn us when our word variety goes down? Could my e-mail client or blog be fitted to alert me to changes in my use of language?
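As a rough sketch of what such a warning might compute, here is a type-token ratio (distinct words divided by total words) compared between an older and a newer sample of writing. This is only an illustration and not the measure the researchers used; the function names and the 0.1 tolerance are assumptions.

```python
import re

def type_token_ratio(text):
    """Distinct words divided by total words; 0.0 for empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)

def variety_dropped(baseline_text, recent_text, tolerance=0.1):
    """True if recent writing is less varied than the baseline by more than tolerance."""
    return type_token_ratio(recent_text) < type_token_ratio(baseline_text) - tolerance

if __name__ == "__main__":
    old_mail = "The committee debated several intricate proposals at length."
    new_mail = "The thing is the thing and the thing is good."
    print(variety_dropped(old_mail, new_mail))
```

The ratio falls as a text gets longer, so the two samples should be about the same length for the comparison to mean anything.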
Comparison Engine and Clustering Engine
Antonio Gulli has two interesting tools up on the web. The first is a Rank Comparison Engine, which will query a bunch of search engines, get their list of hits and build a table of points (pills) showing which hits are unique to which index and which shared. The results are interactive, allowing you to mouse-over points to see the short description.
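The core of such a comparison can be sketched in a few lines: given each engine's list of hits, work out which hits are unique to one index and which are shared. This is my guess at the idea and not Gulli's code; the engine names, URLs and function name are hypothetical.

```python
from collections import defaultdict

def compare_hits(results):
    """results maps an engine name to its ranked list of URLs.

    Returns (unique, shared): unique maps a URL to the single engine that
    returned it; shared maps a URL to the sorted list of engines that did.
    """
    found_by = defaultdict(set)
    for engine, urls in results.items():
        for url in urls:
            found_by[url].add(engine)
    unique = {url: next(iter(engines))
              for url, engines in found_by.items() if len(engines) == 1}
    shared = {url: sorted(engines)
              for url, engines in found_by.items() if len(engines) > 1}
    return unique, shared

if __name__ == "__main__":
    # Hypothetical hit lists; a real tool would query the engines.
    hits = {
        "EngineA": ["example.org/1", "example.org/2"],
        "EngineB": ["example.org/2", "example.org/3"],
    }
    unique, shared = compare_hits(hits)
    print(unique)
    print(shared)
```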
The second is the SnakeT Clustering Engine (SNippet Aggregation for Knowledge ExTraction). It searches various indexes and builds a list of high-frequency words that cluster with the query word. You can then navigate by the co-occurring words. Neat use of text analysis for concept exploration.
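To give a feel for the co-occurrence idea, here is a minimal sketch that counts the words appearing in the same snippet as a query word. It is not how SnakeT works internally, which I don't know; the tiny stop-word list, the sample snippets and the function name are assumptions.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}

def cooccurring_words(snippets, query, top=5):
    """Count words that share a snippet with the query word.

    Each word counts once per snippet. Returns the top (word, count) pairs.
    """
    counts = Counter()
    query = query.lower()
    for snippet in snippets:
        words = set(re.findall(r"[a-z]+", snippet.lower()))
        if query in words:
            counts.update(words - STOP_WORDS - {query})
    return counts.most_common(top)

if __name__ == "__main__":
    snippets = [
        "Jaguar is a car maker",
        "The jaguar is a big cat",
        "Jaguar car dealers",
        "Python is a language",
    ]
    print(cooccurring_words(snippets, "jaguar", top=3))
```

Words that score highly could then serve as the labels you click on to narrow the results.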
My one complaint is the design – he needs a graphic designer to make these sing.
Getty Thesaurus of Geographic Names Online (TGN)
The Getty Thesaurus of Geographic Names is “a hierarchical vocabulary of around 1.1 million names, and coordinates and other information for around 892,000 geographic places.” (From Getty Vocabularies Download Center)
In other words, it is a controlled vocabulary of place names that can be searched online or, with permission, downloaded in XML form (or as a relational database or MARC). I wonder if this could be used to create text engines that search by place and use the TGN records (which contain hierarchical information) to provide context? To put it another way, is TGN an ontology?
Jason Lewis: ActiveText
At the Textologies workshop organized here at McMaster by Travis Kroeker and Andrew Mactavish, I saw a neat project, ActiveText, demonstrated by Jason E. Lewis of Concordia. ActiveText is a C++ library that can be used to make active text. Jason has gotten it right: the objects he handles go from glyphs up to passages. They can have behaviors so that segments of text are activated. See the animation.
Copernic: NRC Summarizing Tools
Copernic is a company that has licensed text summarization technology from the Institute for Information Technology at the National Research Council. They have agent and summarizer tools that can help with searching the web and managing results. The Copernic Summarizer, in particular, looks like an interesting application of summarization for everyday use, including the ability to summarize web pages in real time. Neat!