Text Analysis – Page 27

What is an electronic text?

I came across a thoughtful blog entry responding to somethng I wrote with Ian Lancashire about electronic texts and text analysis in WRT: Writer Response Theory ª Forms of Electronic Texts.

The author, Christy Dena, points out the focus on material characteristics (that an e-text is an electronic version of a written work etc.) and inconsistencies. To be honest I wasn’t trying to come up with a typology with a “continuity of variable.” I was trying to describe the variety of things we call e-texts. Time for a better definition and asking whether we want to use “electronic text” for anything that can be read and has/had an electronic form.

CaSTA Working Papers published

The Working Papers from the First and Second Canadian Symposium on Text Analysis Research (CaSTA) have been published online at the Computing in the Humanities Working Papers. It includes a paper by my on “MIMes and MeRMAids: On the Possibility of Computer-aided Interpretation” which is also coming out through Text Technology.

Unstructured Information Management Architecture (UIMA) from IBM

According to this Reuters article, Search concepts, not keywords, IBM tells business, (August 8th, 2005) IBM is releasing their UIMA SDK (Unstructured Information Management Architecture Software Development Kit) to developers as open-source.
According to an IBM Overview the UIMA provides tools for improving text searching through “analysis technologies, including statistical and rule-based Natural Language Processing (NLP), Information Retrieval (IR), machine learning, and ontologies.” Unstructured information includes not only text, but audio, video and images. This is thanks to Mike Rowse.

http://www.alphaworks.ibm.com/tech/uima/
Continue reading Unstructured Information Management Architecture (UIMA) from IBM

Buzz Engine: Online Analysis

According to a Globe and Mail story, Buzz cuts through the on-line rumour mill, (Jerry Langton, August 4, 2005, Globetechnology Section) Accenture researchers have developed a technology called the Buzz Engine that tracks topics through lists and blogs. It looks like it does something like the culture tracker we developed, graphing the relative frequency of keywords – real-time text analysis.

Here is a quote from Gary Boone, PhD: Weblog:

At Accenture Technology Labs, we have developed the next generation of search engine. Itís a kind of summary engine that focuses on online buzz or discussion. Online Audience Analysis is a buzz engine that interactively shows how much buzz there is on a given topic. You can search for topics of interest and see how much public attention that topic receives. Is anyone talking about the new Xbox? Are more participants in technology discussion sites talking about iPods or about Creative Zen Micros? Online Audience Analysis can show you.

Continue reading Buzz Engine: Online Analysis

Echelon doesn’t seem to work

A story about how the British intelligence services have been closing down al-Qaeda related web sites, Finger points to British intelligence as al-Qaeda websites are wiped out (from the The Sunday Times, July 31st, 2005), comments that automated electronic intelligence gathering systems like Echelon don’t seem to work. In other words text-analysis systems don’t work if people want to subvert them by using simple codes or spamming the net. See my previous posts on Carnivore Documents.
Does this mean it is unlikely to be helpful to students of textuality?

Stemming and Text Analysis

Another candidate for a fundamental procedure for text analysis would be stemming. See What is Stemming? for example algorithms and applications.

Again, is Stemming a basic procedure for humanities computing?

Levenshtein Distance: Fundamental Algorithms in Text Analysis

What are the fundamental algorithms of text analysis? One candidate from computational lignuistics and (CS) might be Levenshtein Distance. This is used in spell checking, speech recognition, and could be used in text analysis in comparison.

But, are there fundamental procedures for literary text analysis? Could the concordance be represented as such a procedure? Or, is the idea of a fundamental algorithm alien to humanities computing?

See also the talk by John Nerbonne who mentions the Levenshtein distance – Nerbonne: Data Deluge.

Text Analysis of E-Mail

StÈfan Sinclair has blogged an interesting story from the New York Times on how Enron Offers an Unlikely Boost to E-Mail Surveillance. Researchers, including Dr. Skillicorn at Queen’s, are using a large collection of Enron e-mail posted by the Federal Energy Regulatory Commission to experiment with e-mail tracking and analysis. A large corpus like the Enron one (over a million messages) can be used as a testbed for social network analysis or diachronic trend analysis. The article also talks about fears that government Echelon-style surveillance of e-mail may become available to corporate intelligence types. I wonder if we can develop useful text analysis tools optimized for e-mail collections like a dialogue of messages on a subject, or the Humanist archives. Some thing for TAPoRware.

Scientists had long theorized that tracking the e-mailing and word usage patterns within a group over time – without ever actually reading a single e-mail – could reveal a lot about what that group was up to. The Enron material gave Mr. Skillicorn’s group and a handful of others a chance to test that theory, by seeing, first of all, if they could spot sudden changes.

For example, would they be able to find the moment when someone’s memos, which were routinely read by a long list of people who never responded, suddenly began generating private responses from some recipients? Could they spot when a new person entered a communications chain, or if old ones were suddenly shut out, and correlate it with something significant?

NITLE: National Institute for Technololgy and Liberal Education: Semantic Indexing

National Institute for Technology and Liberal Education or NITLE (pronounced “nightly”?) have a free semantic indexing tool written in perl that you can download. Their page also has useful starting links on semantic analysis. The project was/is funded by Mellon.
In particular I recommend the introduction to latent semantic indexing they have put up at, Patterns in Unstructured Data: Discovery, Aggregation, and Visualization by Yu, Cuadrado, Ceglowski, and Payne.
Continue reading NITLE: National Institute for Technololgy and Liberal Education: Semantic Indexing

Clusty: Cluster Searching

Clusty the Clustering Engine is a meta-search engine which uses VivÌsimo which is based on technology from Carnegie Mellon. Clusty does a nice job of clustering results from multiple search engines into folders that actually make sense. There are some other neat interface issues that Google could learn from.
They do the clustering by crawling and running some sort of cluster processing on the information. I’m not sure how this works over the engines, though it makes sense over a domain. VivÌsimo also offers enterprise solutions – I wonder if they could be adapted to crawl and cluster humanities texts?
Continue reading Clusty: Cluster Searching