A story by Suevon Lee in My Way News, “English Language Hits 1 Billion Words,” talks about the Oxford English Corpus and how their “massive language research database … has officially hit a total of 1 billion words…” The corpus is collected by a crawler and analyzed with Sketch Engine. I can’t figure out whether the Oxford Corpus is publicly accessible, but one can get an account to try Sketch Engine.
Mark Olsen: Toward meaningful computing
Mark Olsen and Shlomo Argamon have just published a viewpoint in the Communications of the ACM titled “Toward meaningful computing” that argues (among other things):
Current initiatives by Google, Yahoo, and a consortium of European research institutions to digitize the holdings of major research libraries worldwide promise to make the world’s knowledge accessible as never before. Yet in order to completely realize this promise, computer scientists must still develop systems that deal effectively with meaning, not just with data and information. This grand research and development challenge motivates our call here to improve collaboration between computer scientists and scholars in the humanities.
They set an ambitious but, I think, doable agenda for us.
Digital Tools Summit for Linguistics
Announcing an interesting call for position statements for a Digital Tools Summit for Linguistics. The summit will run from June 22nd to the 23rd, 2006, at Michigan State University. The deadline for position statements is March 31st, 2006. For more information see the DTS-L web site.
Juxta
Juxta has just been released. This is an application for comparing and collating multiple witnesses to a single text. It is open source and has an elegant, clean interface. It was developed at the University of Virginia by Applied Research in Patacriticism with funding awarded to Jerome McGann by Mellon.
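The core of collation — lining up two witnesses and reporting where they diverge — can be sketched in a few lines of Python using the standard library’s sequence matcher (the witness texts here are made up, and this is an illustration of the idea, not Juxta’s actual algorithm):

```python
# Compare two hypothetical witnesses word by word and report the variants.
import difflib

witness_a = "the quick brown fox jumps over the lazy dog".split()
witness_b = "the quick red fox jumped over the dog".split()

matcher = difflib.SequenceMatcher(None, witness_a, witness_b)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        # tag is "replace", "delete", or "insert"
        print(tag, witness_a[i1:i2], "->", witness_b[j1:j2])
```

A real collation tool does much more (alignment across many witnesses, visualization, apparatus output), but the diff is where it starts.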
WuffWuffWare: Analyze Text
WuffWuffWare (yes, I’m serious) has a small text annotation tool for the Mac called AnalyzeText. It sounds like you can use it as a highlighting and annotating tool, but it also has a concordancer built in. But does it roll over when told to the way my dog does? That’s about all the text analysis my dog Leo does.
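A concordancer, at its simplest, is a keyword-in-context (KWIC) display: every occurrence of a word with a few words of context on either side. Here is a toy version in Python (illustrative only — I have no idea how AnalyzeText implements its concordancer):

```python
# A toy KWIC concordancer: list each hit of a keyword with surrounding words.
def concordance(text, keyword, width=3):
    words = text.lower().split()
    lines = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}")
    return lines

for line in concordance("the dog barked at the other dog in the yard", "dog"):
    print(line)
```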
This is thanks to Alex.
The Gematriculator
Is Gematria text analysis? Alex has pointed me to a Gematriculator that seems to poke fun at the idea by letting you provide a URL or paste text in. I have no idea what the affiliation is of the homokaasu sect. (The FAQ says “homokaasu” is Finnish for “gay gas.”)
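For the curious, gematria just assigns numbers to letters and sums them. One common simple English scheme is A=1 through Z=26; a sketch (the Gematriculator’s actual scheme may well differ):

```python
# Simple English gematria: A=1 ... Z=26, ignoring everything else.
def gematria(text):
    return sum(ord(c) - ord("a") + 1 for c in text.lower() if "a" <= c <= "z")

# t=20, e=5, x=24, t=20
print(gematria("text"))
```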
TagCloud
TagCloud is both a way of showing word or tag frequency and a tool for content analysis. TagCloud.com has a tool that, I think, will give you a tag cloud for placing in your blog. The words are sized by importance and link to lists of related entries. A cool content analysis interface that provides a dynamic folksonomy.
TagCloud.com links to a good article on Folksonomy in the Wikipedia.
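The sizing trick behind a tag cloud is simple: count word frequencies and map counts onto a range of font sizes. A sketch of the idea (not TagCloud.com’s actual algorithm, and the point sizes are arbitrary):

```python
# Map word counts linearly onto font sizes for a tag cloud.
from collections import Counter

def tag_sizes(words, min_pt=10, max_pt=36):
    counts = Counter(words)
    lo, hi = min(counts.values()), max(counts.values())
    sizes = {}
    for word, n in counts.items():
        if hi == lo:
            sizes[word] = max_pt   # all words equally frequent
        else:
            sizes[word] = min_pt + (n - lo) * (max_pt - min_pt) // (hi - lo)
    return sizes

words = "text analysis text tools text mining analysis".split()
print(tag_sizes(words))
```

Real tag clouds often use a logarithmic scale instead, since word frequencies are heavily skewed.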
Web Crawler: Nutch
Nutch is “open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.” There is a Nutch Wiki with links to news, presentations and articles on it.
Nutch is basically an open Google-like engine that indexes an intranet (or the web) and gives you search capability. This sort of tool could be useful if there were ways to adapt it to discipline-specific crawling.
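The heart of any such engine is an inverted index: a map from each word to the documents that contain it. A toy version in Python shows the idea (Nutch, building on Lucene, is of course far more elaborate — ranking, link graphs, incremental crawling — and these sample pages are made up):

```python
# Build a toy inverted index over a few hypothetical crawled pages.
from collections import defaultdict

docs = {
    "page1": "open source web search software",
    "page2": "a crawler fetches web pages",
    "page3": "source code for the crawler",
}

index = defaultdict(set)          # word -> set of doc ids containing it
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(word):
    return sorted(index.get(word.lower(), set()))

print(search("crawler"))
```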
Latent Semantic Analysis
LSA @ CU Boulder is a site at the University of Colorado at Boulder on Latent Semantic Analysis for education. The neat thing is that they provide a web interface to different LSA tools. Could these techniques be used in text analysis research? Could we offer them as web services?
A site they point to with a list of links to readings, projects and people is Readings in Latent Semantic Analysis, maintained by Lemaire and Dessus.
They also link to a Wired News article on LSA in education that explains how LSA can be used for automatic marking of essays; see “Teachers of Tomorrow?”
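For those who want the mechanics: LSA builds a term-document matrix, takes a truncated singular value decomposition, and compares texts in the reduced “latent” space. A minimal sketch with NumPy (toy documents of my own invention; the CU Boulder tools work with large trained corpora):

```python
# Minimal LSA: term-document matrix -> truncated SVD -> document similarity.
import numpy as np

docs = [
    "human computer interaction",
    "computer system interface",
    "graph tree minors",
]
vocab = sorted({w for d in docs for w in d.split()})
# Rows are terms, columns are documents.
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                     # keep the 2 largest latent dimensions
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T    # documents in latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The first two documents share vocabulary, so in the latent space they
# should be much closer to each other than either is to the third.
print(cos(doc_vecs[0], doc_vecs[1]), cos(doc_vecs[0], doc_vecs[2]))
```

Essay-marking systems like the one in the Wired article compare a student essay to graded essays in just such a latent space.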
ATLAS.ti
ATLAS.ti is a “Knowledge Workbench” for the qualitative analysis of texts, images, audio, and video. It looks like a PC program that lets you annotate large quantities of material for interpretation, coding, and clustering.
I saw this years ago, but it has matured and now handles multimedia. I should add that it is for sale, not free, though they have a trial version.
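The essence of qualitative coding is attaching codes to passages and then retrieving every passage by code. A bare-bones sketch of that data model in Python (a hypothetical illustration — nothing like ATLAS.ti’s actual file format or feature set):

```python
# A toy qualitative-coding model: codes map to (source, passage) pairs.
from collections import defaultdict

codings = defaultdict(list)

def code(source, passage, *codes):
    for c in codes:
        codings[c].append((source, passage))

# Hypothetical interview excerpts tagged with codes.
code("interview1.txt", "I learned to read at home", "literacy", "family")
code("interview2.txt", "school was where I found books", "literacy", "school")

# Retrieve every passage tagged with a given code.
for source, passage in codings["literacy"]:
    print(source, "::", passage)
```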