Latent Semantic Analysis

LSA @ CU Boulder is a site at the University of Colorado at Boulder on Latent Semantic Analysis for education. The neat thing is that they provide a web interface to several LSA tools. Could these techniques be used for text analysis in research? Could we offer them as web services?

They point to Readings in Latent Semantic Analysis, a site maintained by Lemaire and Dessus that lists links to readings, projects and people.

They also link to a Wired News article, Teachers of Tomorrow?, that explains how LSA can be used to mark essays automatically.

Here is a quote from their information page on what LSA is:

Latent Semantic Analysis (LSA) is a mathematical/statistical technique for extracting and representing the similarity of meaning of words and passages by analysis of large bodies of text. It uses singular value decomposition, a general form of factor analysis, to condense a very large matrix of word-by-context data into a much smaller, but still large (typically 100-500 dimensional) representation (Deerwester, Dumais, Furnas, Landauer & Harshman, 1990). The right number of dimensions appears to be crucial; the best values yield up to four times as accurate simulation of human judgments as ordinary co-occurrence measures.

The similarity between resulting vectors for words and contexts, as measured by the cosine of their contained angle, has been shown to closely mimic human judgments of meaning similarity and human performance based on such similarity in a variety of ways. For example, after training on about 2,000 pages of English text it scored as well as average test-takers on the synonym portion of TOEFL, the ETS Test of English as a Foreign Language (Landauer & Dumais, 1997). After training on an introductory psychology textbook it achieved a passing score on a multiple-choice exam (Landauer, Foltz & Laham, in prep). LSA significantly improves automatic information retrieval by allowing user requests to find relevant text on a desired topic even when the text contains none of the words used in the query (Dumais, 1991, 1994).
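
To make the quoted description a little more concrete, here is a minimal sketch of the two steps it names: condensing a word-by-context matrix with singular value decomposition, then comparing words by the cosine of the angle between their vectors. This is not the CU Boulder implementation. The five-sentence corpus, the function names, and the choice of k = 2 are all my own assumptions for illustration. It also uses raw counts, where a real system would train on a large corpus, usually weight the counts, and keep a few hundred dimensions.

```python
import numpy as np

# Hypothetical toy corpus: each string is treated as one "context".
docs = [
    "the doctor treated the patient",
    "the physician treated the patient",
    "the physician examined the nurse",
    "the pilot flew the plane",
    "the pilot landed the aircraft",
]


def build_word_by_context(docs):
    """Return the vocabulary and a word-by-context matrix of raw counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    row = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            X[row[w], j] += 1.0
    return vocab, X


def reduced_word_vectors(X, k):
    """Keep the k largest singular values; one k-dimensional row per word."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vocab, X = build_word_by_context(docs)
W = reduced_word_vectors(X, k=2)  # k=2 is an assumption that suits this toy data
row = {w: i for i, w in enumerate(vocab)}

# Raw co-occurrence: "doctor" and "physician" never share a context.
raw_sim = cosine(X[row["doctor"]], X[row["physician"]])
# Reduced space: compare the same pair, and an unrelated pair.
lsa_sim = cosine(W[row["doctor"]], W[row["physician"]])
lsa_unrelated = cosine(W[row["doctor"]], W[row["pilot"]])

print("raw doctor/physician:", raw_sim)
print("LSA doctor/physician:", lsa_sim)
print("LSA doctor/pilot:    ", lsa_unrelated)
```

In this toy corpus "doctor" and "physician" never appear in the same sentence, so their raw count vectors have a cosine of zero. After reduction they come out much closer to each other than "doctor" is to "pilot", because they occur alongside the same neighbouring words. The same effect is what lets a query match text that contains none of the query's words.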