Unstructured Information Management Architecture (UIMA) from IBM

According to the Reuters article “Search concepts, not keywords, IBM tells business” (August 8, 2005), IBM is releasing its UIMA SDK (Unstructured Information Management Architecture Software Development Kit) to developers as open source.
According to an IBM Overview, UIMA provides tools for improving text searching through “analysis technologies, including statistical and rule-based Natural Language Processing (NLP), Information Retrieval (IR), machine learning, and ontologies.” Unstructured information includes not only text but also audio, video, and images. Thanks to Mike Rowse for this.

http://www.alphaworks.ibm.com/tech/uima/

Here is an extended quote from the IBM Overview:

UIMA is an architecture in which basic building blocks called Analysis Engines (AEs) are composed in order to analyze a document. At the heart of AEs are the analysis algorithms that do all the work to analyze documents and record analysis results (for example, detecting person names). These algorithms are packaged within components that are called Annotators. AEs are the stackable containers for annotators and other analysis engines.

How Annotators represent and share their results is an important part of the UIMA architecture. To enable composition and reuse, UIMA defines a Common Analysis Structure (CAS) precisely for these purposes. The CAS is an object-based container that manages and stores typed objects having properties and values. Object types may be related to each other in a single-inheritance hierarchy. Annotators are given a CAS having the subject of analysis (the document), in addition to any previously created objects (from annotators earlier in the pipeline), and they add their own objects to the CAS. The CAS serves as a common data object, shared among the annotators that are assembled for an application.

Many UIM applications analyze entire collections of documents. UIMA supports this analysis through its Collection Processing Architecture. This part of the architecture allows specification of a “source-to-sink” flow from a collection reader through a set of analysis engines and then to a set of CAS Consumers. The collection reader’s job is to connect to and iterate through a source collection, acquiring documents and initializing CASes for analysis. After the analysis engines have added their information to the CAS, CAS consumers do the final CAS processing, for example, sending the CAS contents to a search engine or extracting elements of interest and populating a relational database. A Semantic Search engine is included in the UIMA SDK; it will allow the developer to experiment with indexing analysis results, which will enable semantic searches using the annotations in the CAS.
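
To make the moving parts in that quote a little more concrete, here is a toy sketch in Python of how a CAS, annotators, an analysis engine, a collection reader, and a CAS consumer might fit together. This is not the UIMA SDK API (the SDK is Java-based); every class and function name below is hypothetical and invented for illustration, and the "person name" rule is deliberately naive. The single-inheritance type hierarchy mentioned in the quote is left out to keep the sketch short.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed object in the CAS: a span of the document plus properties."""
    type_name: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Toy stand-in for the Common Analysis Structure (hypothetical, not the SDK class)."""
    document_text: str
    annotations: list = field(default_factory=list)

    def add(self, annotation):
        self.annotations.append(annotation)

    def select(self, type_name):
        return [a for a in self.annotations if a.type_name == type_name]

    def covered_text(self, annotation):
        return self.document_text[annotation.begin:annotation.end]


class TokenAnnotator:
    """Adds one Token annotation per run of word characters."""

    def process(self, cas):
        for match in re.finditer(r"\w+", cas.document_text):
            cas.add(Annotation("Token", match.start(), match.end()))


class PersonNameAnnotator:
    """Naive rule: two consecutive title-cased tokens form a person name.
    Depends on Token annotations added by an earlier annotator."""

    def process(self, cas):
        tokens = cas.select("Token")
        for first, second in zip(tokens, tokens[1:]):
            if cas.covered_text(first).istitle() and cas.covered_text(second).istitle():
                cas.add(Annotation("PersonName", first.begin, second.end))


class AnalysisEngine:
    """Stackable container: runs annotators or nested engines in order on one CAS."""

    def __init__(self, components):
        self.components = components

    def process(self, cas):
        for component in self.components:
            component.process(cas)


def collection_reader(documents):
    """Iterates over a source collection and initializes one CAS per document."""
    for text in documents:
        yield Cas(text)


class NameIndexConsumer:
    """CAS consumer: collects person names into a simple in-memory index."""

    def __init__(self):
        self.index = {}

    def process(self, cas):
        for name in cas.select("PersonName"):
            self.index.setdefault(cas.covered_text(name), []).append(cas.document_text)


def run_pipeline(documents, engine, consumers):
    """Source-to-sink flow: reader, then analysis engine, then CAS consumers."""
    for cas in collection_reader(documents):
        engine.process(cas)
        for consumer in consumers:
            consumer.process(cas)


# An engine can contain another engine as well as plain annotators.
engine = AnalysisEngine([AnalysisEngine([TokenAnnotator()]), PersonNameAnnotator()])
consumer = NameIndexConsumer()
run_pipeline(
    ["Ada Lovelace wrote the first program.", "the sdk is open source."],
    engine,
    [consumer],
)
print(consumer.index)
```

The point of the sketch is the shared data object: the second annotator never sees the first one directly, it only reads the Token objects already sitting in the CAS and adds its own. The consumer at the end works the same way, which is what lets the pieces be recombined.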