IBM has developed a new architecture for working with data stored in “plain” text. Because it also handles text structured with Extensible Markup Language (XML), the real distinction is that the data is not held in a fixed-format structured database. IBM calls this the unstructured information management architecture (UIMA). This paper describes the architecture and illustrates its use by building a sample application.
The heart of this architecture is a design intended to allow significant reuse of components. Developers who want to find data in unstructured text need to understand the framework, and then work within it to supply or adapt the components that do not already exist. For example, a common problem is detecting words within the text. If a parser for a given character set (for example, Farsi) does not already exist, the developer might need to modify an existing parser (for example, one for the Western alphabet) to achieve the same purpose.
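To make the reuse idea concrete, here is a minimal sketch of how a word detector for one character set might be adapted for another. It is an illustration only and does not use UIMA's actual interfaces: the names `Document`, `RegexTokenizer`, `ArabicScriptTokenizer`, `run_pipeline`, and `tokens` are all hypothetical, and treating a Farsi word as a run of characters in the Unicode Arabic block is a deliberate simplification.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container: raw text plus the annotations added to it."""
    text: str
    annotations: list = field(default_factory=list)  # (kind, start, end) tuples


class RegexTokenizer:
    """Marks word spans; the pattern is the only script-specific part."""
    pattern = re.compile(r"[A-Za-z]+")

    def process(self, doc):
        for match in self.pattern.finditer(doc.text):
            doc.annotations.append(("token", match.start(), match.end()))


class ArabicScriptTokenizer(RegexTokenizer):
    # Reuses process() unchanged; only the character range differs.
    # Assumption: a word is a run of code points in the Unicode Arabic block.
    pattern = re.compile(r"[\u0600-\u06FF]+")


def run_pipeline(text, annotators):
    """Run each component over the same document, in order."""
    doc = Document(text)
    for annotator in annotators:
        annotator.process(doc)
    return doc


def tokens(doc):
    """Return the text covered by each token annotation."""
    return [doc.text[s:e] for kind, s, e in doc.annotations if kind == "token"]
```

In this sketch, swapping `RegexTokenizer()` for `ArabicScriptTokenizer()` in the list passed to `run_pipeline` changes the script being handled without touching the pipeline or any later component, which is the kind of reuse the paper is aiming for.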
Within the architecture, IBM proposes five developer roles and outlines what each role should do. It has also developed a starting framework that allows the system manager to use existing documents to train (and test) the modules.
This paper does a good job of explaining the concepts, components, and roles of the architecture. It isn’t intended as a developer’s reference, nor is it clear that the architecture is intended for use by people outside IBM. The paper does point out what is necessary to achieve high reuse of application components, so the process it describes can be applied to significantly different problems.
As I read this paper, the only criticism that came to mind is one that often applies to writing by information technology (IT) professionals: the acronyms almost get in the way of understanding. Each acronym is explained on first use, but some of them have different meanings in other fields, so it is difficult to keep them straight. On the other hand, without acronyms the paper would be significantly longer, so there is no easy solution.
I recommend this paper to anyone working with unstructured text, or anyone who wishes to implement a process with significant reuse of application components.