Computing Reviews

Texterra: a framework for text analysis
Turdakov D., Astrakhantsev N., Nedumov Y., Sysoev A., Andrianov I., Mayorov V., Fedorenko D., Korshunov A., Kuznetsov S. Programming and Computer Software 40(5): 288-295, 2014. Type: Article
Date Reviewed: 01/08/15

Turdakov et al. describe the Texterra framework developed at the Institute for System Programming (ISP) of the Russian Academy of Sciences (RAS). Texterra is intended for multilingual text analysis using knowledge extracted from the web. While specific applications have been built on Texterra, it is a more general framework that can be adapted to various domains and use cases.

In a brief survey of the field, the authors identify a number of similar frameworks, systems, and libraries, among them OpenNLP, NLTK, Apache UIMA, WordNet, AlchemyAPI, and OpenCalais. As the main distinguishing features, Texterra claims an extensible architecture, an automatically updated knowledge base, the support of several knowledge bases, knowledge-based tools for analyzing lexical semantics, support of several languages, and a high processing rate. In my view, the first four are or should be features exhibited by any modern text analysis system, which leaves the support of multiple languages and performance as main differentiators for Texterra. On the other hand, those features can be quite complex; comparing systems by ticking off their presence or absence may be quite superficial. Concerning languages, the paper discusses Texterra instantiations based on English and Russian; it is unclear if there are others.

At a high level, the architecture of the Texterra system does not differ from that of similar systems. It consists of two major parts: one for processing natural language texts, the other a knowledge base management tool. In the natural language processing (NLP) part, Texterra uses standard methods (most of them based on the OpenNLP library) in combination with tools that interact with a knowledge base derived from Wikipedia. Language independence is provided by specific morphological tools that generate normal forms of words, and by using the Wikipedia instance for the selected language.

The performance comparison (“operating speed”) is against the DBpedia Spotlight system, although the paper states neither which versions of Texterra and Spotlight were used nor the hardware configuration. For the analysis of a fairly small corpus (131 English text documents, with about 242 words per document and a total size of 190 kilobytes (KB)), Texterra achieved 15722 words per second (words/s) versus 5679 for Spotlight in a term recognition task; for word sense disambiguation, the results were similar (Texterra with 13684 and Spotlight with 5824 words/s).
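To put these figures in perspective, the reported rates can be converted into relative speedups and approximate run times for the whole corpus. The following is a minimal sketch that uses only the numbers quoted above; the variable and function names are illustrative, and the total word count is an estimate because 242 is an average per document.

```python
# Throughput in words per second, as quoted in the review.
throughput = {
    "term recognition": {"Texterra": 15722, "Spotlight": 5679},
    "word sense disambiguation": {"Texterra": 13684, "Spotlight": 5824},
}

# Corpus size as quoted: 131 documents, about 242 words per document.
DOCS = 131
WORDS_PER_DOC = 242
total_words = DOCS * WORDS_PER_DOC  # estimate, since 242 is an average


def speedup(task: str) -> float:
    """Ratio of Texterra's throughput to Spotlight's for one task."""
    rates = throughput[task]
    return rates["Texterra"] / rates["Spotlight"]


def seconds_for_corpus(task: str, system: str) -> float:
    """Approximate time to process the whole corpus at the quoted rate."""
    return total_words / throughput[task][system]


if __name__ == "__main__":
    print(f"estimated corpus size: {total_words} words")
    for task in throughput:
        print(
            f"{task}: speedup {speedup(task):.2f}x, "
            f"Texterra {seconds_for_corpus(task, 'Texterra'):.1f} s, "
            f"Spotlight {seconds_for_corpus(task, 'Spotlight'):.1f} s"
        )
```

By this estimate, each system would process the entire corpus in a few seconds, which underlines how small the test set is; differences of this size could be sensitive to startup costs and hardware, neither of which the paper reports.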

Overall, I found the paper very interesting, and I believe Texterra can be an alternative or additional tool for some NLP-based tasks. Because the paper is short, it offers only a brief overview of Texterra; I was not able to find additional recent publications on the system. An application programming interface (API) for the system is available on the Institute’s website (https://api.ispras.ru/), which also provides documentation and technical details for the representational state transfer (REST) API.

Reviewer:  Franz Kurfess Review #: CR143065 (1504-0307)
