Computing Reviews
Texterra: a framework for text analysis
Turdakov D., Astrakhantsev N., Nedumov Y., Sysoev A., Andrianov I., Mayorov V., Fedorenko D., Korshunov A., Kuznetsov S. Programming and Computer Software 40(5): 288-295, 2014. Type: Article
Date Reviewed: Jan 8 2015

Turdakov et al. describe the Texterra framework developed at the Institute for System Programming (ISP) of the Russian Academy of Sciences (RAS). Texterra’s intended use is for multi-language text analysis using knowledge extracted from the web. While specific applications have been developed on the basis of Texterra, it is a more general framework that can be adapted to various domains and use cases.

In a brief survey of the field, the authors identify a number of similar frameworks, systems, and libraries, among them OpenNLP, NLTK, Apache UIMA, WordNet, AlchemyAPI, and OpenCalais. As the main distinguishing features, Texterra claims an extensible architecture, an automatically updated knowledge base, the support of several knowledge bases, knowledge-based tools for analyzing lexical semantics, support of several languages, and a high processing rate. In my view, the first four are or should be features exhibited by any modern text analysis system, which leaves the support of multiple languages and performance as main differentiators for Texterra. On the other hand, those features can be quite complex; comparing systems by ticking off their presence or absence may be quite superficial. Concerning languages, the paper discusses Texterra instantiations based on English and Russian; it is unclear if there are others.

At a high level, the architecture of the Texterra system is no different from that of similar systems. It consists of two major parts: one for processing natural language texts, the other a knowledge base management tool. In the natural language processing (NLP) part, Texterra uses standard methods (most of them based on the OpenNLP library) in combination with tools that interact with a knowledge base derived from Wikipedia. Language independence is provided by specific morphological tools that generate normal forms of words, and by using the Wikipedia instance for the selected language.

The performance comparison (“operating speed”) is against the DBpedia Spotlight system, but the paper states neither which versions of Texterra and Spotlight were used nor the hardware configuration. On a fairly small corpus (131 English text documents, averaging about 242 words each, with a total size of 190 kilobytes (KB)), Texterra achieved 15,722 words per second (words/s) versus 5,679 for Spotlight in a term recognition task, a factor of about 2.8; for word sense disambiguation, the gap was similar (13,684 words/s for Texterra versus 5,824 for Spotlight, a factor of about 2.3).
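
To put the reported figures in perspective, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope calculation on the numbers quoted in this review; the function and variable names are illustrative and do not come from the paper, and the corpus size in words is an estimate derived from the approximate per-document average.

```python
# Figures as quoted in the review; nothing here is independently measured.
CORPUS_DOCS = 131
AVG_WORDS_PER_DOC = 242  # approximate average, so the total below is an estimate

# Reported throughput in words per second, by task and system.
THROUGHPUT_WORDS_PER_S = {
    "term recognition": {"Texterra": 15722, "Spotlight": 5679},
    "word sense disambiguation": {"Texterra": 13684, "Spotlight": 5824},
}


def speedup(task):
    """Ratio of Texterra's reported throughput to Spotlight's for a task."""
    rates = THROUGHPUT_WORDS_PER_S[task]
    return rates["Texterra"] / rates["Spotlight"]


def corpus_words():
    """Estimated number of words in the test corpus."""
    return CORPUS_DOCS * AVG_WORDS_PER_DOC


def seconds_per_pass(task, system):
    """Estimated time for one pass over the corpus at the reported rate."""
    return corpus_words() / THROUGHPUT_WORDS_PER_S[task][system]


if __name__ == "__main__":
    print(f"estimated corpus size: {corpus_words()} words")
    for task, rates in THROUGHPUT_WORDS_PER_S.items():
        print(f"{task}: speedup {speedup(task):.2f}x")
        for system in rates:
            print(f"  {system}: {seconds_per_pass(task, system):.2f} s per pass")
```

Under these assumptions, a single pass over the corpus takes only a few seconds for either system, which is one reason the missing details on versions and hardware matter when interpreting the comparison.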

Overall, I found the paper very interesting, and I believe Texterra can be an alternative or additional tool for some NLP-based tasks. Because the paper is short, it offers only a brief overview of Texterra; I was not able to find additional recent publications on the system. An application programming interface (API) for the system is available on the Institute’s website (https://api.ispras.ru/), which also includes technical documentation for the representational state transfer (REST) API.

Reviewer: Franz Kurfess
Review #: CR143065 (1504-0307)
  Editor Recommended
Featured Reviewer
 
 
Text Processing (I.5.4 ... )
Text Analysis (I.2.7 ... )
General (I.7.0)
Learning (I.2.6)
 
Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®