Computing Reviews
Towards language independent automated learning of text categorization models
Apté C., Damerau F., Weiss S. Research and development in information retrieval (Proceedings of the 17th annual international ACM SIGIR conference, Dublin, Ireland, Jul 3-6, 1994), 23-30, 1994. Type: Proceedings
Date Reviewed: Sep 1 1995

An automatic rule-based method for assigning texts to predetermined categories is described in this straightforward account. A training sample of texts with known classification tags is used to construct rules capable of classifying new unknown texts, based on the occurrence in the texts of particular words and phrases. For example, a text containing the phrase “running back” would be immediately assigned to the category “football”; the same is true of items containing both the words “award” and “player.” Heuristic methods are used to find simple rules that properly separate all classes.
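
The kind of rule described above can be illustrated with a minimal Python sketch. It only shows how such rules might be applied once they exist; it does not reproduce the authors' heuristic procedure for inducing them from training texts. The `RULES` table and the helper names `normalize` and `classify` are hypothetical, and the two football rules are the ones quoted in the review.

```python
import re

# Hypothetical rule table. Each category maps to a list of alternative rules;
# a rule is a set of words or phrases that must ALL occur in the text.
RULES = {
    "football": [
        {"running back"},      # a single phrase is enough
        {"award", "player"},   # both words must be present
    ],
}


def normalize(text):
    """Lowercase the text and pad it so terms match on word boundaries."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " " + " ".join(tokens) + " "


def classify(text, rules=RULES):
    """Return the sorted categories for which at least one rule is satisfied."""
    padded = normalize(text)
    return sorted(
        category
        for category, alternatives in rules.items()
        if any(all(f" {term} " in padded for term in rule) for rule in alternatives)
    )


print(classify("The running back scored twice."))
print(classify("The player accepted the award."))
print(classify("The award went to a novelist."))
```

Matching here is exact on whole words, so "players" would not satisfy a rule that names "player"; whether the original system stems or otherwise normalizes words is not stated in the review.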

The authors have applied their method to samples of articles disseminated by the Reuters newswire service; they studied both English- and German-language articles. They compute evaluation parameters such as the percentage of total documents for a given category that are correctly classified, and find that their system performs with a high degree of accuracy. Moreover, no differences were detected between the English and the German collections, suggesting that the method may work across many languages.
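
The evaluation parameter mentioned above, the percentage of all documents in a category that are correctly classified, corresponds to what is commonly called per-category recall. A minimal sketch of that computation follows; the function name `category_recall` and the toy labels are assumptions made for illustration and are not data from the paper.

```python
def category_recall(true_labels, predicted_labels, category):
    """Percentage of documents truly in `category` that were assigned to it.

    Both arguments are lists of label sets, aligned by document position.
    Returns None when no document belongs to the category.
    """
    relevant = [i for i, labels in enumerate(true_labels) if category in labels]
    if not relevant:
        return None
    hits = sum(1 for i in relevant if category in predicted_labels[i])
    return 100.0 * hits / len(relevant)


# Made-up labels for four documents (illustration only).
true_labels = [{"football"}, {"football"}, {"finance"}, {"football"}]
predicted = [{"football"}, set(), {"finance"}, {"football"}]

print(category_recall(true_labels, predicted, "football"))
print(category_recall(true_labels, predicted, "finance"))
```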

Many competing automatic classification methods have been described in the literature, including systems not based on a construction of formal classification rules. For example, the vocabulary of available test documents can be used to build a class profile (sometimes called a centroid) for each category of interest. Each centroid then consists of a large number of words and phrases plus, possibly, the associated term frequency information. An unknown document can then be classified by comparing its vocabulary globally with all existing centroids, and then assigning the text to the class with the highest matching centroid. Such a method may be more robust than the rule-based approach when the vocabulary changes dynamically over time, because explicit rules are never recorded. The authors prefer the rule-based approach because the rules are “explicitly interpretable” and are “compatible with human-expressed knowledge.”
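
The centroid alternative described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a system from the literature: the review does not say how a document is compared with a centroid, so cosine similarity over raw term frequencies is assumed here, and the labeled example documents and helper names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_centroids(labeled_docs):
    """Sum term frequencies per category to form one class profile each."""
    centroids = {}
    for text, category in labeled_docs:
        centroids.setdefault(category, Counter()).update(tokenize(text))
    return centroids


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors (an assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, centroids):
    """Assign the text to the category whose centroid matches it best."""
    vector = Counter(tokenize(text))
    return max(centroids, key=lambda c: cosine(vector, centroids[c]))


# Made-up labeled documents (illustration only).
docs = [
    ("the quarterback threw a touchdown pass", "football"),
    ("the running back rushed for a touchdown", "football"),
    ("the bank raised interest rates", "finance"),
    ("interest rates and bond markets fell", "finance"),
]
centroids = build_centroids(docs)

print(classify("a late touchdown pass won the game", centroids))
print(classify("bond markets rallied as rates dropped", centroids))
```

Because the profiles are rebuilt directly from whatever labeled documents are available, a shift in vocabulary only requires recomputing the counts, which is the robustness argument made above; the cost is that no human-readable rule is ever produced.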

No comparison of the effectiveness of the rule-based classification method with other competing methods is included.

Reviewer: Gerard Salton
Review #: CR118914 (9509-0717)
Information Search And Retrieval (H.3.3)
Induction (I.2.6 ...)
Text Analysis (I.2.7 ...)
Content Analysis And Indexing (H.3.1)
Learning (I.2.6)
Natural Language Processing (I.2.7)