Computing Reviews, the leading online review service for computing literature.

Towards language independent automated learning of text categorization models
Apté C., Damerau F., Weiss S. Research and development in information retrieval (Proceedings of the 17th annual international ACM SIGIR conference, Dublin, Ireland, Jul 3-6, 1994), 23-30, 1994. Type: Proceedings

Date Reviewed: Sep 1 1995

An automatic rule-based method for assigning texts to predetermined categories is described in this straightforward account. A training sample of texts with known classification tags is used to construct rules capable of classifying new unknown texts, based on the occurrence in the texts of particular words and phrases. For example, a text containing the phrase “running back” would be immediately assigned to the category “football”; the same is true of items containing both the words “award” and “player.” Heuristic methods are used to find simple rules that properly separate all classes. The authors have applied their method to samples of articles disseminated by the Reuters newswire service; they studied both English- and German-language articles. They compute evaluation parameters such as the percentage of total documents for a given category that are correctly classified, and find that their system performs with a high degree of accuracy. Moreover, no differences were detected between the English and the German collections, showing that the method may work across many languages.
Many competing automatic classification methods have been described in the literature, including systems not based on a construction of formal classification rules. For example, the vocabulary of available test documents can be used to build a class profile (sometimes called a centroid) for each category of interest. Each centroid then consists of a large number of words and phrases plus, possibly, the associated term frequency information. An unknown document can then be classified by comparing its vocabulary globally with all existing centroids, and then assigning the text to the class with the highest matching centroid. Such a method may be more robust than the rule-based approach when the vocabulary changes dynamically over time, because explicit rules are never recorded.
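The centroid alternative mentioned above can be illustrated with a minimal sketch. It is not taken from the paper or from any particular system: the function names, the whitespace tokenization, and the choice of cosine similarity as the matching score are all assumptions made for illustration.

```python
from collections import Counter
import math


def build_centroids(labeled_docs):
    """Sum the term frequencies of all documents in each category.

    labeled_docs is an iterable of (category, text) pairs (hypothetical format).
    """
    centroids = {}
    for category, text in labeled_docs:
        centroids.setdefault(category, Counter()).update(text.lower().split())
    return centroids


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (assumed score)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, centroids):
    """Assign the text to the category whose centroid matches it best."""
    vec = Counter(text.lower().split())
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))
```

Because a centroid is only a table of term counts, it can be rebuilt or incrementally updated as new labeled documents arrive, which is one way to read the robustness remark above.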
The authors prefer the rule-based approach because the rules are “explicitly interpretable” and are “compatible with human-expressed knowledge.” No comparison of the effectiveness of the rule-based classification method with other competing methods is included.
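To make the interpretability point concrete, here is a minimal sketch of how rules of the kind quoted in the review might be represented and applied, together with the evaluation parameter the review describes (the percentage of a category's documents that are correctly classified). The rules are written by hand from the review's two examples; the paper induces its rules from training data, and that induction step is not attempted here. All names (`RULES`, `categories`, `percent_correct`) are hypothetical.

```python
import re

# Each rule is a set of terms that must all occur in the text; a category
# fires when any one of its rules matches. Hand-written from the review's
# examples, not induced from data.
RULES = {
    "football": [{"running back"}, {"award", "player"}],
}


def normalize(text):
    """Lowercase, keep alphanumeric tokens, pad with spaces for whole-word matching."""
    return " " + " ".join(re.findall(r"[a-z0-9]+", text.lower())) + " "


def categories(text, rules=RULES):
    """Return every category with at least one rule whose terms all occur."""
    padded = normalize(text)
    return [
        cat
        for cat, alternatives in rules.items()
        if any(all(f" {term} " in padded for term in rule) for rule in alternatives)
    ]


def percent_correct(category, labeled_docs, rules=RULES):
    """Percentage of the documents labeled `category` that the rules assign to it."""
    relevant = [text for label, text in labeled_docs if label == category]
    if not relevant:
        return 0.0
    hits = sum(category in categories(text, rules) for text in relevant)
    return 100.0 * hits / len(relevant)
```

Each rule can be read directly as a statement about vocabulary, which is the sense in which such rules are open to human inspection.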

Reviewer: Gerard Salton	Review #: CR118914 (9509-0717)

Information Search And Retrieval (H.3.3)

Induction (I.2.6 ...)

Text Analysis (I.2.7 ...)

Content Analysis And Indexing (H.3.1)

Learning (I.2.6)

Natural Language Processing (I.2.7)
