Computing Reviews
Arabic text categorization based on Arabic Wikipedia
Yahya A., Salhi A. ACM Transactions on Asian Language Information Processing 13(1): 1-20, 2014. Type: Article
Date Reviewed: Jun 6, 2014

Text categorization of free-text documents is a much-explored field when the texts are written in English, and there is also literature on the study of, and support for, Middle Eastern and Asian languages. Categorizing texts into predefined categories reduces dimensionality and improves access to, and retrieval of, query text in large data sources.

The authors present a percentage and difference categorization (PDC) algorithm, with a multivalued difference variant, for categorizing Arabic text based on Arabic Wikipedia. The algorithm calculates the percentage (word frequency divided by total words) of each input word and compares it with the preexisting word percentages in each category; it then assigns the input to the category with the smallest difference. The multivalued variant additionally weights the differences over discrete sets of values such as [0, 1], [0, 0.5, 1], and [0, 0.25, 0.5, 0.75, 1]. According to the authors, their algorithm “started with a simple weighting idea and progressed to a more complex one that considers the relation of weights in the input text and the training data.”
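
As a reading aid, here is a minimal sketch of the core PDC comparison step as this reviewer understands it; the function names, the profile data structure, and the absolute-difference scoring are assumptions of this sketch, not the authors' code.

    from collections import Counter

    def word_percentages(words):
        """Map each word to its frequency divided by the total word count."""
        total = len(words) or 1
        return {w: c / total for w, c in Counter(words).items()}

    def categorize_pdc(input_words, category_profiles):
        """Assign the input text to the category whose stored word
        percentages differ least, in total, from the input's percentages.
        `category_profiles` maps category name -> {word: percentage}.
        A sketch of the idea, not the authors' implementation."""
        input_pct = word_percentages(input_words)
        best_category, best_diff = None, float("inf")
        for category, profile in category_profiles.items():
            # Sum absolute differences over the input's words; a word
            # unseen in a category contributes its full percentage.
            diff = sum(abs(p - profile.get(w, 0.0)) for w, p in input_pct.items())
            if diff < best_diff:
                best_category, best_diff = category, diff
        return best_category

The multivalued variant would then weight each difference over one of the discrete sets listed above before summing; that refinement is omitted here.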

The authors design text preprocessing tools, including a root extractor, a light stemmer, and an expression extractor, and test them with the proposed algorithm. The root extractor iteratively removes prefixes, suffixes, and infixes to reduce each distinct word form to its root. The light stemmer extracts nouns, verbs, and adjectives, matching prefix and suffix strings against a list, and normalizes the letters. The authors found that light stemming with a reference list of pre-stemmed words performed better than general light stemming. The expression extractor pulls single-, double-, and triple-word expressions from the input list, ignoring expressions that contain verbs.
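
To make the light stemming step concrete, here is a toy sketch; the affix lists and normalization table are illustrative placeholders, not the authors' reference lists, and the root extractor and expression extractor are not covered.

    # Toy Arabic light stemmer: normalize letters, then strip one common
    # prefix and one common suffix. Affix lists are illustrative only.
    PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]  # longest first
    SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ه", "ي"]
    NORMALIZE = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ى": "ي", "ة": "ه"})

    def light_stem(word):
        word = word.translate(NORMALIZE)
        for p in PREFIXES:
            # Strip a prefix only if at least a 3-letter stem remains.
            if word.startswith(p) and len(word) - len(p) >= 3:
                word = word[len(p):]
                break
        for s in SUFFIXES:
            if word.endswith(s) and len(word) - len(s) >= 3:
                word = word[:-len(s)]
                break
        return word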

According to Yahya and Salhi, splitting the Arabic Wikipedia corpus into 66 percent training and 34 percent testing data, with the test portion overlapping the last 34 percent of the data, leads to better results. They also consider using different data sources to make the tests more representative of real-world applications.
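
The "overlapping" detail is ambiguous as summarized; one plausible reading, sketched below purely as an interpretation, is that the last 34 percent of the corpus is reused for testing while remaining available to training.

    def split_overlapping(docs):
        """One reading of the 66/34 'overlapping' split: train on the full
        corpus, test on the last 34 percent, so the test set overlaps the
        training data. Interpretation only; the paper's protocol may differ."""
        cut = int(len(docs) * 0.66)
        return docs, docs[cut:]  # (training set, overlapping test set)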

The paper is worth reading for those studying text categorization in English, Arabic, and other languages. Readers can also investigate this area further through the references given in the paper.

Reviewer: Lalit Saxena | Review #: CR142366 (1409-0778)
Categories: Clustering (H.3.3); Algorithms (I.5.3); Text Analysis (I.2.7); Clustering (I.5.3); Natural Language Processing (I.2.7); Document and Text Processing (I.7)
 
Other reviews under "Clustering":

Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases
Can F. (ed), Ozkarahan E. ACM Transactions on Database Systems 15(3): 483-517, 1990. Type: Article. Date reviewed: Dec 1, 1992

A parallel algorithm for record clustering
Omiecinski E., Scheuermann P. ACM Transactions on Database Systems 15(3): 599-624, 1990. Type: Article. Date reviewed: Nov 1, 1992

Organization of clustered files for consecutive retrieval
Deogun J., Raghavan V., Tsou T. ACM Transactions on Database Systems 9(4): 646-671, 1984. Type: Article. Date reviewed: Jun 1, 1985
