Computing Reviews
Syntactic N-grams as machine learning features for natural language processing
Sidorov G., Velasquez F., Stamatatos E., Gelbukh A., Chanona-Hernández L. Expert Systems with Applications: An International Journal 41(3): 853-860, 2014. Type: Article
Date Reviewed: Jan 6 2014

Traditionally, n-grams are derived by extracting groups of words as they appear in a text. This paper describes a new way to formulate n-grams, referred to as syntactic n-grams (sn-grams). Sn-grams are formed by extracting groups of words based not on their linear order in the text, but on how they are related in a syntactic parse tree; they can be derived from both constituency parses and typed dependency parses. This difference in derivation is interesting and can potentially change how n-grams are used.
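To make the distinction concrete, here is a minimal sketch (my own illustration, not the authors' implementation) that extracts sn-grams as head-to-dependent paths from a small, hand-coded dependency parse. The sentence, indices, and implied relations are assumptions made only for the example.

    from collections import defaultdict

    # Hypothetical dependency parse of "the cat sat on the mat", given as
    # (word, index_of_head); the root points to itself. The indices and the
    # implied relations are illustrative, not taken from the paper.
    parse = [
        ("the", 1),   # det   -> cat
        ("cat", 2),   # nsubj -> sat
        ("sat", 2),   # root
        ("on", 2),    # prep  -> sat
        ("the", 5),   # det   -> mat
        ("mat", 3),   # pobj  -> on
    ]

    # Build a head -> children adjacency list from the parse.
    children = defaultdict(list)
    for i, (word, head) in enumerate(parse):
        if head != i:                 # skip the root's self-link
            children[head].append(i)

    def sn_grams(node, n, prefix=()):
        """Yield all head-to-dependent paths of length n starting at node."""
        path = prefix + (parse[node][0],)
        if len(path) == n:
            yield path
            return
        for child in children[node]:
            yield from sn_grams(child, n, path)

    # Syntactic bigrams and trigrams: paths in the tree, not adjacent words.
    for n in (2, 3):
        grams = [g for i in range(len(parse)) for g in sn_grams(i, n)]
        print(n, grams)
    # 2 [('cat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('on', 'mat'), ('mat', 'the')]
    # 3 [('sat', 'cat', 'the'), ('sat', 'on', 'mat'), ('on', 'mat', 'the')]

Note that ('sat', 'on', 'mat') appears as an sn-gram even though it is not a contiguous word sequence in the sentence; this is exactly the kind of feature that surface n-grams miss.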

N-grams have found plenty of uses; indeed, they are the mainstay of many language models. However, they suffer from two main problems. The first is data sparseness: many legitimate word combinations occur rarely or not at all, especially when little training data is available. The other problem involves stop words, words that carry little meaning. Because n-grams are extracted from the surface word sequence, stop words can make their way into them. The use of sn-grams can potentially overcome, or at least alleviate, both problems.
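For contrast with the sketch above, a quick illustration (again my own, not from the paper) of how surface bigrams pick up stop words simply because the words are adjacent in the text:

    # Surface bigrams are taken over adjacent words, so function words such as
    # "the" and "on" end up in most of the features.
    sentence = "the cat sat on the mat".split()
    bigrams = list(zip(sentence, sentence[1:]))
    print(bigrams)
    # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]

Four of the five features include "the" or "on", which carry little content of their own.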

The authors apply sn-grams to the problem of author attribution, the process of identifying the author of a piece of text. Comparing an approach based on sn-grams to one using traditional n-grams, the paper shows that sn-grams demonstrate better performance.

It would have been more interesting if the authors had compared sn-grams to other related techniques, such as path features and string kernels. Though not exactly the same, these are common ways to use syntactic information.

Furthermore, while sn-grams outperform traditional n-grams for author attribution, the case for the superiority of sn-grams would have been more convincing if either a more state-of-the-art approach to the problem had been used as the comparative baseline [1], or another problem had been chosen that better highlights the value of the sn-gram approach. Many advanced approaches to author attribution, such as the use of topic models, have been applied with good success.

I would suggest that the authors compare the performance of sn-grams to that of conditional random fields (CRF) [2] in a future work. Since the typical CRF classifier uses n-grams, it would be quite exciting if they could show that sn-grams can boost the performance of these CRF classifiers.

I definitely found this paper interesting to read. The idea of syntactic n-grams has the potential to be very useful. The paper is also clearly written, and the approach is adequately explained.

Reviewer:  Jun-Ping Ng Review #: CR141864 (1403-0225)
1) Stamatatos, E. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60, 3 (2009), 538–556.
2) Lafferty, J.; McCallum, A.; Pereira, F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th Int. Conf. on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2001, 282–289.
Learning (I.2.6)
Natural Language Processing (I.2.7)
Other reviews under "Learning":
Learning in parallel networks: simulating learning in a probabilistic system
Hinton G. (ed) BYTE 10(4): 265-273, 1985. Type: Article
Nov 1 1985
Macro-operators: a weak method for learning
Korf R. Artificial Intelligence 26(1): 35-77, 1985. Type: Article
Feb 1 1986
Inferring (mal) rules from pupils’ protocols
Sleeman D. Progress in artificial intelligence (Orsay, France), 1985. Type: Proceedings
Dec 1 1985
