Traditionally, n-grams are derived by extracting groups of words as they appear in a text. This paper describes a new way to formulate n-grams, referred to as syntactic n-grams (sn-grams). Sn-grams are built by extracting groups of words not in the order in which they appear in a text, but according to how they are connected in syntactic parse trees. Derivations from both constituent grammar parses and typed dependency parses are possible. This difference in derivation is interesting and could affect any task in which n-grams are used.
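To make the distinction concrete, the following is a minimal sketch of my own, not the authors' implementation: it contrasts surface bigrams with sn-grams of size two taken along the arcs of a typed dependency parse. The toy sentence and its hand-coded head indices are illustrative assumptions; a real system would obtain them from a parser.

```python
# Sketch: surface bigrams vs. syntactic bigrams (sn-grams of size 2).
# The dependency parse below is hand-built; indices are 0-based and
# the root's head is marked -1.

def surface_ngrams(words, n):
    """Traditional n-grams: consecutive words in surface order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def syntactic_bigrams(words, heads):
    """Sn-grams of size 2: (head word, dependent word) pairs taken
    along dependency arcs rather than surface adjacency."""
    return [(words[h], words[i]) for i, h in enumerate(heads) if h >= 0]

# "the cat sat on the mat", with a hand-coded dependency analysis:
# the->cat, cat->sat, on->sat, the->mat, mat->on; "sat" is the root.
words = ["the", "cat", "sat", "on", "the", "mat"]
heads = [1, 2, -1, 2, 5, 3]

print(surface_ngrams(words, 2))
# surface bigrams include function-word pairs such as ("on", "the")
print(syntactic_bigrams(words, heads))
# sn-grams pair words with their syntactic heads, e.g. ("sat", "cat"),
# regardless of how far apart they sit in the surface string
```

The point of the contrast is visible even in this toy case: the surface bigram ("on", "the") pairs two stop words that happen to be adjacent, while the corresponding sn-grams link each word to its governor in the tree.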
N-grams have found plenty of uses; indeed, they are the mainstay of many language models. However, they suffer from two main problems. The first is data sparseness: many legitimate word sequences never appear in the training data, a problem that is especially acute when little data is available. The second involves stop words, words that carry little meaning on their own. Because traditional n-grams are extracted from the surface order of a text, stop words can make their way into them. The use of sn-grams can potentially overcome, or at least alleviate, both problems.
The authors apply sn-grams to the problem of author attribution, the task of identifying the author of a piece of text. Comparing an approach based on sn-grams to one based on traditional n-grams, the paper shows that sn-grams perform better.
It would have been more interesting if the authors had compared sn-grams to other related techniques, such as path features and string kernels. Though not exactly equivalent, these are common ways of exploiting syntactic information.
Furthermore, while sn-grams outperform n-grams for author attribution, the case for the superiority of sn-grams would have been more convincing if either a more state-of-the-art approach to the problem had been used as the comparative baseline [1], or the authors had chosen another problem that better highlights the value of the sn-gram approach. Many advanced approaches to author attribution, such as those based on topic models, have been applied with good success.
I would suggest that the authors compare the performance of sn-grams to that of conditional random fields (CRFs) [2] in future work. Since a typical CRF classifier uses n-gram features, it would be quite exciting if the authors could show that sn-grams boost the performance of such classifiers.
I definitely found this paper interesting to read. The idea of syntactic n-grams has the potential to be very useful. The paper is also clearly written, and the approach is adequately explained.