Computing Reviews

Towards automatic identification of core concepts in educational resources
Sultan M., Bethard S., Sumner T.  JCDL 2014 (Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, London, UK, Sep 8-12, 2014), 379-388, 2014. Type: Proceedings
Date Reviewed: 05/27/15

This paper studies the problem of recognizing the degree of similarity between ideas expressed in sentences. In particular, the authors consider the case of science education, where a set of core ideas expressed in sentences exists and one wants to know how close the idea in another sentence is to that core.

The paper proposes a two-phase algorithm to solve this problem. In the first phase, relevant features are extracted from both the core sentence and the sentence being examined. In the second phase, a machine learning classifier, trained on human annotations, produces a label giving the degree of “coreness” of the sentence being examined. Much of this paper deals with computing the classification features. One computation is based on string similarity, that is, the number of identical words or character sequences; character sequences of lengths two through five are considered. A second computation is based on semantic similarity, that is, the relationship between meanings in the two sentences. Two external resources were used for identifying meanings: ConceptNet [1] and Wikipedia Miner [2].
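The review does not give the paper's exact formulation, but the described string-similarity feature (shared words plus shared character sequences of lengths two through five) can be sketched as follows; the function names and the use of the Dice coefficient to combine overlaps are illustrative assumptions, not the authors' specification:

```python
def char_ngrams(text, n):
    """All character n-grams of length n in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def string_similarity(s1, s2):
    """Average Dice-coefficient overlap between two sentences,
    computed over words and over character n-grams of lengths
    two through five (a sketch, not the paper's exact measure)."""
    scores = []
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if w1 or w2:
        scores.append(2 * len(w1 & w2) / (len(w1) + len(w2)))
    for n in range(2, 6):  # character sequences of lengths 2-5
        g1, g2 = char_ngrams(s1.lower(), n), char_ngrams(s2.lower(), n)
        if g1 or g2:
            scores.append(2 * len(g1 & g2) / (len(g1) + len(g2)))
    return sum(scores) / len(scores) if scores else 0.0
```

Identical sentences score 1.0, and sentences sharing words and substrings score higher than unrelated ones, which is the behavior the feature is meant to capture.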

To compute semantic similarity, the authors weight words by their information content; low-frequency words have greater information content. A third computation measures the probability that the sentence in question would be generated by the set of words in the core set, using a computed probability distribution of words in the core set. Finally, a set of shallow sentence features is computed. The shallow feature used in the classifier is sentence length. In summary, the authors chose to use string similarity, semantic similarity, a generative model, and a shallow feature as input to the classifier.
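The generative feature described above can be illustrated with a unigram language model estimated from the core sentences; the smoothing scheme and parameter below are assumptions for the sketch, since the review does not specify how the probability distribution is computed:

```python
import math
from collections import Counter

def core_generation_logprob(sentence, core_sentences, alpha=0.1):
    """Log-probability that `sentence` is generated from the word
    distribution of the core set, using a unigram model with
    add-alpha smoothing (an illustrative sketch; the paper's
    smoothing details are not given in the review)."""
    core_words = [w for s in core_sentences for w in s.lower().split()]
    counts = Counter(core_words)
    vocab = set(core_words) | set(sentence.lower().split())
    total = len(core_words)
    logp = 0.0
    for w in sentence.lower().split():
        logp += math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
    return logp
```

A sentence built from words frequent in the core set receives a higher log-probability than an off-topic sentence of the same length, which is what lets this feature signal “coreness.”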

The metrics used to evaluate the classifier were accuracy, precision, recall, and the harmonic mean of precision and recall (the F1 score). The scores were compared with those of other identification systems, including the one presented by Foster et al. [3] and the COGENT system [4]. The full-featured model described here performed better: 13 percent higher in accuracy than Foster et al.’s system, which is only semi-automated, and 44 percent higher than COGENT’s. Results under the other metrics were similar. The authors claim their approach is computationally reasonable, but do not give pertinent details. The results presented in the paper are promising. The intended audience of this work is developers of algorithms for extracting core concepts from written documents.
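The four evaluation metrics named above are standard and can be computed from paired gold and predicted labels as follows (a generic sketch of the metrics themselves, not the paper's evaluation code):

```python
def classification_metrics(gold, pred, positive=1):
    """Accuracy, precision, recall, and F1 (the harmonic mean of
    precision and recall) for a binary labeling task."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```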


1) Liu, H.; Singh, P. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal 22, 4 (2004), 211–226.

2) Milne, D.; Witten, I. An open-source toolkit for mining Wikipedia. Artificial Intelligence 194 (2013), 222–239.

3) Foster, J. M.; Sultan, M. A.; Devaul, H.; Okoye, I.; Sumner, T. Identifying core concepts in educational resources. In Proc. 12th ACM/IEEE-CS Joint Conf. on Digital Libraries (New York), ACM, 2012, 35–42.

4) de la Chica, S.; Ahmad, F.; Martin, J. H.; Sumner, T. Pedagogically useful extractive summaries for science education. In Proc. 22nd Int. Conf. on Computational Linguistics, Volume 1 (Manchester, England), Association for Computational Linguistics, 2008, 177–184.

Reviewer:  B. Hazeltine Review #: CR143474 (1508-0729)
