When accessing information on the web, users not only consume it but also comment on, rate, and actively annotate the content, which in turn generates new content. People express themselves on the web through blogs, wikis, forums, and social networks, and give their feedback and opinions on varying topics, including politics, healthcare, products, and travel. Customers seek opinions about products and services from other consumers who have experience with them; useful and influential opinions can be generated by aggregating and accumulating feedback from multiple sources.
This survey paper introduces subjectivity analysis, which includes opinion mining, opinion aggregation, sentiment analysis, and contradiction analysis. Opinion mining uses four approaches: machine learning, dictionary-based, statistical, and semantic. The survey covers opinion classification, topic-specific features, sentiment categories/continuous range, target domains, and scale of the algorithm.
Opinions, also called sentiments or emotions, can present as anger, disgust, fear, joy, sadness, or surprise. Opinion mining has two steps: identify topics and classify sentences/documents. Sentiment classification distinguishes between positive, negative, and neutral texts. Sentiment analysis can also be framed as a rating-inference task, in which class labels (often one to five “stars”) represent the polarity of an opinion. Sentiments are aggregated over time and space for a query and can be presented along several dimensions, for example, joy-sadness, acceptance-disgust, and anticipation-surprise.
Machine learning methods and annotated datasets have contributed to advancements in opinion mining. Machine learning methods train a model using corpus data; once trained, the model is used to classify new data. The classifier learns to distinguish among the sentiment labels by analyzing relevant features, which are then used to predict sentiments for new documents.
The machine learning algorithms used are support vector machines (SVMs), naive Bayes, and maximum entropy; SVM performed best. Machine learning is highly sensitive to feature selection. Each feature is encoded as present/absent, so the complete document is encoded as a binary vector. The polarity of a sentence/document is determined by averaging the polarity of individual words and sentiments. When there are more than two classes, sentiments are encoded as discrete values; the continuous form uses scalar values. The continuous form offers better resolution with finer control, but “is not favored by the classification algorithms.”
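The two encodings described above can be illustrated with a minimal Python sketch. It is not taken from the survey: the toy lexicon, the whitespace tokenizer, and the function names `binary_vector` and `average_polarity` are all assumptions made for illustration.

```python
# Hypothetical toy lexicon mapping words to polarities in [-1, 1].
# Real systems rely on curated sentiment dictionaries; this one is illustrative only.
LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "poor": -1.0}


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def binary_vector(text, vocabulary):
    """Encode a document as a presence/absence vector over a fixed vocabulary."""
    tokens = set(tokenize(text))
    return [1 if term in tokens else 0 for term in vocabulary]


def average_polarity(text, lexicon=LEXICON):
    """Average the polarity of the words found in the lexicon.

    Returns 0.0 (treated here as neutral) when no word carries a known polarity.
    """
    scores = [lexicon[t] for t in tokenize(text) if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


vocabulary = sorted(LEXICON)  # fixed feature order: bad, good, great, poor
review = "great screen but poor battery and bad support"

print(binary_vector(review, vocabulary))  # presence/absence features
print(average_polarity(review))           # continuous polarity score
```

The binary vector is the kind of input a classifier such as an SVM or naive Bayes would consume, whereas the averaged score is a continuous value that would have to be thresholded or binned before it could serve as a class label.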
Review mining is based on opinion aggregation obtained by processing, mining, and reasoning on customer feedback data; sentiment polarities are aggregated over frequent features. However, aggregation has weak points: it smooths out the variance in opinions, and aggregate values can be manipulated via artificially introduced data, for example, fake reviews. Such manipulation is not yet well studied.
Opinion aggregation may sometimes produce a lossy summarization of the available opinion data when diversity in the data is ignored. This has necessitated contradiction analysis, that is, the analysis of “features that contribute to a contradiction,” for example, antonymy, negation, and numeric mismatches. Contradiction on a topic may also change over time, so time is added as a parameter. Communities in the blogosphere transition between high and low entropy states over time. Entropy grows as the diversity of opinions grows.
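The link between entropy and diversity of opinion can be made concrete with a small Python sketch. It assumes Shannon entropy over discrete sentiment labels within a time window; this is an illustrative choice, not necessarily the measure used in the surveyed work, and the function name `opinion_entropy` and the sample data are hypothetical.

```python
import math
from collections import Counter


def opinion_entropy(labels):
    """Shannon entropy (in bits) of the sentiment labels in one time window.

    Assumption: opinions are already reduced to discrete labels such as
    "pos" and "neg". An empty window is given an entropy of 0.0.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # sum of p * log2(1 / p), with p = count / total for each label
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical windows: a unanimous community and an evenly divided one.
windows = {
    "week 1": ["pos", "pos", "pos", "pos"],
    "week 2": ["pos", "neg", "pos", "neg"],
}

for name, labels in windows.items():
    print(name, opinion_entropy(labels))
```

A unanimous window has zero entropy, while an evenly split two-label window reaches the maximum of one bit, so tracking this value per window gives a simple signal of when opinions on a topic diverge.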
This survey paper provides a concise but thorough introduction to the progress of research in the field of sentiment analysis and opinion mining while sharply distinguishing individual works. It is easy reading (only essential mathematics is included), so it will interest a large audience. Finally, though the paper is from 2012, the field continues to face many challenges. The application of machine learning in micro-blogging is still new and relevant today.