Computing Reviews

BabyTalk: understanding and generating simple image descriptions
Kulkarni G., Premraj V., Ordonez V., Dhar S., Li S., Choi Y., Berg A., Berg T. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(12): 2891-2903, 2013. Type: Article
Date Reviewed: 02/13/14

Kulkarni et al. propose a system that automatically generates image descriptions from predicted image content and natural language statistics. They claim that the system can produce a correct description of an image in the form of natural language sentences, much as humans can. The proposed system has two stages: content planning, in which computer vision-based detection and recognition predict the image content, and surface realization, in which natural language sentences are framed from the predicted content and natural language statistics. The paper explores several surface realization methods and evaluates how closely each automatically generated sentence matches human interpretation.
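To make the two-stage design concrete, the following Python sketch shows how such a pipeline might be organized. The function names, the toy content plan, and the sentence template are illustrative assumptions on my part, not the paper's actual implementation.

    # Hypothetical sketch of a two-stage description pipeline (names are invented).

    def content_planning(image):
        """Stage 1: vision-based detection and recognition yield a content plan:
        detected objects with attributes, plus pairwise spatial relations."""
        objects = [("dog", "brown"), ("grass", "green")]   # (label, attribute)
        relations = [("dog", "on", "grass")]               # (subject, preposition, object)
        return objects, relations

    def surface_realization(objects, relations):
        """Stage 2: frame natural language sentences from the content plan."""
        attr = dict(objects)
        return " ".join(
            f"The {attr[s]} {s} is {p} the {attr[o]} {o}."
            for s, p, o in relations
        )

    objects, relations = content_planning(image=None)  # stand-in for a real image
    print(surface_realization(objects, relations))     # The brown dog is on the green grass.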

The system uses a conditional random field (CRF) to label image content, treating objects such as cars or people as "things" and regions such as grass or water as "stuff." Each candidate undergoes detection and recognition, and those with high detection scores are retained. Attribute classifiers then assign visual attributes to the detected objects, drawing on the computer vision-based content predictions and on natural language statistics, while prepositional relationship functions encode the spatial relationships between pairs of objects. The predicted labeling thus covers the presence of objects, their visual attributes, and the relationships between them, which the surface realization step frames as natural sentences.
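One way to picture the CRF's role is that it scores candidate labelings by combining detector confidences, attribute classifier scores, and prepositional relation scores. The toy scorer below is a heavily simplified, assumption-laden sketch: the scores are made up, and the real model is a proper CRF with learned potentials, not a bare sum of log-scores.

    import math

    # Toy scores standing in for real detector/classifier outputs (assumed values).
    detection_score = {"dog": 0.9, "grass": 0.8}          # "thing" and "stuff" detections
    attribute_score = {("dog", "brown"): 0.7, ("grass", "green"): 0.95}
    relation_score  = {("dog", "on", "grass"): 0.85}

    def labeling_log_score(objects, attributes, relations):
        """Sum of log-potentials for one candidate labeling (unnormalized)."""
        s = sum(math.log(detection_score[o]) for o in objects)
        s += sum(math.log(attribute_score[(o, a)]) for o, a in attributes.items())
        s += sum(math.log(relation_score[r]) for r in relations)
        return s

    print(labeling_log_score(["dog", "grass"],
                             {"dog": "brown", "grass": "green"},
                             [("dog", "on", "grass")]))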

The CRF's potential functions are both image based, built from off-the-shelf detectors that identify a wide range of objects, and descriptive language based. For surface realization, the paper presents three sentence-generation techniques: decoding with an n-gram language model; a more flexible optimization based on integer linear programming (ILP), which handles a wider range of generation constraints; and a template-based approach.
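To give a flavor of the first technique, n-gram decoding ranks candidate word orderings by language model score. The tiny bigram model below, trained on a made-up two-sentence "corpus," is only an illustration of the idea, not the paper's model.

    from collections import Counter
    import math

    # A tiny stand-in corpus for the text statistics behind the language model.
    corpus = "the brown dog sits on the green grass . the dog is brown .".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def bigram_logprob(words, alpha=0.1):
        """Add-alpha smoothed bigram log-probability of a word sequence."""
        V = len(unigrams)
        return sum(
            math.log((bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * V))
            for w1, w2 in zip(words, words[1:])
        )

    # Decoding: pick the candidate ordering the language model prefers.
    candidates = [
        "the brown dog sits on the green grass".split(),
        "the green grass sits on the brown dog".split(),
    ]
    print(" ".join(max(candidates, key=bigram_logprob)))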

The experimental evaluation uses the UIUC PASCAL sentence dataset of 847 images, of which 153 are used to set the CRF parameters and detection thresholds. The authors compare machine-generated sentences to human-generated sentences using two standard measures: the bilingual evaluation understudy (BLEU) score, which measures modified n-gram precision, and the recall-oriented understudy for gisting evaluation (ROUGE) score, which uses n-gram recall instead of BLEU's precision. The authors also perform human subject-based evaluations, in this case forced-choice experiments on the PASCAL sentence dataset using template-based generation. The evaluations show how the automatic measures compare with human assessments of content, grammar, sentence length, and so on. The authors state that the template-based generation scheme produces good results on PASCAL images.
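Readers who want to reproduce this style of automatic evaluation can use, for example, NLTK's BLEU implementation. The sentences below are invented for illustration, not drawn from the dataset, and the hand-rolled ROUGE-1 recall is a simplified set-based approximation of the real measure.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Invented human references and a machine hypothesis for illustration.
    references = [
        "a brown dog is sitting on the green grass".split(),
        "the dog sits on the grass".split(),
    ]
    hypothesis = "the brown dog is on the green grass".split()

    # BLEU-1: modified unigram precision, smoothed for short sentences.
    smooth = SmoothingFunction().method1
    print(sentence_bleu(references, hypothesis,
                        weights=(1.0,), smoothing_function=smooth))

    # Simplified ROUGE-1 recall: fraction of reference unigrams recovered.
    ref, hyp = set(references[0]), set(hypothesis)
    print(len(ref & hyp) / len(ref))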

For future work, the authors propose producing more natural-sounding image descriptions, describing image content selectively, and enabling the method to handle more general image content. They also hope to extend the approach to actions and scenes in order to describe video content. Overall, I found reading this paper worthwhile.

Reviewer:  Lalit Saxena Review #: CR141999 (1406-0476)
