Computing Reviews
BabyTalk: understanding and generating simple image descriptions
Kulkarni G., Premraj V., Ordonez V., Dhar S., Li S., Choi Y., Berg A., Berg T. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(12): 2891-2903, 2013. Type: Article
Date Reviewed: Feb 13 2014

Kulkarni et al. propose a system that automatically generates image descriptions from predicted content and natural language statistics. They claim that the system can produce a correct description of an image in natural language sentences, much as a human would. The proposed system has two stages: computer vision-based detection and recognition, followed by content planning; and surface realization, which frames natural language sentences from the predicted content and natural language statistics. The paper explores several surface realization strategies and evaluates how closely the automatically generated sentences match human descriptions.

The system uses a conditional random field (CRF) to predict image content, labeling objects such as cars or people as things, and regions such as grass or water as stuff. Each candidate object undergoes detection and recognition, and objects with high detection scores are collected in a group. Attribute classifiers then assign the detected objects to classes based on their visual attributes, and natural language statistics are applied to the predicted content. Prepositional relationship functions describe how pairs of objects relate to one another, so that the surface realization step can encode object presence, visual attributes, and relationships between objects when framing sentences.
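The grouping step described above can be pictured with a minimal sketch. It assumes detections arrive as (label, score) pairs and that a single score threshold applies; the label sets, the threshold value, and the function name are hypothetical and are not taken from the paper.

```python
# Hypothetical label sets, used only for illustration.
THINGS = {"car", "person"}
STUFF = {"grass", "water"}


def group_detections(detections, threshold=0.5):
    """Keep detections scoring at or above the threshold, split into things and stuff.

    `detections` is a list of (label, score) pairs; the 0.5 default is an assumption.
    """
    groups = {"things": [], "stuff": []}
    for label, score in detections:
        if score < threshold:
            continue  # low-confidence detections are dropped
        if label in THINGS:
            groups["things"].append(label)
        elif label in STUFF:
            groups["stuff"].append(label)
    return groups
```

In the system as the review describes it, the retained candidates would then be passed on to the attribute classifiers and relationship functions.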

The paper presents image-based and descriptive language-based potential functions using off-the-shelf detectors to identify more objects. For surface realization, the paper presents three sentence-generation techniques: decoding using an n-gram language model, flexible optimization based on integer linear programming (ILP) for handling a wider range of constraints on generation, and a template-based approach.
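Of the three techniques, the template-based approach is the easiest to illustrate. The sketch below assumes the predicted content has been reduced to two (attribute, noun) pairs joined by a preposition; the template wording and the helper names are hypothetical, not the authors' templates.

```python
def article(phrase):
    """Pick an indefinite article from the first letter (a rough heuristic)."""
    return "an" if phrase[0].lower() in "aeiou" else "a"


def noun_phrase(attribute, noun):
    """Build a phrase such as 'an old car'; the attribute may be an empty string."""
    phrase = f"{attribute} {noun}" if attribute else noun
    return f"{article(phrase)} {phrase}"


def describe(first, preposition, second):
    """Fill a fixed sentence template from two (attribute, noun) pairs."""
    return f"There is {noun_phrase(*first)} {preposition} {noun_phrase(*second)}."


print(describe(("old", "car"), "by", ("", "road")))
# There is an old car by a road.
```

A fixed template like this guarantees grammatical output but limits variety, which is the trade-off the n-gram and ILP alternatives are meant to address.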

The experimental evaluation uses the UIUC PASCAL sentence dataset of 847 images, of which 153 are used to set CRF parameters and detection thresholds. The authors compared machine-generated sentences to human-generated sentences using two standard methods: the bilingual evaluation understudy (BLEU) score, which measures modified n-gram precision, and the recall-oriented understudy for gisting evaluation (ROUGE) score, which uses n-gram recall rather than the precision measure used in BLEU. The authors also performed human subject-based evaluations, in this case forced-choice experiments on the PASCAL sentence dataset using template-based generation. The evaluations show that automatic measures perform better than human competency based on content, grammar, sentence length, and so on. The authors state that the template-based generation scheme produces good results on PASCAL images.
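The difference between the two automatic measures can be shown with a small sketch. It computes the clipped (modified) n-gram precision that BLEU is built on and a plain n-gram recall in the spirit of ROUGE-N. It is a simplification: full BLEU also combines several n-gram orders and applies a brevity penalty, and the function names here are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision: each candidate n-gram counts at most as often
    as it appears in the single reference where it is most frequent."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def ngram_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate."""
    ref = ngrams(reference, n)
    if not ref:
        return 0.0
    cand = ngrams(candidate, n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())
```

For the reference "the cat is on the mat", a candidate consisting of "the" repeated seven times gets a modified unigram precision of 2/7, because the reference contains "the" only twice. The short candidate "the cat" has perfect precision but a unigram recall of only 2/6.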

For future work, the authors propose generating more natural-sounding image descriptions, describing content more selectively, and developing a method capable of handling more general image content. They also hope to extend the present approach to include actions and scenes in order to describe video content. Overall, I found reading this paper worthwhile.

Reviewer:  Lalit Saxena Review #: CR141999 (1406-0476)
Computer Vision (I.5.4 ... )
Text Analysis (I.2.7 ... )
Natural Language Processing (I.2.7)
 