Automated image captioning is the task of generating a useful and accurate description of an image without manual input. An ideal system, when shown an image, generates a summary such as “the mailman is running from a large, scary dog.” Conceptually, the task consists of two parts: understanding the image content, and then expressing it in well-formed natural language. Liu et al.’s survey presents common image captioning methods (as of early 2018).
Earlier methods work on the basis of retrieval: searching a collection of labeled images and forming a description from the text associated with the matching image features. These methods rely heavily on large collections of existing labeled image regions and cannot cope with objects absent from those collections. Template methods attempt to form sentences that follow fixed patterns, such as object-relationship-object or “the cow is in the field.” As might be expected, template methods are well suited to images whose primary content is expressible in terms of spatial relationships (in, over, in front of) and less capable of expressing more complex relationships such as emotions.
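The slot-filling idea behind template methods can be sketched in a few lines. This is an illustrative toy, not any specific system from the survey: the detected subject, relation, and object are assumed to come from an upstream detector, and the template string is invented for the example.

```python
# Hypothetical template-based captioner. The detections (subject,
# relation, object) are assumed inputs from an upstream object and
# relationship detector; the template itself is illustrative.
TEMPLATE = "the {subject} is {relation} the {object}"

def caption_from_template(subject, relation, obj):
    # Slot filling: the sentence form is fixed and only the detected
    # words vary, which is why spatial relations fit this approach
    # well and abstract content such as emotion does not.
    return TEMPLATE.format(subject=subject, relation=relation, object=obj)

print(caption_from_template("cow", "in", "field"))
# → "the cow is in the field"
```

The rigidity is visible immediately: any image content that does not fit the fixed sentence frame simply cannot be expressed.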
As in other fields, the availability of deep learning has given rise to newer, more end-to-end methods of image captioning. For example, Google employs a convolutional neural network to transform images into abstract vectors that are mapped to natural language sentences using a recurrent neural network. Specific methods that employ deep learning include neural image caption (NIC), spatial and semantic attention-based models, and a self-adaptive attention model. The specifics of each attention-based model are presented in the paper, and many references contain the necessary details.
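The encoder-decoder structure common to these deep methods can be sketched with a toy numerical model. This is a minimal illustration only: the dimensions are tiny, the weights are random (so the "caption" is just token ids), and a plain RNN stands in for the pretrained CNN encoder and LSTM decoder that a real NIC-style system would use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; a real NIC-style model uses a pretrained CNN for the
# image vector and an LSTM decoder over a large vocabulary.
FEAT, HID, VOCAB = 8, 16, 5
W_img = rng.standard_normal((HID, FEAT)) * 0.1   # image feature -> initial state
W_h   = rng.standard_normal((HID, HID)) * 0.1    # recurrent weights
W_x   = rng.standard_normal((HID, VOCAB)) * 0.1  # one-hot word -> hidden input
W_out = rng.standard_normal((VOCAB, HID)) * 0.1  # hidden state -> vocab logits

def greedy_decode(image_feat, max_len=4):
    """Sketch of CNN-encoder / RNN-decoder captioning with greedy search.
    Weights are untrained, so output tokens are meaningless ids."""
    h = np.tanh(W_img @ image_feat)            # encode image into initial state
    word = np.zeros(VOCAB); word[0] = 1.0      # <start> token as one-hot
    caption = []
    for _ in range(max_len):
        h = np.tanh(W_h @ h + W_x @ word)      # simple RNN step
        logits = W_out @ h
        idx = int(np.argmax(logits))           # greedy: take the top word
        caption.append(idx)
        word = np.zeros(VOCAB); word[idx] = 1.0
    return caption

print(greedy_decode(rng.standard_normal(FEAT)))
```

Attention-based variants differ mainly in the decode loop: instead of a single fixed image vector, each step re-weights a grid of spatial features before predicting the next word.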
Of special importance is the discussion of metrics commonly used for evaluating captioning methods. Clearly, simple correct/incorrect metrics are unsuitable. Commonly used measures include: bilingual evaluation understudy (BLEU), which measures the fraction of candidate n-grams that also appear in reference captions; METEOR, which considers the matching of words, synonyms, and stems; and ROUGE and CIDEr, which measure the degree of consensus with a collection of manually generated descriptions. Using these various metrics, results are shown for the online MS-COCO test server. The adaptive attention and semantic attention models perform best, though the gap between them and more limited models, such as the template-based methods, is surprisingly small. There is much room for improvement. Encouragingly, the models rank similarly by performance across all of the comparison metrics.
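The n-gram overlap at the heart of BLEU is easy to make concrete. The sketch below computes clipped (modified) n-gram precision against a single reference; full BLEU additionally combines several n-gram orders and applies a brevity penalty, and the example sentences are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    # Clipped n-gram precision, the core of BLEU: each candidate
    # n-gram is credited only up to its count in the reference, so
    # repeating a matched word cannot inflate the score.
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

cand = "the dog is running in the field".split()
ref = "a large dog runs across the field".split()
print(modified_precision(cand, ref, 1))  # → 3/7 ≈ 0.43
print(modified_precision(cand, ref, 2))  # → 1/6 ≈ 0.17
```

The sharp drop from unigram to bigram precision here illustrates a known weakness of surface-overlap metrics: "running" and "runs" earn no credit, which is exactly the gap that METEOR's stem and synonym matching is designed to close.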
The value of a survey paper in a rapidly developing field is to provide a comprehensive list of useful references, a brief discussion of recent developments, and a look into key current issues. The references cover most of the good relevant work; the discussion of the history of image captioning and the current themes is approachable and useful; and the comparison of frequently used metrics is especially helpful. This good survey paper provides a clear marker of the state of the art in automated image captioning.