This fascinating and important survey summarizes current research efforts toward automatic and semiautomatic indexing of video. With the explosion of video online, there is a growing need, among professionals and consumers alike, to find specific scenes among many thousands of frames. While major strides have been made over the past decade using techniques centered on the video itself, indexing improves further when the sequence of video frames is combined with whatever other information is available.
Rather than simply enumerating techniques in a dictionary-style list, the authors set up a framework for discussing the various approaches. In their view, indexing should take the viewpoint of the video's author. They point out, for example, that if the author thought it important to include audio in a specific setting, then the audio and setting modalities can serve as clues for more effective indexing. They address both the question of what to index and the question of which material to index.
After presenting their framework, the authors discuss segmenting the video document, and analyzing and exploiting the multiple modalities. Finally, they summarize open research questions for further work, all within the context of their framework, and stress that multimodal analysis is the most promising direction for future video indexing.
This is a summary survey paper, primarily textual, with a few tables and diagrams. It is descriptive, avoiding mathematical complexity and algorithmic detail. It represents a good starting point for further work in this significant field.