Annotating multimedia data is a topical yet challenging problem. Web sites such as YouTube store terabytes or even petabytes of video data, and enabling user navigation and retrieval over such huge collections requires automatic processing. Among these techniques, automatic video annotation seeks to label each video with predefined concepts, which end users can then search for.
In this paper, the authors propose a new paradigm for video annotation that deals more effectively with multi-label annotation (since a video usually concerns several topics). Instead of using a set of binary classifiers, one per concept, or merging these binary classifiers in a post-processing step called context-based conceptual fusion, they introduce an integrated multi-label approach that explicitly models both the concepts themselves and the interactions between them, thus avoiding reliance on premature decisions made by binary classifiers. Their experiments, performed on the Text Retrieval Conference Video Retrieval Evaluation (TRECVID) dataset, show the relevance of this approach, but also point to the need for efficient algorithms, since the proposed solution is still very far from real time.
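To make the contrast concrete, the following is a minimal sketch, not the paper's actual method: the concept names, scores, and co-occurrence weights are invented for illustration. It shows how independent per-concept binary decisions can differ from decisions obtained after a context-based fusion step that reweights each concept's score using its correlation with the other concepts. (The paper's integrated approach goes further still, learning concepts and their interactions jointly rather than fusing fixed binary outputs after the fact.)

```python
# Hypothetical example: three concepts for one video, with illustrative
# classifier scores. "outdoor" falls just below the decision threshold.
concepts = ["sports", "outdoor", "crowd"]
binary_scores = [0.70, 0.45, 0.65]

# Naive multi-label annotation: threshold each binary classifier independently.
independent = [s > 0.5 for s in binary_scores]   # "outdoor" is rejected

# Context-based conceptual fusion (post-processing): adjust each concept's
# score by a weighted average over correlated concepts, using a toy
# symmetric co-occurrence matrix (diagonal = self-weight).
cooc = [
    [1.0, 0.4, 0.6],
    [0.4, 1.0, 0.3],
    [0.6, 0.3, 1.0],
]
fused_scores = [
    sum(w * s for w, s in zip(row, binary_scores)) / sum(row)
    for row in cooc
]
fused = [s > 0.5 for s in fused_scores]          # "outdoor" is now accepted

for name, ind, fus in zip(concepts, independent, fused):
    print(f"{name}: independent={ind}, fused={fus}")
```

Here the strong "sports" and "crowd" evidence pulls the borderline "outdoor" score above threshold, which is exactly the kind of correction fusion provides; the drawback the authors highlight is that fusion still operates on the binary classifiers' (possibly premature) outputs.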
To deliver automatic solutions to the market, research efforts should now focus not only on the reliability of automatic annotation systems, but also, and perhaps even more, on their efficiency. This issue should be tackled more often by research teams in the field of multimedia indexing and retrieval.