Video classification is a challenging problem. One difficulty is that "there is [a] very limited amount of training data with manual annotations in the video domain." A two-stream convolutional neural network (CNN), with one stream operating on static frames (appearance) and the other on temporal motion, has been proposed to tackle this problem. The authors of the paper conduct a comparative study to demonstrate the competitive performance of the two-stream CNN against state-of-the-art methods.
The authors explain: "state-of-the-art video classification systems are usually built on top of multiple discriminative feature representations." These features are usually handcrafted. CNNs, by contrast, learn feature representations directly from raw data. In this study, the authors examine several design choices, "including [neural] network architectures, model fusion, learning parameters, and the final prediction methods." The network architectures considered include CNN_M and VGG_19; the fusion strategies include model fusion and modality fusion; and the learning parameters include the learning rate, dropout ratio, and number of training iterations. Combinations of these choices form the experimental settings. The authors use two datasets in their experiments. One, UCF-101, "consists of 13,320 video clips"; "there are 101 annotated classes that can be divided into five types." The other, Columbia Consumer Videos (CCV), "contains 9,317 YouTube videos annotated according to 20 classes." The results are compared against four previous studies on each dataset. The two-stream CNN outperforms the methods in these previous studies, especially on the CCV dataset.
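To make the fusion idea concrete, the sketch below illustrates late fusion, a common way to combine a two-stream network's outputs: each stream produces per-class scores, which are converted to probabilities and averaged with a fusion weight. The class labels, scores, and the 0.5 weight are hypothetical, chosen only for illustration; the paper's own fusion strategies (model fusion and modality fusion) may differ in detail.

```python
import math

def softmax(scores):
    """Convert raw per-class scores to probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(spatial_scores, temporal_scores, weight=0.5):
    """Weighted average of the two streams' class probabilities (late fusion)."""
    spatial_probs = softmax(spatial_scores)
    temporal_probs = softmax(temporal_scores)
    return [weight * a + (1 - weight) * b
            for a, b in zip(spatial_probs, temporal_probs)]

# Hypothetical per-class scores for a five-class example (made up for illustration)
spatial = [2.0, 0.5, 0.1, -1.0, 0.3]    # static-frame (appearance) stream
temporal = [1.5, 2.2, 0.0, -0.5, 0.1]   # temporal-motion stream
fused = late_fusion(spatial, temporal)
predicted = fused.index(max(fused))     # index of the highest fused probability
```

Varying the fusion weight is one of the "model fusion" knobs such a study can explore alongside the learning rate and dropout ratio.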
The paper is very well written, and the project described should be of high interest to researchers working in the area of video classification.