The paper presents a systematic and detailed evaluation of two main stages of bag-of-words (BoW) modeling for computer vision: the mid-level coding of image descriptors and the pooling of the resulting codes. BoW is an established technique in computer vision that represents images in terms of a vocabulary of image features.
The first part of the paper provides a thorough review of four mid-level coding approaches: soft assignment, its extension approximate locality-constrained soft assignment (LcSA), sparse coding (SC), and approximate locality-constrained linear coding (LLC). The authors analyze the coding accuracy and speed of each of these schemes, and suggest several interesting solutions to improve the system’s performance. For instance, minimizing the residual error of approximating a descriptor vector proves useful for setting the coding parameters optimally. This point is discussed and clearly illustrated. The authors also propose a fast hierarchical nearest neighbor search based on a compact dictionary of the l-nearest neighbors.
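To make the idea of locality-constrained soft assignment concrete, here is a minimal sketch in plain Python. It uses the common textbook form of soft assignment (normalized Gaussian-kernel memberships) and is not taken from the paper under review. The function names, the smoothing parameter `beta`, and the brute-force neighbor search are assumptions made for illustration; the paper's hierarchical search is not reproduced here.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def soft_assign(descriptor, dictionary, beta=1.0, l=None):
    """Code one descriptor as normalized Gaussian memberships.

    Hypothetical helper for illustration only.
    descriptor: list of floats.
    dictionary: list of visual words, each a list of floats.
    beta:       assumed smoothing parameter of the Gaussian kernel.
    l:          if given (l >= 1), only the l nearest visual words
                receive non-zero weight (locality constraint).
    """
    dists = [squared_distance(descriptor, word) for word in dictionary]
    indices = range(len(dictionary))
    if l is None:
        active = list(indices)
    else:
        # Brute-force search for the l nearest visual words.
        active = sorted(indices, key=lambda k: dists[k])[:l]

    # Subtracting the smallest distance avoids underflow in exp()
    # and leaves the normalized result unchanged.
    d_min = min(dists[k] for k in active)
    weights = [0.0] * len(dictionary)
    for k in active:
        weights[k] = math.exp(-beta * (dists[k] - d_min))

    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    words = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
    print(soft_assign([0.1, 0.0], words, beta=1.0))       # plain soft assignment
    print(soft_assign([0.1, 0.0], words, beta=1.0, l=2))  # locality-constrained
```

With `l` set, distant visual words receive exactly zero weight, which is what makes the resulting code sparse and the coding step cheaper than full soft assignment.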
The third section presents an exhaustive exploration of six pooling methods: average, max-pooling, power normalization, theoretical expectation of max-pooling and the probability of at least one particular visual word being present in an image, Lp-norm as a tradeoff between average and max-pooling, and mix-order max-pooling. The paper introduces a new scheme to supplement the max approach, and demonstrates its value by assessing cross-vocabulary leakage and descriptor interdependence. The experimental section evaluates the performance of the four mid-level coding approaches in combination with these pooling methods on a range of datasets. The authors outline the benefits of the suggested improvements for the classification results.
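Several of the pooling operators listed above have simple closed forms, sketched below in plain Python. These are the standard textbook definitions, not necessarily the exact formulations used in the paper; the function name `pool`, the method labels, and the default `p` are assumptions for illustration. The "at least one" variant assumes the codes lie in [0, 1] and can be treated as independent probabilities.

```python
def pool(codes, method="max", p=2.0):
    """Pool per-descriptor codes into a single image signature.

    Hypothetical helper for illustration only.
    codes:  list of code vectors, one per descriptor, all of equal length.
    method: "average", "max", "lp", or "at_least_one" (assumed labels).
    p:      exponent for Lp-norm pooling; p = 1 gives the average of
            absolute values, and large p approaches max-pooling.
    """
    n = len(codes)
    columns = list(zip(*codes))  # one tuple of responses per visual word

    if method == "average":
        return [sum(col) / n for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "lp":
        return [(sum(abs(c) ** p for c in col) / n) ** (1.0 / p)
                for col in columns]
    if method == "at_least_one":
        # Probability that a visual word fires at least once,
        # assuming codes in [0, 1] and independent descriptors.
        pooled = []
        for col in columns:
            prob_absent = 1.0
            for c in col:
                prob_absent *= 1.0 - c
            pooled.append(1.0 - prob_absent)
        return pooled
    raise ValueError("unknown pooling method: " + method)


if __name__ == "__main__":
    codes = [[0.2, 0.0], [0.6, 0.5]]
    for name in ("average", "max", "lp", "at_least_one"):
        print(name, pool(codes, method=name))
```

For non-negative codes and p of at least 1, the Lp-norm result lies between the average and the maximum, which is the tradeoff the review refers to.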
The paper’s sound investigation and proposed solutions represent a significant and valuable contribution. The paper is appropriate for researchers and designers of image recognition applications.