It is well known that the current types of neural networks perform quite well in identifying faces in (still) images. But what about re-identifying moving pedestrians in non-overlapping video sequences taken from different cameras?
The paper’s novel approach improves re-identification accuracy (measured here as rank-1 recognition rate and mean average precision) by 40.8 percent and 4.2 percent, respectively, compared to the best alternative video-based and image-based algorithms. This is achieved essentially by two innovative design choices in their 56-layer (generalized) convolutional neural network with 3.2 million trainable parameters.
First, the authors compose the whole network of four so-called dense 3D blocks, in which each layer is connected in a feed-forward fashion to all(!) subsequent layers of the block (as opposed to just the immediately following layer). This design was chosen to “enlarge the receptive ... neurons in both spatial and temporal dimensions,” enabling the network to discriminate “short-term and long-term motion patterns.”
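To make the connectivity pattern concrete, here is a minimal NumPy sketch of such a densely connected block: each layer consumes the channel-wise concatenation of the block input and all preceding layers’ outputs. The 1x1x1 stub “convolution,” the tensor shapes, and the growth rate are my own illustrative assumptions, not the authors’ actual architecture.

```python
import numpy as np

def conv3d_stub(x, out_channels, seed):
    # Hypothetical stand-in for a real 3D convolution: random 1x1x1 channel
    # mixing plus ReLU, enough to illustrate connectivity, not real filtering.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((out_channels, x.shape[0]))
    return np.maximum(0.0, np.einsum("oc,cthw->othw", w, x))

def dense_3d_block(x, num_layers=4, growth=8):
    """DenseNet-style connectivity over spatio-temporal features: layer i
    receives the concatenation of the input and layers 0..i-1 along the
    channel axis, so later layers see all earlier feature maps directly."""
    features = [x]
    for i in range(num_layers):
        inp = np.concatenate(features, axis=0)  # channels-first: (C, T, H, W)
        features.append(conv3d_stub(inp, growth, seed=i))
    return np.concatenate(features, axis=0)

x = np.ones((16, 4, 8, 8))  # (channels, time, height, width)
y = dense_3d_block(x)       # 16 + 4 * 8 = 48 output channels
```

Note how the output channel count grows linearly with block depth, which is why dense blocks are typically followed by transition layers that compress channels.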
Second, they extend the classical identification loss function with a second term that minimizes the center loss: the algorithm maintains a “center” for the training samples of each class and penalizes the distance of each sample’s feature embedding from its class center, pulling embeddings of the same identity together.
As is evident from the decidedly technical description above, the paper is geared toward neural network specialists, as even advanced concepts are used without any explanation in the text. Complemented by highly illustrative graphics and even pseudocode for the overall algorithm, the authors rightly concentrate on providing all the details necessary for (re)building their nontrivial neural network. And while I have not checked this meticulously, I am confident that they come very close to achieving this goal.