To decide whether the same person appears in two images (reidentification), one can use either an identification model, which classifies a person image to an identity, or a verification model, which classifies whether two images show the same person. Each approach has its own advantages and disadvantages. The two models differ in their inputs, their feature extraction, and the loss function used to train them. The verification model forces two images of the same person to be mapped to nearby points in the resulting feature space. In contrast, the identification network tries to identify the person rather than to discriminate that person from another. A verification network does not consider the relationship between the given image pair and the other images in the dataset, whereas the identification model tunes its features to classify each person accurately. As their first result, the authors show that using the verification model alone is worse than using the identification model alone for the reidentification task.
This paper combines the two models to obtain more discriminative features. Specifically, the authors take well-known image classification networks (CaffeNet, VGG16, and ResNet-50), use the input to their last layers as nonlinear embeddings of the images, and feed these embeddings to the two models, simultaneously minimizing the identification loss and the verification loss. Thus, for a pair of images, the network predicts the identity of each image and whether the two images show the same person. The authors show that their method improves Rank 1 accuracy by as much as 5 to 11 percent, depending on the network, compared to using only the identification model, and by as much as 8 to 21 percent compared to using only the verification model.
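To make the combined objective concrete, the following is a minimal sketch of a joint loss for one image pair. It is not the authors' implementation. It assumes cross-entropy for the identification term and a contrastive form for the verification term; the function names, the `margin` and `weight` parameters, and the plain sum of the terms are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Turn raw identity scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identification_loss(logits, identity):
    """Cross-entropy between the predicted identities and the true identity."""
    return -math.log(softmax(logits)[identity])


def verification_loss(emb_a, emb_b, same_person, margin=1.0):
    """Contrastive loss on the Euclidean distance between two embeddings.

    `margin` is a hypothetical hyperparameter chosen for illustration.
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    if same_person:
        return dist ** 2  # pull images of the same person together
    return max(0.0, margin - dist) ** 2  # push different people apart


def joint_loss(logits_a, logits_b, id_a, id_b, emb_a, emb_b, weight=1.0):
    """Sum of both identification terms and the weighted verification term.

    `weight` is an assumed balancing factor, not a value from the paper.
    """
    same_person = id_a == id_b
    return (identification_loss(logits_a, id_a)
            + identification_loss(logits_b, id_b)
            + weight * verification_loss(emb_a, emb_b, same_person))
```

In this sketch, the identification terms depend only on each image's own label, while the verification term depends only on whether the pair matches, which is why the two signals can complement each other.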
The paper is well written, with a clearly articulated problem statement, a discussion of differences from prior work, evaluation metrics, results, and an analysis of different aspects of the results. The authors describe both models in detail and highlight their salient points. They compare the loss functions used by the two models: cross-entropy loss (as the identification loss) and contrastive loss (as the verification loss). The results show that their method achieves 45 percent Rank 1 accuracy even on images from low-resolution cameras. The presentation of the formulas could have been improved: they are stated without the explanation that a general reader would need. Overall, though, this is nice work with good results.