Given a video and a segmentation algorithm, how does one reconstruct the foreground object from consecutive frames without any predefined models? This paper describes a solution to this fundamental and important problem.
To solve this problem, Drouin et al. propose a framework consisting of a tracker, a segmenter, and a modeler. The tracker estimates a new set of pose parameters for every frame so that the model fits the current frame. The segmenter uses the widely used graph-cut algorithm to produce a conventional image segmentation of each video frame. Both the tracking and segmentation results are consumed by the modeler, which is chiefly responsible for controlling the merging of newly found object parts into the model.
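To make the division of labor concrete, the three-component loop described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all class names, interfaces, and the placeholder bodies are hypothetical, and a real system would replace them with actual pose optimization and a min-cut solver.

```python
# Hypothetical sketch of the tracker/segmenter/modeler pipeline from the
# reviewed paper. All names and interfaces here are illustrative assumptions.

class Tracker:
    """Estimates pose parameters so the current model fits the new frame."""
    def estimate_pose(self, model, frame):
        # Placeholder: a real tracker would minimize a model-to-frame error.
        return {"frame": frame}

class Segmenter:
    """Labels pixels foreground/background, e.g. via a graph cut."""
    def segment(self, frame):
        # Placeholder: a real segmenter would run min-cut on a pixel graph.
        return {"foreground_mask": f"mask_{frame}"}

class Modeler:
    """Merges newly discovered object parts into the evolving model."""
    def update(self, model, pose, segmentation):
        model["parts"].append(segmentation["foreground_mask"])
        return model

def reconstruct(frames):
    """Run the per-frame loop: track, segment, then merge new parts."""
    model = {"parts": []}
    tracker, segmenter, modeler = Tracker(), Segmenter(), Modeler()
    for frame in frames:
        pose = tracker.estimate_pose(model, frame)  # fit model to frame
        seg = segmenter.segment(frame)              # graph-cut labeling
        model = modeler.update(model, pose, seg)    # merge found parts
    return model

model = reconstruct(range(3))
print(len(model["parts"]))  # one merged part per processed frame
```

The key design point the sketch highlights is that segmentation and tracking run independently per frame, while the modeler alone decides how their outputs accumulate into the object model.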
The authors show results on image sequences of a human and of a robot arm. Assuming the method works as promised, its biggest downside appears to be computational cost; by the authors' own account, it is far from real-time.
I worked in this field myself a couple of years ago, and I have rarely encountered such a sharp and well-thought-out idea in this area. I definitely recommend this paper to everybody who wants to work on video-object extraction. Unfortunately, no clear benchmark evaluation is given that would quantify both accuracy and robustness under different conditions using an accepted error measure.