Detecting human actions is an active topic in computer vision. If successful, it has many important potential applications, from healthcare (for example, monitoring elderly people for falls at home) and surveillance (detecting anti-social behavior or someone trying to steal a car) to safety monitoring (whether a job is being performed adequately) and ergonomics/forensics (determining whether an action is/was possible).
This very interesting and topical paper is significant in that it attempts to achieve this from a single view, meaning that it can be achieved relatively cheaply--no expensive multi-camera setups with additional processing overheads are required. The paper focuses on modeling relationships between humans and the 3D scene, seeking to infer a human's pose and then determining functional constraints (where in the scene can the human operate?) to improve understanding of both the behavior and the scene. The main thrust of the paper is that, by observing human actions within the scene, estimates of the 3D scene geometry can be significantly improved--an active person in a suitable place can greatly improve scene interpretation. Functional regions are estimated that place 3D bounds on where a person may act. Furthermore, detected actions such as sitting, standing, and reaching enable surfaces within the scene to be classified as walkable, reachable, and so on. Accurate pose estimation remains a limitation; improvements there would further increase the accuracy of the results. Nonetheless, the current results are still impressive.
The research builds nicely upon recent advances in the field. As such, the paper is certainly a good read for any researcher wishing to get a snapshot of the state of the art in this field. The experiments are convincing, and the paper is well illustrated (a good YouTube video is also available).
Overall, this well-written paper should be accessible to anyone with an interest in scientific programming, image processing, artificial intelligence, or pattern recognition. It highlights an important emerging research area.