Human viewers use context. A blurry box on the road, in a spot where cars usually are, is probably a car; a blurry, vertical, human-sized shape near a car is probably a person.
Existing automatic image recognition systems already use elements of context-based reasoning. However, these systems are mostly limited to situations where one object has already been recognized; that object is then used to gauge the size and function of nearby objects. Human understanding is more complex: sometimes we recognize an object as a car because a person stands near it, and sometimes we recognize an object as a person because he or she stands near a car.
The paper begins with a convincing example: two objects that are not individually recognizable become easily recognizable when placed together. To understand such images automatically, the authors propose a new integrated technique that simultaneously estimates the three-dimensional (3D) geometry of the scene and recognizes the objects in it. They use a Bayesian approach, with the prior probabilities obtained from a statistical analysis of actual images. They make the additional assumption that all objects rest at ground level. While this assumption may sound too restrictive, ordinary human understanding has the same limitation: we easily notice a person at street level, but spotting a person up in a tree is a harder visual puzzle. The resulting integrated approach works surprisingly well.
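The flavor of this Bayesian fusion can be sketched in a few lines. The sketch below is an illustration only, not the authors' actual model: the camera parameters, the height priors, and the function names are all hypothetical assumptions. Under the ground-level assumption, an object's position and size in the image imply a real-world height, and a class-specific prior over heights can be multiplied with an appearance-based detector score.

```python
import math

# Hypothetical camera parameters (assumptions for illustration):
V_HORIZON = 120.0      # image row of the horizon
CAMERA_HEIGHT = 1.6    # camera height above the ground plane, in metres

# Assumed real-world height priors per class: (mean, std dev) in metres.
HEIGHT_PRIORS = {"person": (1.7, 0.15), "car": (1.5, 0.2)}

def implied_world_height(v_base, pixel_height):
    """Ground-plane geometry: an object whose base sits on the ground at
    image row v_base (below the horizon) and spans pixel_height rows has
    world height h = CAMERA_HEIGHT * pixel_height / (v_base - V_HORIZON)."""
    return CAMERA_HEIGHT * pixel_height / (v_base - V_HORIZON)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fuse(detector_scores, v_base, pixel_height):
    """Multiply each class's appearance score by the geometric likelihood
    of its implied world height, then renormalize (a Bayesian product)."""
    h = implied_world_height(v_base, pixel_height)
    joint = {c: detector_scores[c] * gaussian_pdf(h, *HEIGHT_PRIORS[c])
             for c in detector_scores}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}
```

For example, if the appearance detector is completely ambiguous between "person" and "car" (score 0.5 each), but the box's base and height imply a world height of 1.7 metres, the geometric prior tips the posterior toward "person". The actual system performs a much richer joint inference, recovering the horizon and camera height themselves rather than assuming them.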