This paper describes research done at MIT to develop a computer system that can locate a speaker in a scene and determine to whom the speaker is talking. No intended application is described, but likely uses include security systems, automated TV studios, robotic systems, or clandestine surveillance.
The system includes dual stereo cameras and microphones in a fixed position. To calibrate the system, a person speaks directly to the cameras and microphones. The function of the audio is to determine which person is speaking. The functions of the video are to identify the speaker, by detecting moving lips, and to determine to whom the speaker is talking.
The direction of speaking is derived by sampling the one- to six-kilohertz (1-6 kHz) band of each microphone independently; the system then computes a time-difference correlation between the microphone signals to determine the speaker's direction. The video is used to locate a face, to detect the facial movements produced by speech, and to determine the face's orientation (speaking direction). The system can then correlate the speaking face with the sound. The computations are described using statistical formulas that correlate Gaussian density and distribution functions derived from the audio and video inputs. It would have been helpful if the authors had included a graphical representation of the distribution functions, to illustrate the microphone and video inputs and the formula output.
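The paper does not give its implementation, but the time-difference correlation it describes is the standard technique of locating the peak of the cross-correlation between two microphone signals. The following minimal sketch, with hypothetical function and variable names, illustrates the idea on a synthetic signal:

```python
import numpy as np

def estimate_delay(sig_a, sig_b, sample_rate):
    """Estimate the delay (in seconds) of sig_a relative to sig_b
    by locating the peak of their cross-correlation."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    # Index 0 of the full correlation corresponds to a lag of
    # -(len(sig_b) - 1), so shift the argmax to recover the true lag.
    lag = np.argmax(corr) - (len(sig_b) - 1)
    return lag / sample_rate

# Synthetic check: a pulse that reaches the second microphone
# five samples after it reaches the first.
rate = 16000
pulse = np.zeros(100)
pulse[40] = 1.0
delayed = np.roll(pulse, 5)

delay = estimate_delay(delayed, pulse, rate)  # positive: 'delayed' lags 'pulse'
```

A real system would band-limit both signals to the 1-6 kHz range first and convert the estimated delay to a bearing using the known microphone spacing and the speed of sound.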
Research results show that the system performs satisfactorily if the speaker stands out from any clutter and faces somewhat toward the camera. Future research will seek to enhance the system with a speech recognizer and a human facial model.