A concept-based index for a collection of videos can be constructed using statistical detectors for a fixed set of semantic concepts. The resulting index covers only that limited set of concepts; however, it can be enhanced by using a corpus, such as the Brown corpus, together with WordNet to map a larger set of concepts onto the smaller set used to create the original index. This mapping is derived in part from the information content of each concept, which in turn is estimated from the corpus. The size and coverage of the corpus are therefore critical to the process.
The Brown corpus is dated, and its use has a number of deficiencies; in particular, many of the words in WordNet do not appear in it. This paper proposes two methods for constructing a new assignment of information content to WordNet concepts. In the first, a corpus is constructed by taking, for each concept in WordNet, the first ten documents retrieved by Google. In the second, the number of pages found by Google for each word serves as the basis for computing the information content of the word; thus, the Google knowledge base is in some sense the corpus.
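As a rough illustration of the second method, the sketch below turns per-word page counts into information content using the standard definition, the negative log of a word's relative frequency. The function name, the normalization by the sum of all counts, and the page counts themselves are assumptions made for illustration; the abstract does not specify how the counts are normalized, and the numbers are not real search results.

```python
import math

def information_content(counts):
    """Return -log(p) per word, where p = count / sum of all counts.

    Normalizing by the sum of the counts is an assumption of this sketch.
    Words with a zero count are skipped because -log(0) is undefined.
    """
    total = sum(counts.values())
    return {w: -math.log(c / total) for w, c in counts.items() if c > 0}

# Hypothetical page counts, for illustration only.
page_counts = {
    "entity": 8_000_000,
    "vehicle": 1_500_000,
    "car": 400_000,
    "ambulance": 100_000,
}

ic = information_content(page_counts)
for word, value in sorted(ic.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {value:.3f}")
```

Under this definition, words that appear on fewer pages receive higher information content, so a word missing from a small corpus can still receive an estimate as long as it has a nonzero page count.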
These approaches are used to construct enhanced concept indexes for the TREC Video Retrieval Evaluation (TRECVID) datasets; the resulting indexes perform substantially better than systems based on the Brown corpus. The effectiveness of the “pages retrieved” count strategy is a particularly striking example of the use of the Web as a resource for large-scale data.