Wikipedia is one of the noblest manifestations of the spirit of the Information Age. It is also one of the most astonishing. Who could have predicted in 1985 that in 20 years a significant fraction of the world’s population would have at their fingertips a free, high-quality, multilingual encyclopedia, of a size dwarfing the Britannica? That this encyclopedia was built largely by volunteer labor, with no institutional support and no advertisements, makes this notion more far-fetched than the existence of self-driving cars or voice-activated personal assistants.
Wikipedia has also become a major resource for artificial intelligence (AI) research of all kinds, because it combines a number of very useful features:
- It is immense and of high quality, and covers multiple topics.
- Its articles contain explicit presentations of basic information (in contrast to, for example, a corpus of news articles, which almost always assumes this information).
- It is semi-structured. First, the corpus is divided into articles, each of which deals with one specific concept. Second, a number of different features, such as hyperlinks, infoboxes, and category pages, present information in a form that is more nearly standardized than free text, and therefore much easier for automated systems to interpret. This feature is the focus of the papers that are the subject of this review.
- It is multilingual. Versions of Wikipedia exist in almost 300 different languages. In fact, Wikipedia often represents one of the largest high-quality online corpora for many languages that are otherwise underrepresented on the web.
- It was created by a vast number of users with minimal knowledge of computer technology, in contrast to handcrafted ontologies and knowledge bases (such as the CYC project), which are built slowly by expensive experts.
Taking advantage of these features, AI researchers have used Wikipedia as a data resource for a wide range of applications, including semantic relatedness, disambiguation, co-reference resolution, metonymy resolution, query expansion, multilingual retrieval, question answering, entity ranking, text categorization, and ontology and knowledge-base construction. There are now standard AI tasks that are defined in terms of Wikipedia, such as wikification (associating terms in a text with the corresponding Wikipedia article) and the automated construction of infoboxes. A number of knowledge bases that were automatically built from Wikipedia, particularly YAGO and DBpedia, are now themselves widely used tools.
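For readers unfamiliar with wikification, the following toy sketch may make the task concrete. It is not drawn from any of the papers under review: the `ANCHORS` dictionary and the `wikify` function are hypothetical, and the greedy longest-match lookup deliberately ignores disambiguation, which is the hard part of the real task.

```python
import re

# Hypothetical anchor dictionary: lowercase surface form -> article title.
# A real system would derive this from Wikipedia's link anchor texts.
ANCHORS = {
    "artificial intelligence": "Artificial intelligence",
    "wikipedia": "Wikipedia",
    "knowledge base": "Knowledge base",
}

# Longest surface form, in words, that the lookup needs to consider.
MAX_WORDS = max(len(key.split()) for key in ANCHORS)


def wikify(text):
    """Return (surface form, article title) pairs via greedy longest match."""
    tokens = re.findall(r"\w+", text)
    links = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n])
            title = ANCHORS.get(surface.lower())
            if title is not None:
                links.append((surface, title))
                i += n
                break
        else:
            # No phrase starting at this token matched; move on.
            i += 1
    return links


if __name__ == "__main__":
    print(wikify("Wikipedia is a resource for artificial intelligence research."))
```

A term such as "Java" could refer to an island, a programming language, or a coffee; choosing among candidate articles from context is what separates a practical wikifier from this dictionary lookup.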
Medelyan et al.’s paper is an exceptionally comprehensive and well-written survey of work in this area up to 2008, with a bibliography of approximately 150 research papers. Readers looking for an introduction to the subject should certainly start there.
The papers reviewed here constitute an entire volume of Artificial Intelligence devoted to a variety of more recent projects on this subject. As in most collections like this, the papers are uneven in quality and readability. This can make it difficult for a nonspecialist such as myself to extract the big picture of what has been accomplished and what the overarching issues are. Regrettably, some of the major research groups in this area are not represented, including Etzioni’s group at the University of Washington.
Three papers in the collection seemed to me particularly fine. The overview, “Collaboratively built semi-structured content and artificial intelligence: the story so far,” written by the volume’s editors, is an excellent complement to, and update of, Medelyan et al.’s paper, including both summaries of the papers in this collection and a general survey of work in the area. This paper also presents a series of “take-home messages,” high-level conclusions that provide guidance for future research. “YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia,” by Hoffart et al., shows how the knowledge base YAGO has been extended with spatial and temporal information derived from Wikipedia’s infoboxes. Particularly noteworthy is the paper’s appendix, which contains a collection of questions posed in English, the translation of each question into the query language of YAGO, and the quality of the result obtained. Finally, “An open-source toolkit for mining Wikipedia,” by Milne and Witten, describes the Wikipedia Miner toolkit, which appears to be a very valuable resource.
Overall, the collection is of very great value to both researchers in the area and readers with a general interest in AI. The editors and authors are to be congratulated on an important contribution to the literature on this promising, highly active area of research.