Language technology for cultural heritage : selected papers from the LaTeCH Workshop Series
Sporleder C., van den Bosch A., Zervanou K., Springer Publishing Company, Incorporated, New York, NY, 2011. 261 pp.  Type: Book (978-3-642202-26-1)
Date Reviewed: Nov 2 2011

To the humanities scholar studying texts, what is the value of computer technology? First and foremost, especially since the advent of the Web, it provides access to an immense and ever-increasing quantity of digitized primary and secondary material. Second, it allows for searching, using text and metadata, and browsing, both within collections and across the Web. Without getting up from one’s desk, a scholar can now find articles that in the past would have taken weeks of library work to locate.

Third, there is computational analysis over document collections. The range of purposes for which this is useful is incomparably narrower than that of the first two functionalities; however, where it does apply, it can be a very powerful tool. The central limitation is that, in the current state of the art and for the foreseeable future, reliable information can only be obtained about formal aspects of a text, not its meaning. Thus, one can use natural language processing (NLP) tools to obtain valuable statistical information about linguistic usage, especially if there is a corpus of documents whose linguistic structure is tagged. Also, the authorship of a text can be attributed using statistical stylistic criteria. (Historians can do a statistical study of historical data, but that is more akin to social science analysis.) But these kinds of analysis have little value for most scholars of literature, history, philosophy, and religion, and almost no value for nonacademic readers.

Thus, access and search are the most relevant to humanistic studies. The application of cutting-edge computer science (CS) research to these purposes is limited. I am not at all saying that the goal of digitizing a document collection and constructing a high-quality Web site for it is now attainable with off-the-shelf solutions; manifestly, it is not. However, reaching this goal is, generally, less a question of advanced computer technology than of thoughtful design involving the cooperation of domain specialists, librarians, and computer people. Great Web sites, like JSTOR and the Cambridge Shahnama Project, do not, as far as I can tell, rely on especially powerful computer techniques; instead, they are just superbly well designed.

As regards word-based search, the problem is that the individual researcher cannot compete with Google and Bing; it is almost always more effective to use an advanced Google search restricted to a site than to use the search engine provided by the site. (It may be possible to surpass Google for search in languages other than English; mixed results are reported [1,2].) The one aspect of developing Web sites for document collections that is really fruitful for technical CS research is the use of optical character recognition (OCR) and handwriting recognition to convert the documents to digital form.

Overall, if your objective is the electronic dissemination, preservation, and enhancement of texts as “cultural heritage,” then this is how your budget would be best spent: first, put a lot of content online, in searchable format, with reliably accurate metadata; and second, create high-quality Web sites for important collections.

For this reason, most of the papers in this collection are either unconvincing or not about cultural heritage. The volume begins with a rather Delphic foreword by Willard McCarty and a clear and helpful introduction by the editors. There are two papers on preprocessing: one on using OCR on a collection of 19th-century mountaineering yearbooks, and one on the automatic alignment of images of manuscripts with their transcripts. These are legitimately text-focused projects; they make the texts more available, or more usable.

The remaining papers deal with the application of NLP tools to text collections. The best of these explicitly address the interests of language specialists and linguists, for whom texts are primarily raw material for linguistic analysis. For example, Borin and Forsberg are working on a diachronic lexical resource for 800 years of Swedish, and Bamman and Crane are creating a treebank (a corpus of texts with linguistic tagging) for classical Greek and Latin literature.

The least convincing papers are those that hope to use NLP tools to gain insight into content. For example, Reiter et al. propose applying NLP tools to a corpus of documents dealing with religious rituals, both descriptive (by outsiders) and prescriptive (by priests). There is no reason to suppose that this is a particularly meaningful or interesting collection from a linguistic point of view, and even less reason to suppose that applying NLP tools to these texts will reveal anything important about the nature of rituals.

Finally, a note to the publisher: the last three papers are printed in eight-point type, while the rest are printed in ten-point type. I have rarely seen anything so sloppy in a printed book.

Reviewer: Ernest Davis
Review #: CR139550 (1204-0366)
1) Lewandowski, D. Problems with the use of Web search engines to find results in foreign languages. Online Information Review 32, 5 (2008), 668–672.
2) Bar-Ilan, J.; Gutman, T. How do search engines respond to some non-English queries? Journal of Information Science 31 (2005), 13–28.
