This edited book is an outcome of the European Language Grid (ELG), a European Union (EU)-funded project that formally ran from January 2019 to June 2022 (though the platform was updated the month this review was written, so the project certainly isn’t dead). The EU is unique in having, on the one hand, many languages with “official” status in one or more states and, on the other, many “minority” languages. One goal espoused by many is “language equality”: it shouldn’t matter whether one speaks a widely spoken language or a less widely spoken one. The book’s foreword is written by a champion of this goal: Jill Evans, a Welsh speaker. (Welsh is spoken by just over 500,000 people in Wales, according to the 2021 census, which is less than 20 percent of the population of Wales and less than one percent of the population of the United Kingdom.)
This diversity of languages means that Europe has a substantial language technology (LT) industry, but one consisting largely of disconnected small and medium enterprises. The project’s aim is to become “the primary platform and marketplace for LT in Europe.”
The book spans the range of the project, with parts on cloud infrastructure (five chapters), inventories of technologies and resources (and players) (three chapters), community (four chapters), and pilot projects (16 chapters, but less than a third of the book by page count).
Inevitably given the timeframe, but unhelpfully, the book has no conclusions, and the introductions are purely factual.
The technology part is split across several chapters and authors, which leads to a certain amount of repetition. But the overview (chapter 2) is a good introduction. The key point is that the platform is simultaneously a (metadata-driven) catalog, a repository, and an execution environment, aiming, it seems, at a “one-stop shop.” Once this vision is grasped, the rest of the chapter makes sense.
The next few chapters look at the system from different points of view: consumer, provider, and technologist. There is then a chapter on “Interoperable Metadata Bridges to the Wider Language Technology Ecosystem,” or to put it another way: “how do we avoid being another technology island?” The answer seems to be “one bridge at a time.” There are four case studies:
- CLARIN (Common Language Resources and Technology Infrastructure), which uses the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), for which ELG has a harvester;
- Hugging Face, which is deemed important enough to have a custom harvester; however, it can only import those records that meet ELG’s minimum requirement for license data, which is stricter than Hugging Face’s;
- Zenodo, which is a catalog with resources from heterogeneous sources and disciplines, and where the automatic import still looks like “work in progress”; and
- Various collaborative community resources, where the contributors seem to have helped with the input.
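For readers unfamiliar with OAI-PMH, the following is a minimal sketch of what a harvester does at the protocol level. It is not ELG's implementation, which the book does not reproduce: the function names, the `https://example.org/oai` endpoint, and the sample response are all hypothetical. Only the `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments, and the XML namespace come from the OAI-PMH 2.0 standard.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 standard.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    Per the protocol, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token = root.find(".//oai:resumptionToken", OAI_NS)
    next_token = token.text if token is not None and token.text else None
    return ids, next_token


# Invented sample response, standing in for a real repository's reply.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(parse_page(SAMPLE))
```

A real harvester would fetch each URL, call `parse_page` on the response, and repeat with the returned token until none is left. A filter like the one described for Hugging Face would then discard records lacking the required license metadata.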
The pilot projects are quite varied. I noted a medical case study corpus, in English, French, Italian, Spanish, and Basque, where substantial efforts were made to ensure that the annotated layers were roughly equal in size across languages. This should make it much more suitable for research than unbalanced corpora, though alas no such research is reported (one such study has appeared since).
Another good pilot project was the building of a voice assistant in Basque. As is noted, the technologies for speech recognition and generation in Basque do exist, but the market is too small to interest the big players. This one is built on the open-source Mycroft AI, and numerous ethical design decisions are described. There’s another virtual assistant project at the end, but no comparison between the two.
I learned a lot about the state of the art from reading this book, and some of the pilot projects gave me ideas for student projects.