Generative artificial intelligence (AI) systems, and large language models (LLMs) in particular, are generally trained on huge, predominantly English corpora gathered from public and private sources. But what about endangered languages, for example, Indigenous ones? Can generative LLMs help revive such languages, and what can they offer when prompted with context written in an endangered language? This book addresses many of these questions and tries to deliver some answers.
This comprehensive volume delves into the intersection of AI and the preservation of Indigenous and endangered languages. It is composed of 17 chapters, each focusing on a different aspect of language revitalization using AI and natural language processing (NLP). The book emphasizes technologies like machine learning, deep learning, and NLP, as well as their role in language preservation.
Part 1 explores the Kuvi, Hehe, Vedda, Ho, and Shi languages. Part 2 then deals with NLP and language analysis, including optical character recognition (OCR) for Indigenous scripts, the creation of parallel corpora (essential for training machine translation systems), and bilingual text collection and alignment. It also reviews crowdsourcing and the involvement of local communities in language preservation, and discusses ethical considerations related to cultural heritage and obtaining informed consent from language communities. Finally, Part 2 looks at AI learning tools that rely on digital archives and databases, and considers case studies such as the Dzongkha language.
The book covers a range of languages and regions, though with a noticeable focus on India and Africa. A clear takeaway from the book is that the success of AI-based language revitalization depends heavily on the availability of digital resources and data.
This book is a valuable contribution to the field of language preservation. It offers a detailed exploration of how AI and NLP can be leveraged to save endangered languages, backed by case studies and expert insights. However, the technical complexity and limited practical applications might pose challenges for some readers who are not involved in computational linguistics. Overall, the book is a significant resource for researchers, developers, and policymakers interested in the intersection of technology and language preservation.