Strategies for Corpus Development for Low-Resource Languages

March 2024



Abstract

Datasets, or corpora, are crucial ingredients in the development of any language technology project. In most cases, however, the lack of such resources is a major bottleneck, especially for low-resource languages. A low-resource language typically lacks the technological support needed to encode its script or language computationally. Even for languages with such support, the resources are sparse and lack benchmarking mechanisms, which raises questions about the validity of research and development built on them. It is high time that low-resource language communities develop specific short-, medium-, and long-term strategies to address these issues, so that research and development of language technologies for their languages can advance and, if not reach parity with high-resource languages, at least not fall too far behind them. This chapter explores the state of language computing for low-resource languages in Nepal, with a particular focus on the speech and machine translation domains. It also walks through experiences of working on Nepal's languages and considers how these could be leveraged or extended to align with the efforts under way for other low-resource languages in the region.


Keywords

NLP, AI, Deep Learning