Advances in Natural Language Generation for Indian Languages

Much of the recent progress in natural language generation (NLG) has come in the context of English and, more generally, high-resource languages. Indian languages have yet to see similar paradigm shifts, even though their speakers make up about a fifth of the world’s population. Two major constraints are data and compute, and in this talk I will touch on both. I will begin with our earliest work on IndicBART, which leveraged monolingual data and helped mitigate the resource scarcity of Indian languages, as measured on the IndicNLG benchmark. I will then highlight three recent works: two address data scarcity through mass crawling, cleaning, and synthetic data creation, and the third addresses compute scarcity by leveraging romanization alongside an existing strong English LLM. I hope this will lead to discussions that help push the boundary of language modeling and NLG for Indian languages.