Microsoft Research Africa is advancing Generative AI through our African Languages Research initiative, which focuses on improving Large Language Models (LLMs) for African languages. Our goal is to enhance language accessibility and ensure AI tools can effectively engage with Africa’s linguistic diversity, supporting both language preservation and the integration of African languages into the digital world.
Current work
The deployment of Large Language Models (LLMs) in real-world applications presents both opportunities and challenges, particularly in multilingual and code-mixed communication settings. At MSR Africa, we are currently evaluating the performance of leading LLMs, including OpenAI’s GPT series and models from Meta, Mistral AI, and Google. Our study uses a distinctive WhatsApp dataset originally collected as part of a health research project run by the University of Washington’s Global Health Department. The dataset consists of multilingual conversations in English, Swahili, and Sheng, including code-mixed messages, among young people living with HIV in informal settlements in Nairobi, Kenya. The conversations were captured in two health-focused WhatsApp chats, both moderated by a medical facilitator. We aim to assess how these LLMs handle culturally nuanced, multilingual, and code-mixed communication on a sentiment analysis task, with the goal of supporting the facilitator, for instance by flagging negative messages in the two WhatsApp chats.
Internal collaboration with MSR India
In collaboration with Microsoft Research Lab – India, we’ve introduced MEGAVERSE (NAACL2024 Publication), an extension of the MEGA (EMNLP2023 Publication) framework, to evaluate the non-English and multimodal capabilities of state-of-the-art LLMs. Covering 83 languages and utilizing multimodal datasets, our research shows that larger models perform better on low-resource languages, including African languages, and addresses the critical issue of data contamination in multilingual evaluations.
External collaboration with Masakhane NLP
MSR Africa has partnered with the Masakhane NLP community to advance research and development in Machine Translation for under-resourced African languages. Our joint efforts focus on enhancing the accessibility, accuracy, and cultural relevance of language technologies across the continent. Below are highlights of our key projects:
- AfriCOMET Development: Developed an innovative evaluation metric, AfriCOMET, and introduced AfroXLM-R, improving the evaluation and accuracy of translations for 13 under-resourced African languages (NAACL2024 Publication).
- Pre-Trained Models Adaptation: Created a new African news corpus to adapt pre-trained language models to 16 African languages, enhancing their utility with fine-tuned, high-quality translation data (NAACL2022 Publication).
- Oshiwambo Language Resources: Developed the largest Oshindonga-English parallel corpus to date, promoting both technological advancement and cultural preservation through enhanced machine translation capabilities (AfricaNLP workshop at ICLR2022 Publication).