Project Mélange

Understanding MixEd LANguaGE and Code-mixing

Note: This research project has reached its conclusion. These pages are maintained for reference and archival purposes.

The goal of Project Mélange is to understand the uses of code-mixing and to build tools around it. Multilingual communities exhibit code-mixing, that is, the mixing of two or more socially stable languages in a single conversation, sometimes even in a single utterance. This phenomenon has been widely studied by linguists and interaction scientists in the spoken language of such communities. However, with the prevalence of social media and other informal interactive platforms, code-switching is now also ubiquitously observed in user-generated text. Because multilingual communities are more the norm than the exception from a global perspective, it is essential that language technologies and intelligent agents handle code-switched text and speech adequately.

Project Mélange aims to analyze and understand code-switching behavior at two levels: first, the formal structural level that dictates the grammar of such a construct, and second, the functional level that motivates its use from a cognitive, pragmatic and socio-cultural perspective. This in turn would allow us to process mixed language and to better model conversations and dialogues in a multilingual setting. In addition, Project Mélange aims to equip speech and language processing systems with the capabilities of processing, understanding and generating code-switched language. Towards this goal, we have worked on language identification from text, part-of-speech tagging, parsing, sentiment analysis, machine translation, speech recognition and speech synthesis. Code-mixed speech and language processing is a low-resource problem, often without adequate data available even for bootstrapping. Our philosophy in solving these problems is to re-use existing monolingual data and resources as much as possible.

In 2020, we proposed the first benchmark for code-switching, GLUECoS, which spans 7 NLP tasks in English-Hindi and English-Spanish. We also released the first code-switched NLI dataset in Hindi-English based on Bollywood movie dialogues. Although we continue working on various aspects of multilingual systems and code-switching, the primary focus of our group is on Project ELLORA and Project LITMUS.