About
Hello, and thanks for stopping by! I’m a Principal Researcher at Microsoft Research India where I work on Natural Language Processing. My research interests are broadly in democratizing AI, and my current focus is on making LLMs more inclusive of more languages and cultures. I’m excited about translating long-term and cutting-edge research in data, modeling and evaluation into large-scale real-world product improvements. I thrive in leading and working with diverse, interdisciplinary teams and regularly work with social scientists, linguists, designers and product groups.
I am the Director of the MSR India Research Fellow program. The RF program is a unique opportunity for students who are considering pursuing a career in research. The RF program exposes bright young minds to cutting-edge research and provides mentorship from some of the top researchers in the world. My goal is to provide the best possible experience for RFs during their stint at MSRI to help them realize their potential.
I am active in the NLP research community and have organized several special sessions (Interspeech 2016-2018, IWSDS 2023), workshops (code-switching 2020, 2021, 2023, evaluation – 2022) and given many invited talks in academia and industry. I was co-chair of the Industry track at ACL 2023 and a Senior Area Chair of the Multilingualism track at ACL 2023 and EMNLP 2023.
For up-to-date information about publications, please take a look at my Google Scholar page.
I have been fortunate to work with many wonderful interns and Research Fellows who inspire me and keep me on my toes! In reverse chronological order:
Sanchit Ahuja (current RF), Divyanshu Aggrawal (current RF), Ishaan Watts (current intern), Ashutosh Sathe (current intern), Prachi Jain (current PostDoc), Kabir Ahuja (PhD at University of Washington), Krithika Ramesh (PhD at Johns Hopkins University), Shrey Pandit (MS at UT Austin), Abhinav Rao (RF @ Microsoft Turing -> MS at CMU), Aniket Vashishtha (RF @ MSRI), Shaily Bhat (RF @ Google Research -> PhD at CMU), Simran Khanuja (RF @ Google Research -> PhD @ Carnegie Mellon University), Anirudh Srinivasan (MS @ UT Austin), Sanket Shah (Salesken.ai), Brij Mohan Lal Srivastava (PhD @ INRIA -> Nijta (startup)), Sunit Sivasankaran (PhD @ INRIA -> Microsoft), Sai Krishna Rallabandi (PhD @ CMU -> Fidelity).
*NEWS*
I will be giving a keynote at the Multilingual Representation Learning (MRL) workshop at EMNLP 2023
Information about the CALCS code-switching workshop at EMNLP 2023 is here: https://code-switching.github.io/2023
I am organizing the GenAI workshop @ AIMLSystems 2023
We presented a comprehensive tutorial on Multilingual Language Models at ACL 2023 http://approjects.co.za/?big=en-us/research/event/acl-2023-multilingual-models-tutorial/
I’m serving as SAC for ACL 2023 and EMNLP 2023 and Industry Track co-chair for ACL 2023
*OLDER NEWS*
Please consider submitting a paper to the Languages special issue on “Interdisciplinary Approaches to Data Collection, Annotation and Computational Processing of Code-Switched Languages around the World”. You can find more details here.
I was invited to be a speaker at the VAIBHAV summit organized by the Govt. of India for the AI/ML Speech Understanding panel.
I was part of the Students Meet Experts session at Interspeech 2020 organized by the ISCA-SAC.
Our survey paper on code-switching, which covers more than 250 papers, is now available on arXiv.
Code and Datasets
Our code for evaluation of multilingual systems, LITMUS Predictor, is now open source. Please also check out the LITMUS Predictor demo here.
Our benchmark for evaluating code-switched NLP called GLUECoS is now open source, along with scripts for pre-processing 11 code-switched datasets! Get the code here.
We built the first code-switched NLI dataset using Bollywood movie data as premises. Check out the paper and data here. We also released a tool for Language Identification from text here.
Code-switched data for the Language Identification shared task organized as part of the First Workshop on Speech Technologies for Code-switching for Multilingual Communities is now available for research use.
I also organized a shared task on ASR for low-resource languages in a special session at Interspeech 2018. As part of this challenge, we released data from three low-resource Indian languages, which is now available for research use.
Prior to coming to MSR India
I finished my PhD in 2015 at the Language Technologies Institute, Carnegie Mellon University. I worked on Text-to-Speech systems with my advisor Alan W Black, and my thesis was on pronunciation modeling for low-resource languages. From 2010 to 2012, I was a Master's student at CMU with Jack Mostow, and I worked on children’s oral reading prosody. I also interned with Microsoft Research India in Summer 2012, where we built a low-vocabulary ASR system for farmers in rural central India.