Making Sentence Embeddings Robust to User-Generated Content
This seminar was hosted by Microsoft Research Africa, Nairobi together with the Microsoft AI for Good team in May 2024.
User-generated content (UGC), such as social media posts written in “Internet language”, exhibits considerable lexical variation and deviates from standard language. As a result, NLP models trained mostly on standard text are known to perform poorly on UGC, and sentence embedding models like LASER are no exception.
In this talk, we focus on the robustness of LASER to UGC data. We measure this robustness as LASER’s ability to place non-standard sentences and their standard counterparts close to each other in the embedding space. Inspired by previous work extending LASER to other languages and modalities, we propose RoLASER, a robust English encoder trained using a teacher-student approach to reduce the distances between the representations of standard and UGC sentences. We also use data augmentation to generate synthetic UGC-like training data.
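The two training ingredients described above can be illustrated with a minimal sketch. It assumes mean squared error as the distance between student and teacher embeddings, and it uses a tiny hypothetical abbreviation table plus adjacent-character swaps as stand-ins for the augmentation. The names `ABBREVIATIONS`, `make_ugc_like` and `distillation_loss` are illustrative only and are not taken from the RoLASER code.

```python
import random

# Hypothetical abbreviation table; real UGC augmentation covers many more phenomena.
ABBREVIATIONS = {"you": "u", "are": "r", "see": "c", "tomorrow": "2moro", "please": "plz"}


def make_ugc_like(sentence, rng, typo_prob=0.1):
    """Turn a standard sentence into a synthetic UGC-like variant."""
    words = []
    for word in sentence.split():
        lower = word.lower()
        if lower in ABBREVIATIONS:
            # Replace the word with a social-media-style abbreviation.
            words.append(ABBREVIATIONS[lower])
        elif len(word) > 3 and rng.random() < typo_prob:
            # Simulate a keyboard typo by swapping two adjacent inner characters.
            i = rng.randrange(1, len(word) - 2)
            words.append(word[:i] + word[i + 1] + word[i] + word[i + 2:])
        else:
            words.append(word)
    return " ".join(words)


def distillation_loss(student_vec, teacher_vec):
    """Mean squared error between a student embedding and the frozen teacher's embedding."""
    assert len(student_vec) == len(teacher_vec)
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(teacher_vec)


rng = random.Random(0)
print(make_ugc_like("see you tomorrow", rng))
print(distillation_loss([1.0, 2.0], [1.0, 4.0]))
```

In a training loop built on this idea, the teacher would encode the standard sentence, the student would encode the output of `make_ugc_like` for the same sentence, and minimising `distillation_loss` would pull the two representations together.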
We show that RoLASER significantly improves LASER’s robustness to both natural and artificial UGC data, achieving up to 2× and 11× better alignment scores, respectively. We also perform a fine-grained analysis on artificial UGC data and find that our model greatly outperforms LASER on its most challenging UGC phenomena, such as keyboard typos and social media abbreviations. Evaluation on downstream tasks shows that RoLASER performs comparably to or better than LASER on standard data, while consistently outperforming it on UGC data.
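The idea of standard and UGC sentences lying close to each other in the embedding space can be made concrete with a small sketch. It assumes cosine distance averaged over sentence pairs as the closeness measure. The abstract does not say which alignment score is used, so this is one plausible choice, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(standard_vecs, ugc_vecs):
    """Average cosine distance between each standard embedding and its UGC counterpart."""
    pairs = list(zip(standard_vecs, ugc_vecs))
    return sum(cosine_distance(s, u) for s, u in pairs) / len(pairs)


# Toy embeddings: the first pair is identical, the second pair is orthogonal.
standard = [[1.0, 0.0], [1.0, 0.0]]
ugc = [[1.0, 0.0], [0.0, 1.0]]
print(mean_pairwise_distance(standard, ugc))
```

Under this measure, a lower average distance means the encoder treats a UGC sentence more like its standard counterpart, so a more robust encoder would yield a smaller value on the same sentence pairs.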
Speaker Details
Lydia Nishimwe is a third-year PhD candidate at Inria and Sorbonne Université (France) under the supervision of Dr. Rachel Bawden and Dr. Benoît Sagot. Her thesis is centered around making NLP models robust to user-generated content (UGC), specifically posts from social media websites (e.g. Twitter, Reddit). She has published works on the lexical normalisation of UGC and on UGC-robust sentence embeddings, and is currently exploring the robust machine translation of UGC.
- Speakers:
- Lydia Nishimwe
- Affiliation:
- Inria and Sorbonne Université