Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique
- Shruti Rijhwani
- Royal Sequiera
- Monojit Choudhury
- Kalika Bali
- Chandra Maddila
Proc. of ACL 2017
Published by ACL
Word-level language detection is necessary for analyzing code-switched text, where multiple languages may be mixed within a single sentence. Existing models are restricted to code-switching between two specific languages and fail in real-world scenarios, because input text rarely comes with a priori information about the languages used. We present a novel unsupervised word-level language detection technique for code-switched text that handles an arbitrarily large number of languages and does not require any manually annotated training data. Our experiments with tweets in seven languages show a 74% relative error reduction in word-level labeling with respect to competitive baselines. We then use this system to conduct a large-scale quantitative analysis of code-switching patterns on Twitter, both global and region-specific, with 58M tweets.
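
For readers unfamiliar with the metric, relative error reduction is conventionally computed as the share of a baseline's error that a system eliminates. The sketch below shows only that arithmetic; the function name and the error rates are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_error: float, system_error: float) -> float:
    """Return the fraction of the baseline's error that the system removes."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - system_error) / baseline_error


# Hypothetical word-level error rates, chosen only to illustrate the
# arithmetic; they are NOT figures reported in the paper.
baseline = 0.20
system = 0.05

rer = relative_error_reduction(baseline, system)
print(f"Relative error reduction: {rer:.0%}")
```

A relative reduction is measured against the baseline's error, so it is larger than the absolute drop in error rate: in this made-up case the absolute drop is 15 percentage points.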