Generative Models of Noisy Translations with Applications to Parallel Fragment Extraction

Chris Quirk; Raghavendra Udupa; Arul Menezes

Proceedings of MT Summit XI | September 2007

Published by European Association for Machine Translation

The development of broad-domain statistical machine translation systems is gated by the availability of parallel data. A promising strategy for mitigating data scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment extraction approaches have shown that including parallel fragments in SMT training data can significantly improve translation quality. We describe efficient and effective generative models for extracting fragments, and demonstrate that these algorithms produce competitive improvements on cross-domain test data without suffering in-domain degradation, even at very large scale.