CodeRetriever: A Large Scale Contrastive Pre-Training Method for Code Search

  • Xiaonan Li,
  • Yelong Shen,
  • Xipeng Qiu,
  • Hang Zhang,
  • Bolun Yao,
  • Weizhen Qi,
  • Daxin Jiang (姜大昕),
  • Nan Duan

EMNLP 2022

In this paper, we propose the CodeRetriever model, which learns function-level code semantic representations through large-scale code-text contrastive pre-training. We adopt two contrastive learning schemes in CodeRetriever: unimodal contrastive learning and bimodal contrastive learning. For unimodal contrastive learning, we design an unsupervised approach that builds semantically related code-code pairs based on documentation and function names. For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build code-text pairs. Both contrastive objectives can fully leverage a large-scale code corpus for pre-training.
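The paragraph above describes the two training signals only at a high level. The sketch below illustrates how unimodal (code-code) and bimodal (code-text) objectives of this kind can be combined as contrastive losses with in-batch negatives. It is a minimal illustration under stated assumptions, not the paper's implementation: the `encoder`, the `temperature` value, and the function names are all hypothetical.

```python
# Minimal sketch: combining a unimodal (code-code) and a bimodal (code-text)
# contrastive objective with in-batch negatives. Assumes `encoder` maps a batch
# of token ids to (batch_size, dim) embeddings; all names here are illustrative.
import torch
import torch.nn.functional as F

def info_nce(anchor_emb, positive_emb, temperature=0.05):
    """In-batch-negative contrastive loss over L2-normalized embeddings."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)                  # diagonal entries are the positives

def contrastive_pretraining_step(encoder, code_a, code_b, code, text):
    """One hypothetical pre-training step combining both objectives.

    code_a / code_b : semantically related code pairs (unimodal signal,
                      e.g. mined via shared documentation or function names)
    code / text     : code paired with its documentation or in-line comments
                      (bimodal signal)
    """
    unimodal_loss = info_nce(encoder(code_a), encoder(code_b))
    bimodal_loss = info_nce(encoder(code), encoder(text))
    return unimodal_loss + bimodal_loss
```

In-batch negatives are a common choice in contrastive representation learning; the paper's actual pair construction, negative sampling, and loss weighting may differ from this sketch.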
Extensive experimental results show that CodeRetriever achieves new state-of-the-art performance, with significant improvements over existing code pre-trained models, on eleven domain- and language-specific code search tasks spanning six programming languages and different code granularities (function-level, snippet-level, and statement-level). These results demonstrate the effectiveness and robustness of CodeRetriever.