The Code Intelligence project aims to leverage AI techniques to help software developers improve their productivity. We focus on building large-scale pre-trained models that understand and generate source code. Our research directions include pre-trained models for code, benchmark datasets, code completion, code retrieval, code review, and more. Further AI-assisted products, built in collaboration with DevDiv, GitHub, and LinkedIn, will be released to empower software developers all over the world.
What have we done?
- We have proposed several pre-trained models for source code, including CodeBERT, GraphCodeBERT, and UniXcoder.
- CodeBERT is the first bimodal pre-trained model for programming language and natural language.
- GraphCodeBERT, based on CodeBERT, leverages a semantic-level structure of code, i.e., data flow, in the pre-training stage.
- UniXcoder is a unified cross-modal pre-trained model for programming language that incorporates semantic and syntactic information from code comments and ASTs.
- We have established CodeXGLUE, a benchmark for code intelligence that comprises a collection of code intelligence tasks and a platform for model evaluation and comparison. CodeXGLUE includes 14 datasets covering 10 diversified code intelligence tasks.
- Besides the general pre-trained models and datasets, we also explore specific code scenarios in depth, including code completion, code search, and code review. For code completion, we have developed eWASH, which uses extended context for code completion; Grammformer, which learns to complete code with sketches; and ReACC, a retrieval-augmented framework. We have also developed CodeReviewer for automating code review activities such as review comment generation and code refinement.
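To make the data flow structure used by GraphCodeBERT concrete, the sketch below extracts variable-level def-use edges from a Python snippet, linking each variable read to the most recent line where that name was assigned. This is a minimal single-language illustration using Python's standard `ast` module, not the actual GraphCodeBERT extraction pipeline; the function name `dataflow_edges` is ours.

```python
import ast

def dataflow_edges(source):
    """Return (name, use_line, def_line) edges: each variable read is
    linked to the most recent line where that name was assigned."""
    names = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Name)]
    # Within a line, process reads before writes, since the right-hand
    # side of an assignment is evaluated before the name is bound.
    names.sort(key=lambda n: (n.lineno, isinstance(n.ctx, ast.Store)))
    last_def, edges = {}, []
    for n in names:
        if isinstance(n.ctx, ast.Store):
            last_def[n.id] = n.lineno          # record the definition site
        elif isinstance(n.ctx, ast.Load) and n.id in last_def:
            edges.append((n.id, n.lineno, last_def[n.id]))
    return edges

# For "x = 1; y = x + 2; x = y" (one statement per line), the reads of
# x on line 2 and y on line 3 link back to their defining lines 1 and 2.
print(dataflow_edges("x = 1\ny = x + 2\nx = y"))
```

Edges of this kind, drawn between "where a value comes from" and "where it is used", are the semantic-level structure GraphCodeBERT attends over during pre-training.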
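The retrieval-augmented idea behind ReACC can be sketched in a few lines: given unfinished code, retrieve the most similar snippet from a corpus and hand it to the generator as extra context. The sketch below uses plain whitespace-token Jaccard similarity as a stand-in retriever; the real framework uses stronger sparse and dense retrievers, and the function names here are ours.

```python
def jaccard(a, b):
    """Lexical similarity between two code strings over whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(query, corpus, k=1):
    """Return the k corpus snippets most similar to the unfinished code."""
    return sorted(corpus, key=lambda s: jaccard(query, s), reverse=True)[:k]

corpus = [
    "def add(a, b): return a + b",
    "with open(path) as f: data = f.read()",
]
# An unfinished "add"-like function retrieves the arithmetic snippet,
# which a completion model could then condition on.
print(retrieve("def add(x, y): return x + y", corpus, k=1))
```

Swapping the lexical scorer for an embedding-based one changes only `jaccard`; the retrieve-then-generate structure stays the same.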