One-to-many testing for code generation from (just) natural language

MBPP is a popular dataset for evaluating models on the task of code generation. Despite its popularity, the original MBPP has three problems: (1) reliance on provided test cases to generate the right function signature, (2) contamination, because its exact phrasing appears in training datasets, and (3) poor alignment between the instruction and the evaluation test cases. To overcome these, we create MBUPP by adapting the popular MBPP dataset for code generation from natural language to emphasize the natural language aspect: generated code is evaluated against multiple sets of assertions. Additionally, we update the text descriptions to remove ambiguity and instructions that are not evaluated by the assertions, such as which specific algorithm to use. The adapted dataset resolves the challenges around contamination, ambiguity, and test case alignment. Further, we compare popular open- and closed-weight models on the original (MBPP) and adapted (MBUPP) datasets.
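
A minimal sketch of the one-to-many evaluation idea follows; the exec-based harness and all names (passes_assertion_set, one_to_many_pass) are illustrative assumptions, not the paper's actual implementation. A candidate program counts as correct if it passes every assertion in at least one of the accepted assertion sets, which allows, for example, different but valid function signatures.

```python
# Illustrative sketch: one-to-many testing accepts a candidate if ANY of
# several assertion sets passes in full. No sandboxing or timeouts here.
from typing import List


def passes_assertion_set(candidate_code: str, assertions: List[str]) -> bool:
    """Return True if the candidate passes all assertions in one set."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)      # define the candidate function(s)
        for assertion in assertions:
            exec(assertion, namespace)       # e.g. "assert add(1, 2) == 3"
        return True
    except Exception:
        return False


def one_to_many_pass(candidate_code: str, assertion_sets: List[List[str]]) -> bool:
    """Correct if at least one accepted assertion set passes."""
    return any(passes_assertion_set(candidate_code, s) for s in assertion_sets)


# Hypothetical example: two assertion sets accepting different signatures.
candidate = "def add(a, b):\n    return a + b\n"
assertion_sets = [
    ["assert add(1, 2) == 3", "assert add(0, 0) == 0"],  # positional arguments
    ["assert add(x=1, y=2) == 3"],                       # alternative keyword signature
]
print(one_to_many_pass(candidate, assertion_sets))  # True (first set passes)
```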