{"id":482808,"date":"2018-05-01T08:21:42","date_gmt":"2018-05-01T15:21:42","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=482808"},"modified":"2018-05-30T11:48:53","modified_gmt":"2018-05-30T18:48:53","slug":"learning-source-code","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/learning-source-code\/","title":{"rendered":"Learning from Source Code"},"content":{"rendered":"
Over the last five years, deep learning-based methods have revolutionised a wide range of applications, for example those requiring understanding of pictures (opens in new tab)<\/span><\/a>, speech (opens in new tab)<\/span><\/a> and natural language (opens in new tab)<\/span><\/a>. For computer scientists, a naturally arising question is whether computers learn to understand source code? It appears to be a trivial question at first glance because programming languages indeed are designed to be understood by computers. However, many software bugs are in fact instances of Do what I mean, not what I wrote. In other words, small typos can have big consequences.<\/p>\n Consider a simple example such as:<\/p>\n In this example, the problem is obvious to a human, or a system that understands the meaning of the terms \u201cheight\u201d and \u201cwidth\u201d. The key insight here is that source code serves two functions. First, it communicates to the computer precisely which hardware instructions to execute. Second, it communicates to other programmers (or to the authors themselves six weeks later) how the program works. The latter is achieved by the choice of names, code layout and code comments. By identifying cases in which the two communication channels seem to diverge, an automatic system can point to likely software bugs.<\/p>\n Program analysis in the past has largely focused on either the formal, machine-interpretable semantics of programs or it has viewed programs as a (somewhat odd) instance of natural language. Approaches from the former are rooted in mathematical logic (opens in new tab)<\/span><\/a> and require extensive engineering effort for every new case that needs to be handled. 
On the other hand, natural language approaches apply <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-researchers-achieve-new-conversational-speech-recognition-milestone\/\">natural language processing<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> tools, which work well on purely syntactic tasks but have so far not been able to learn the semantics of programs.<\/p>\n<p>In a <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/learning-represent-programs-graphs\/\">new paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> presented at <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/event\/iclr-2018\/\">ICLR 2018<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, researchers from Microsoft Research and Simon Fraser University Vancouver describe a way to combine these two worlds and show how to find bugs in released software.<\/p>\n<h3>Program Graphs<\/h3>\n<p>So that a model can learn from the rich structure of source code, the code is first transformed into a <em>program graph<\/em>. The nodes of the graph include the tokens of the program (that is, variables, operators, method names, and so on) and the nodes of its abstract syntax tree (elements from the grammar defining the language syntax, such as <em>IfStatement<\/em>). The program graph contains two different types of edges: syntactic edges, which represent only how the code should be parsed (for example, the structure of while loops and if blocks), and semantic edges, which result from simple program analyses.<\/p>
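<p>The structure described above can be sketched in a few lines of Python. This is only a minimal, hand-built illustration and not the representation used in the paper: the <code>ProgramGraph<\/code> class, the edge-type names (<code>child<\/code>, <code>next_token<\/code>, <code>last_use<\/code>), the <code>AssignmentStatement<\/code> label and the two-statement example program are all assumptions made for this sketch.<\/p>

```python
from dataclasses import dataclass, field

# Edge-type names are illustrative assumptions, not taken from the post.
SYNTACTIC = {'child', 'next_token'}   # how the code is parsed
SEMANTIC = {'last_use'}               # result of a simple program analysis


@dataclass
class ProgramGraph:
    nodes: list = field(default_factory=list)  # (kind, label) pairs
    edges: list = field(default_factory=list)  # (source, edge_type, target)

    def add_node(self, kind, label):
        # kind is 'token' or 'ast'; returns the index of the new node
        self.nodes.append((kind, label))
        return len(self.nodes) - 1

    def add_edge(self, source, edge_type, target):
        if edge_type not in SYNTACTIC | SEMANTIC:
            raise ValueError('unknown edge type: ' + edge_type)
        self.edges.append((source, edge_type, target))

    def edges_of(self, family):
        # Return all edges whose type belongs to the given family
        return [e for e in self.edges if e[1] in family]


def build_example():
    # Hand-built graph for the hypothetical program:  x = 1; y = x;
    g = ProgramGraph()
    tokens = ['x', '=', '1', ';', 'y', '=', 'x', ';']
    token_ids = [g.add_node('token', t) for t in tokens]

    # Syntactic edges: link each token to the token that follows it
    for left, right in zip(token_ids, token_ids[1:]):
        g.add_edge(left, 'next_token', right)

    # Syntactic edges: one AST node per statement, linked to its tokens
    for start in (0, 4):
        stmt = g.add_node('ast', 'AssignmentStatement')
        for tok in token_ids[start:start + 4]:
            g.add_edge(stmt, 'child', tok)

    # Semantic edge: the second occurrence of x points to the previous one
    g.add_edge(token_ids[6], 'last_use', token_ids[0])
    return g


if __name__ == '__main__':
    graph = build_example()
    print(len(graph.nodes), 'nodes')
    print(len(graph.edges_of(SYNTACTIC)), 'syntactic edges')
    print(len(graph.edges_of(SEMANTIC)), 'semantic edges')
```

<p>Keeping the edge type on every edge lets a downstream model treat syntactic and semantic relations differently. A real pipeline would derive both families from a parser and a data-flow analysis instead of writing them by hand.<\/p>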
Program Graphs<\/h3>\n