{"id":699139,"date":"2020-10-20T10:18:59","date_gmt":"2020-10-20T17:18:59","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=699139"},"modified":"2020-10-20T13:44:06","modified_gmt":"2020-10-20T20:44:06","slug":"a-holistic-representation-toward-integrative-ai","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/a-holistic-representation-toward-integrative-ai\/","title":{"rendered":"A holistic representation toward integrative AI"},"content":{"rendered":"\n

At Microsoft, we have been on a quest to advance AI beyond existing techniques by taking a more holistic, human-centric approach to learning and understanding. As Chief Technology Officer of Azure AI Cognitive Services, I have been working with a team of amazing scientists and engineers to turn this quest into a reality.

In my role, I enjoy a unique perspective on the relationship among three attributes of human cognition: monolingual text (X), audio or visual sensory signals (Y), and multilinguality (Z). At the intersection of all three, there is magic: what we call XYZ-code, as illustrated in Figure 1, a joint representation to create more powerful AI that can speak, hear, see, and understand humans better. We believe XYZ-code will enable us to fulfill our long-term vision: cross-domain transfer learning, spanning modalities and languages. The goal is to have pretrained models that can jointly learn representations to support a broad range of downstream AI tasks, much in the way humans do today.

Over the past five years, we have achieved human performance on benchmarks in conversational speech recognition, machine translation, conversational question answering, machine reading comprehension, and image captioning. These five breakthroughs provided us with strong signals toward our more ambitious aspiration to produce a leap in AI capabilities: multisensory and multilingual learning that is more closely in line with how humans learn and understand. I believe the joint XYZ-code is a foundational component of this aspiration, if grounded with external knowledge sources in downstream AI tasks.

\"diagram,
Figure 1: XYZ<\/strong>-code for delivering a leap in AI capabilities. We can derive more powerful representations by intersecting X, Y, and Z.<\/figcaption><\/figure><\/div>\n\n\n\n

X-code: Text representation from big data

The quest to achieve a universal representation of monolingual text is our X-code. As early as 2013, we sought to maximize the information-theoretic mutual information between text-based Bing search queries and related documents through semantic embeddings, using what we called X-code. X-code improved Bing search tasks and confirmed the relevance of text representations trained from big data. X-code shipped in Microsoft Bing without the architecture, illustrated in Figure 2, being published. That push has since been modernized with Transformer-based neural models such as BERT, Turing, and GPT-3, which have significantly advanced text-based monolingual pretraining for natural language processing.

X-code maps queries, query terms, and documents into a high-dimensional intent space. By maximizing the information-theoretic mutual information of these representations over 50 billion unique query-document pairs as training data, X-code successfully learned the semantic relationships among queries and documents at web scale, and it demonstrated strong performance in various natural language processing tasks such as search ranking, ad click prediction, query-to-query similarity, and document grouping.
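The X-code architecture itself was never published, so the following is only a hypothetical toy sketch of the general idea: a two-tower (dual-encoder) model that embeds queries and documents into a shared intent space and trains with an InfoNCE-style contrastive loss, which maximizes a lower bound on the mutual information between the paired representations. The tower design, dimensions, and random training data below are stand-ins, not the production system.

```python
# Hypothetical sketch (not the actual, unpublished X-code architecture): a
# two-tower embedding model trained with an InfoNCE-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256):
        super().__init__()
        # Separate towers map queries and documents into a shared intent space.
        self.query_tower = nn.EmbeddingBag(vocab_size, embed_dim)
        self.doc_tower = nn.EmbeddingBag(vocab_size, embed_dim)

    def forward(self, query_tokens, doc_tokens):
        q = F.normalize(self.query_tower(query_tokens), dim=-1)
        d = F.normalize(self.doc_tower(doc_tokens), dim=-1)
        return q, d

def info_nce_loss(q, d, temperature: float = 0.05):
    # Clicked (query, document) pairs are positives; every other document in
    # the batch serves as an in-batch negative.
    logits = q @ d.t() / temperature        # [batch, batch] similarity matrix
    targets = torch.arange(q.size(0))       # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Toy usage with random token IDs standing in for query and document terms.
model = TwoTowerEncoder(vocab_size=10_000)
queries = torch.randint(0, 10_000, (32, 8))   # 32 queries, 8 tokens each
docs = torch.randint(0, 10_000, (32, 64))     # their clicked documents
q, d = model(queries, docs)
loss = info_nce_loss(q, d)
loss.backward()
```

In a sketch like this, ranking at serving time reduces to a nearest-neighbor lookup of document embeddings against the query embedding, which is what makes the shared intent space useful for tasks such as search ranking and query-to-query similarity.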

\"Key:<\/a>
Figure 2: In 2013, X-code aimed at maximizing information-theoretic mutual information for improved semantic text representation at scale. It is trained using search engine click log by query-URL joint optimization. X-code captures the similarity of words and web documents in embedding space, and it can be used for various NLP tasks. Diagram from the original 2013 architecture.<\/figcaption><\/figure>\n\n\n\n

Y-code: Adding the power of visual and audio sensory signals

Our pursuit of sensory-related AI is encompassed within Y-code. With Y referring to either audio or visual signals, joint optimization of the X and Y attributes can help with image captioning, speech recognition, form recognition, and OCR. With the joint XY-code, or simply Y-code, we aim to optimize text and audio or visual signals together.

Our work with Y-code recently surpassed human performance in image captioning on the nocaps benchmark, as illustrated in Figure 3 and described in this novel object captioning blog post. With this architecture, we were able to identify novel objects from visual information and add a layer of language understanding to compose a sentence describing the relationships between them. In many cases, the captions are more accurate than the descriptions people write. These gains in model quality further demonstrate that the intersection of the X and Y attributes can give us additional horsepower for downstream AI tasks.
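The captioning system described in that post builds on large-scale vision-and-language pretraining; as a much simpler illustration of joint X-Y optimization, the sketch below conditions a small text decoder on pooled image features and trains both under a single captioning loss. The encoder, decoder, and feature dimensions here are hypothetical stand-ins, not the model that achieved the nocaps result.

```python
# Hypothetical sketch (not the production captioning model): a minimal
# encoder-decoder that conditions a text decoder on visual features,
# illustrating joint optimization of visual (Y) and text (X) signals.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size: int, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        # Project pooled visual features into the decoder's hidden space.
        self.visual_proj = nn.Linear(feat_dim, hidden)
        self.token_embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image_feats, caption_tokens):
        # image_feats: [batch, feat_dim], e.g. from any pretrained vision backbone.
        h0 = torch.tanh(self.visual_proj(image_feats)).unsqueeze(0)  # initial state
        x = self.token_embed(caption_tokens)                         # [B, T, H]
        out, _ = self.decoder(x, h0)
        return self.out(out)                                         # token logits

# Toy usage: predict each next caption token given the image and the prefix.
model = TinyCaptioner(vocab_size=5_000)
feats = torch.randn(4, 512)                     # pooled features for 4 images
captions = torch.randint(0, 5_000, (4, 12))     # 12-token reference captions
logits = model(feats, captions[:, :-1])         # teacher forcing on the prefix
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 5_000), captions[:, 1:].reshape(-1)
)
loss.backward()
```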
