{"id":584920,"date":"2019-05-08T09:59:17","date_gmt":"2019-05-08T16:59:17","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=584920"},"modified":"2019-06-26T14:16:42","modified_gmt":"2019-06-26T21:16:42","slug":"spacefusion-structuring-the-unstructured-latent-space-for-conversational-ai","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/spacefusion-structuring-the-unstructured-latent-space-for-conversational-ai\/","title":{"rendered":"SpaceFusion: Structuring the unstructured latent space for conversational AI"},"content":{"rendered":"<p>A palette makes it easy for painters to arrange and mix paints of different colors as they create art on the canvas before them. Having a similar tool that could allow AI to jointly learn from diverse data sources such as those for conversations, narratives, images, and knowledge could open doors for researchers and scientists to develop AI systems capable of more general intelligence.<\/p>\n<div id=\"attachment_584962\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/neural_response_generation_figure_0.png.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-584962\" class=\"wp-image-584962\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/neural_response_generation_figure_0.png.jpg\" alt=\"A palette allows a painter to arrange and mix paints of different colors. 
SpaceFusion seeks to help AI scientists do similar things for different models trained on different datasets.\" width=\"300\" height=\"385\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/neural_response_generation_figure_0.png.jpg 358w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/neural_response_generation_figure_0.png-234x300.jpg 234w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><p id=\"caption-attachment-584962\" class=\"wp-caption-text\">A palette allows a painter to arrange and mix paints of different colors. SpaceFusion seeks to help AI scientists do similar things for different models trained on different datasets.<\/p><\/div>\n<p>For deep learning models today, datasets are usually represented by vectors in different latent spaces using different neural networks. In the paper \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/jointly-optimizing-diversity-and-relevance-in-neural-response-generation-2\/\">Jointly Optimizing Diversity and Relevance in Neural Response Generation<\/a>,\u201d my co-authors and I propose SpaceFusion, a learning paradigm to align these different latent spaces\u2014arrange and mix them smoothly like the paint on a palette\u2014so AI can leverage the patterns and knowledge embedded in each of them. 
This work, which we\u2019re presenting at the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/naacl2019.org\/\">2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)<\/a>, is part of the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/data-driven-conversation\/\">Data-Driven Conversation<\/a> project, and an implementation of it is available on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/golsun\/SpaceFusion\">GitHub<\/a>.<\/p>\n<h3>Capturing the color of human conversation<\/h3>\n<p>As a first attempt, we applied this technique to neural conversational AI. In our setup, a neural model is expected to generate relevant and interesting responses given a conversation history, or context. While promising advances in neural conversation models have been made, these models tend to play it safe, producing generic and dull responses. 
Approaches have been developed to diversify these responses and better capture the color of human conversation, but oftentimes, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/generating-informative-and-diverse-conversational-responses-via-adversarial-information-maximization\/\">there is a tradeoff, with relevancy declining<\/a>.<\/p>\n<div id=\"attachment_584971\" style=\"width: 610px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/mix_S2S_AE_v2.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-584971\" class=\"wp-image-584971\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/mix_S2S_AE_v2-1024x565.jpg\" alt=\"Figure 1: Like a palette allows for the easy combination of paints, SpaceFusion aligns, or mixes, the latent spaces learned from a sequence-to-sequence (S2S, red dots) model and an autoencoder (AE, blue dots) to jointly utilize the two models more efficiently.\" width=\"600\" height=\"331\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/mix_S2S_AE_v2-1024x565.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/mix_S2S_AE_v2-300x165.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/mix_S2S_AE_v2-768x424.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/mix_S2S_AE_v2.jpg 1329w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><p id=\"caption-attachment-584971\" class=\"wp-caption-text\">Figure 1: Like a palette allows for the easy combination of paints, SpaceFusion aligns, or mixes, the latent spaces learned from a sequence-to-sequence (S2S, <span class=\"wp-caption-text\" style=\"color: #ff0000;\">red<\/span> dots) model and an autoencoder (AE, <span class=\"wp-caption-text\" style=\"color: #0000ff;\">blue<\/span> dots) to jointly utilize the two models more 
efficiently.<\/p><\/div>\n<p>SpaceFusion tackles this problem by aligning the latent spaces learned from two models (Figure 1):<\/p>\n<ul>\n<li>a <strong>sequence-to-sequence (S2S) model<\/strong>, which aims to produce relevant responses, but may lack diversity; and<\/li>\n<li>an <strong>autoencoder (AE) model<\/strong>, which is capable of representing diverse responses, but doesn\u2019t capture their relation to the conversation.<\/li>\n<\/ul>\n<p>The jointly learned model can utilize the strengths of both models and arrange data points in a more structured way.<\/p>\n<div id=\"attachment_584938\" style=\"width: 610px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-584938\" class=\"wp-image-584938\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-1024x577.png\" alt=\"Figure 2: The above illustrates one context and its multiple responses in the latent space induced by SpaceFusion. 
Distance and direction from the predicted response vector given the context roughly match the relevance and diversity, respectively.\" width=\"600\" height=\"338\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-1024x577.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-768x433.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788.png 1401w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><p id=\"caption-attachment-584938\" class=\"wp-caption-text\">Figure 2: The above illustrates one context and its multiple responses in the latent space induced by SpaceFusion. Distance and direction from the predicted response vector given the context roughly match the relevance and diversity, respectively.<\/p><\/div>\n<p>For example, as illustrated in Figure 2, given a context\u2014in this case, \u201cAnyone want to start this game?\u201d\u2014the positive responses \u201cI\u2019d love to play it\u201d and \u201cYes, I do\u201d are arranged along the same direction. The negative ones\u2014\u201cI\u2019m not interested in the game\u201d and \u201cNo, I don\u2019t\u201d\u2014are mapped on a line in another direction. Diversity in responses is achieved by exploring the latent space along different directions. 
Furthermore, the distance in the latent space corresponds to relevance. Responses farther away from the context\u2014\u201cYes, I do\u201d and \u201cNo, I don\u2019t\u201d\u2014are usually generic, while those closer are more relevant to the specific context: \u201cI\u2019m not interested in the game\u201d and \u201cWhen will you?\u201d<\/p>\n<p>SpaceFusion disentangles the criteria of relevance and diversity and represents them in two independent dimensions, distance and direction respectively, making it easier to jointly optimize both. Our experiments and human evaluation have shown that SpaceFusion outperforms competitive baselines on both criteria.<\/p>\n<h3>Learning a shared latent space<\/h3>\n<p>So, how exactly does SpaceFusion align different latent spaces?<\/p>\n<p>The idea is quite intuitive: For each pair of points from two different latent spaces, we minimize their distance in the shared latent space and encourage a smooth transition between them. This is done by adding two novel regularization terms, a distance term and a smoothness term, to the objective function.<\/p>\n<p>Taking conversation as the example, the distance term measures the Euclidean distance between a point from the S2S latent space, which is mapped from the context and represents the predicted response, and the points from the AE latent space, which correspond to its target responses. Minimizing this distance encourages the S2S model to map the context to a point close to and surrounded by its responses in the shared latent space, as illustrated in Figure 2.<\/p>\n<p>The smoothness term measures the likelihood of generating the target response from a random interpolation between the point mapped from the context and the one mapped from the response. By maximizing this likelihood, we encourage a smooth transition in the meaning of the generated responses as we move away from the context. 
This allows us to explore the neighborhood of the point predicted by the S2S model and thus generate diverse responses that are relevant to the context.<\/p>\n<p>With these two regularization terms added to the objective function, training is constrained by both distance and smoothness: it not only optimizes the performance of each model in its own latent space, but also aligns the two spaces by imposing the desired structure on them. Our work focused on conversational models, but we expect that SpaceFusion can align the latent spaces learned by other models trained on different datasets. This could make it possible to bridge the abilities and knowledge domains learned by individual AI systems and is a baby step toward a more general intelligence.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A palette makes it easy for painters to arrange and mix paints of different colors as they create art on the canvas before them. Having a similar tool that could allow AI to jointly learn from diverse data sources such as those for conversations, narratives, images, and knowledge could open doors for researchers and scientists 
[&hellip;]<\/p>\n","protected":false},"author":38022,"featured_media":584938,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"categories":[243622],"tags":[],"research-area":[13545],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-584920","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-human-language-technologies","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[171447],"related-events":[589690],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788.png\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788.png 1401w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-768x433.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-1024x577.png 1024w, 
https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2019\/05\/NAACL_Neural-Response-Generation_Site_1400x788-343x193.png 343w\" sizes=\"(max-width: 960px) 100vw, 960px\" \/>","byline":"Xiang Gao","formattedDate":"May 8, 2019","formattedExcerpt":"A palette makes it easy for painters to arrange and mix paints of different colors as they create art on the canvas before them. Having a similar tool that could allow AI to jointly learn from diverse data sources such as those for conversations, narratives,&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/584920"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38022"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=584920"}],"version-history":[{"count":10,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/584920\/revisions"}],"predecessor-version":[{"id":585061,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/584920\/revisions\/585061"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/584938"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=584920"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mic
rosoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=584920"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=584920"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=584920"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=584920"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=584920"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=584920"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=584920"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=584920"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=584920"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=584920"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}