{"id":506933,"date":"2018-09-21T19:55:00","date_gmt":"2018-09-22T02:55:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=506933"},"modified":"2024-02-28T07:05:43","modified_gmt":"2024-02-28T15:05:43","slug":"turbo-learning-for-captionbot-and-drawingbot","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/turbo-learning-for-captionbot-and-drawingbot\/","title":{"rendered":"Turbo Learning for CaptionBot and DrawingBot"},"content":{"rendered":"
In this paper, we study the problems of image captioning and text-to-image generation, and present a novel turbo learning approach that jointly trains an image-to-text generator (a.k.a. CaptionBot) and a text-to-image generator (a.k.a. DrawingBot). The key idea behind the joint training is that image-to-text generation and text-to-image generation, as dual problems, can form a closed loop and provide informative feedback to each other. Based on this feedback, we introduce a new loss metric that compares the original input with the output produced by the closed loop. Added to the loss metrics already used by CaptionBot and DrawingBot individually, this extra term makes the jointly trained CaptionBot and DrawingBot outperform their separately trained counterparts. Furthermore, the turbo-learning approach enables semi-supervised learning, since the closed loop can provide pseudo-labels for unlabeled samples. Experimental results on the COCO dataset demonstrate that the proposed turbo learning significantly improves the performance of both CaptionBot and DrawingBot.
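The closed-loop idea above lends itself to a compact illustration. The sketch below is a minimal toy rendering of one turbo-learning training step in PyTorch: a CaptionBot maps an image to a caption, a DrawingBot maps the caption back to an image, and the turbo loss compares the closed-loop reconstruction with the original input; on unlabeled images, the generated caption serves as a pseudo-label. The module names, the MSE losses on toy embedding vectors, and the turbo_weight coefficient are all illustrative assumptions, not the paper's actual models, which are full sequence and image generators with their own task losses.

```python
# Minimal sketch of the turbo-learning closed loop described in the abstract.
# SimpleCaptionBot, SimpleDrawingBot, and turbo_weight are illustrative
# assumptions; the paper's real models and losses differ.
import torch
import torch.nn as nn

class SimpleCaptionBot(nn.Module):
    """Toy image-to-text generator: maps an image tensor to a caption embedding."""
    def __init__(self, img_dim=256, txt_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(),
                                 nn.Linear(256, txt_dim))

    def forward(self, img):
        return self.net(img)

class SimpleDrawingBot(nn.Module):
    """Toy text-to-image generator: maps a caption embedding back to an image tensor."""
    def __init__(self, txt_dim=128, img_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(txt_dim, 256), nn.ReLU(),
                                 nn.Linear(256, img_dim))

    def forward(self, txt):
        return self.net(txt)

caption_bot = SimpleCaptionBot()
drawing_bot = SimpleDrawingBot()
optimizer = torch.optim.Adam(
    list(caption_bot.parameters()) + list(drawing_bot.parameters()), lr=1e-4)
mse = nn.MSELoss()
turbo_weight = 0.1  # assumed weight on the closed-loop (turbo) loss

def turbo_step(img, txt=None):
    """One joint update. If txt is None, the sample is unlabeled and the
    closed loop supplies a pseudo-label (the semi-supervised case)."""
    pred_txt = caption_bot(img)        # image -> text
    recon_img = drawing_bot(pred_txt)  # text  -> image, closing the loop
    # Turbo loss: compare the original input with the closed-loop output.
    loss = turbo_weight * mse(recon_img, img)
    if txt is not None:
        # Standard supervised losses for each bot on labeled samples.
        loss = loss + mse(pred_txt, txt) + mse(drawing_bot(txt), img)
    else:
        # Unlabeled image: the generated caption acts as a pseudo-label for
        # the DrawingBot (detached so it serves as a fixed target).
        loss = loss + mse(drawing_bot(pred_txt.detach()), img)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage on random stand-in data: one labeled batch, one unlabeled batch.
img = torch.randn(8, 256)
txt = torch.randn(8, 128)
turbo_step(img, txt)   # supervised losses + turbo loss
turbo_step(img, None)  # semi-supervised update via pseudo-labels
```

The mirrored text-to-image-to-text loop would follow the same pattern, with the DrawingBot's output captioned and compared against the original text, so each generator provides feedback to the other in both directions.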
Authors: Qiuyuan Huang, Pengchuan Zhang, Oliver Wu, Lei Zhang

Published: December 5, 2018 (conference paper)

Research areas: Artificial Intelligence; Computer Vision; Human Language Technologies; Human-Computer Interaction

Download: https://www.microsoft.com/en-us/research/wp-content/uploads/2018/09/turbo-learning-captionbot-drawingbot.pdf

arXiv: https://arxiv.org/abs/1805.08170

Related projects: Agent AI; Vision and Language Intelligence