{"id":775786,"date":"2021-06-30T17:02:44","date_gmt":"2021-07-01T00:02:44","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=775786"},"modified":"2021-10-19T08:06:12","modified_gmt":"2021-10-19T15:06:12","slug":"visual-recognition-beyond-appearances-and-its-robotic-applications","status":"publish","type":"msr-video","link":"https:\/\/www.microsoft.com\/en-us\/research\/video\/visual-recognition-beyond-appearances-and-its-robotic-applications\/","title":{"rendered":"Visual Recognition beyond Appearances, and its Robotic Applications"},"content":{"rendered":"

The goal of computer vision, as framed by Marr, is to develop algorithms that answer "What is Where at When" from visual appearance. The speaker, among others, recognizes the importance of studying the underlying entities and relations beyond visual appearance, following an Active Perception paradigm. This talk presents the speaker's efforts over the last decade, ranging from 1) reasoning beyond appearance for visual question answering and image/video captioning tasks, and their evaluation, through 2) temporal and self-supervised knowledge distillation with incremental knowledge transfer, to 3) the roles both play in a robotic visual learning framework, demonstrated on a robotic indoor object search task. The talk also features the Active Perception Group (APG)'s ongoing projects (NSF RI, NRI and CPS, DARPA KAIROS, and Arizona IAM), based at the ASU School of Computing, Informatics, and Decision Systems Engineering (CIDSE), which address emerging national challenges in the autonomous driving and AI security domains.
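As background for the distillation thread above: the talk's temporal and self-supervised variants build on the generic teacher-student recipe sketched below. This is a minimal illustrative example only, not the speaker's method; the function name, temperature `T`, and mixing weight `alpha` are assumptions chosen for clarity.

```python
# Hypothetical minimal sketch of generic (Hinton-style) knowledge distillation.
# Illustrative only -- the talk's temporal / self-supervised distillation with
# incremental knowledge transfer adds machinery beyond this basic loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend a soft teacher-matching term with the usual hard-label loss."""
    # Soft targets: KL divergence between the temperature-scaled teacher
    # distribution and the student's temperature-scaled log-probabilities.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the unscaled objective
    # Hard targets: standard cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

A higher temperature softens the teacher's distribution so the student also learns from the relative probabilities of incorrect classes, which is the usual motivation for distilling rather than training on hard labels alone.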

View slides

List of major papers covered in the talk:

V&L model robustness
ECCV 2020: VQA-LOL: Visual Question Answering under the Lens of Logic
ACL 2021: SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis
EMNLP 2020: MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering
EMNLP 2020: Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

Robotic object search
CVPR 2021: Hierarchical and Partially Observable Goal-driven Policy Learning with Goals Relational Graph
ICRA 2021/RA-L: Efficient Robotic Object Search via HIEM: Hierarchical Policy Learning with Intrinsic-Extrinsic Modeling

Other teasers:

AI security/GAN attribution
ICLR 2021: Decentralized Attribution of Generative Models
AAAI 2021: Attribute-Guided Adversarial Training for Robustness to Natural Perturbations

Watch the talk: https://youtu.be/RRxbNcgvPG4