{"id":282233,"date":"2014-10-28T18:00:31","date_gmt":"2014-10-29T01:00:31","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=282233"},"modified":"2016-08-24T15:05:52","modified_gmt":"2016-08-24T22:05:52","slug":"new-deep-learning-take-image-recognition","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/new-deep-learning-take-image-recognition\/","title":{"rendered":"A New, Deep-Learning Take on Image Recognition"},"content":{"rendered":"

\nIn recent months, we\u2019ve heard a lot about deep neural networks and deep learning\u2014take Project Adam, for example\u2014and the sometimes eye-popping results they can have in addressing longstanding computing problems.<\/p>\n

The field of image recognition is also benefiting rapidly from such networks, along with the availability of prodigious data sets. Here, the networks in question are \u201cdeep convolutional neural networks\u201d (CNNs), which are inspired by biological processes in the human brain.<\/p>\n

Existing CNNs have a problem, though: the algorithms are too slow for practical object detection. Previously, the networks had to be applied thousands of times to a single image just to detect a few objects.<\/p>\n

Microsoft researchers have discovered a solution to this vexing computer-vision issue, one that will receive prominent mention in Beijing on Oct. 29 during the 16th annual Computing in the 21st Century Conference, an academic event founded and organized by Microsoft Research Asia (opens in new tab)<\/span><\/a>.<\/p>\n


Kaiming He<\/p><\/div>\n

The new solution speeds up the deep-learning object-detection system by as much as 100 times, yet retains outstanding accuracy. The advance is outlined in Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition<\/em> (opens in new tab)<\/span><\/a>, a research paper written by Kaiming He (opens in new tab)<\/span><\/a> and Jian Sun (opens in new tab)<\/span><\/a>, along with two students serving internships at the Asia lab: Xiangyu Zhang of Xi\u2019an Jiaotong University and Shaoqing Ren of the University of Science and Technology of China.<\/p>\n

\u201cImage recognition involves two core tasks: image classification and object detection,\u201d He explains. \u201cIn image classification, the computer is taught to recognize object categories, such as \u2018person,\u2019 \u2018cat,\u2019 \u2018dog,\u2019 or \u2018bike,\u2019 while in object detection, the computer needs to provide the precise positions of the objects in the image.\u201d<\/p>\n

The second task, Sun adds, is the more difficult of the two.<\/p>\n

\u201cWe need,\u201d he says, \u201cto answer \u2018what and where\u2019 for one or more objects in an image.\u201d<\/p>\n


A diagram of where the spatial pyramid pooling layer fits into the network stack.<\/p><\/div>\n

The aforementioned paper introduces a powerful new network structure that uses \u201cspatial pyramid pooling\u201d (SPP)\u2014a technique that can generate a descriptor from a region of any size.<\/p>\n

As the paper makes clear, information aggregation happens deeper in the network. With the new technique, the network is computed only once per image, yet it can still produce descriptors for thousands of regions. Those descriptors are then used to detect objects quickly.<\/p>\n

CNNs, like other deep neural networks, are designed to mimic the structure of the human brain. The researchers\u2019 approach of aggregating the visual information at a deeper stage, they contend, conforms more with the hierarchical information processing that occurs in the brain.<\/p>\n

The paper that outlines their approach states:<\/p>\n

\u201cWhen an object comes into our field of view, it is more reasonable that our brains consider it as a whole instead of cropping it into several \u2018views\u2019 at the beginning. Similarly, it is unlikely that our brains distort all object candidates into fixed-size regions for detecting\/locating them. It is more likely that our brains handle arbitrarily shaped objects at some deeper layers by aggregating the already deeply processed information from the previous layers.\u201d<\/p>\n

The SPP paper aims to give networks a more principled pooling strategy, and SPP turns out to have properties that can transform CNNs. Most notably, spatial pyramid pooling generates a fixed-size output regardless of the size of the input image, something previous deep-learning efforts couldn\u2019t do.<\/p>\n

\u201cThink of an image as a canvas,\u201d He says in explaining the pooling concept. \u201cWe may want to use small pieces of square paper to cover the entire canvas. When we are given another canvas of a different size, we may need a different number of pieces of square paper.<\/p>\n

\u201cThe \u2018spatial pyramid pooling\u2019 always uses the same number of pieces of paper to cover the canvas. To do so, we need to adjust the paper size according to the canvas size.\u201d<\/p>\n
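The adaptive pooling He describes with the canvas analogy can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the researchers\u2019 code: the pyramid levels (1, 2, and 4 bins per side) and the use of max-pooling are assumptions chosen to mirror the paper\u2019s idea. The bin boundaries (the \u201cpieces of paper\u201d) stretch to fit the feature map (the \u201ccanvas\u201d), so the output length is the same for any input size.<\/p>\n

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    # fmap: conv feature map of shape (H, W, C), with arbitrary H and W
    H, W, C = fmap.shape
    parts = []
    for n in levels:
        # n*n bins whose edges adapt to the input, like resized pieces of paper
        hs = np.linspace(0, H, n + 1).astype(int)
        ws = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = fmap[hs[i]:max(hs[i + 1], hs[i] + 1),
                            ws[j]:max(ws[j + 1], ws[j] + 1)]
                parts.append(cell.max(axis=(0, 1)))  # max-pool each bin to C values
    # fixed length: C * (1 + 4 + 16) here, regardless of H and W
    return np.concatenate(parts)
```

Feeding in feature maps of different spatial sizes yields descriptors of identical length, which is what lets a fixed-size classifier sit on top of variable-size images.<\/p>\n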


Jian Sun<\/p><\/div>\n

Because of this flexibility in sizes, SPP can pool features extracted from any region without repeatedly computing the convolutional network. This property yields a significantly faster object-detection system.<\/p>\n
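A toy NumPy sketch of that speedup, again an assumption-laden illustration rather than the authors\u2019 implementation: the expensive convolutions run once per image, and each candidate region then costs only a cheap adaptive pooling pass over the shared feature map. The region coordinates, the 4x4 grid, and the random stand-in for conv features are all hypothetical.<\/p>\n

```python
import numpy as np

def pool_region(fmap, y0, y1, x0, x1, n=4):
    # Adaptively max-pool one region of a shared conv feature map into an
    # n*n*C descriptor, with no extra convolution over the region itself.
    hs = np.linspace(y0, y1, n + 1).astype(int)
    ws = np.linspace(x0, x1, n + 1).astype(int)
    rows = []
    for i in range(n):
        for j in range(n):
            cell = fmap[hs[i]:max(hs[i + 1], hs[i] + 1),
                        ws[j]:max(ws[j + 1], ws[j] + 1)]
            rows.append(cell.max(axis=(0, 1)))
    return np.concatenate(rows)

# Convolutions run once; thousands of candidate boxes reuse the same map.
fmap = np.random.rand(32, 48, 16)        # stand-in for conv feature map
d1 = pool_region(fmap, 0, 10, 0, 10)     # small candidate box
d2 = pool_region(fmap, 5, 32, 12, 48)    # large candidate box
assert d1.shape == d2.shape              # same-length descriptors
```

Both boxes produce equal-length descriptors from one forward pass, in contrast to earlier pipelines that re-ran the network per region.<\/p>\n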

How much? The new technique is 20 to 100 times faster than the previous leading solutions for object detection. The Microsoft researchers\u2019 solution is the only real-time CNN detection system among the 38 teams participating in the ImageNet Large Scale Visual Recognition Challenge 2014 (opens in new tab)<\/span><\/a>, which evaluates algorithms for object detection and image classification. The system also has top-notch accuracy, ranking second in detection and third in classification in the competition.<\/p>\n

Clearly, the SPP research has demonstrated the sort of progress that warrants further exploration.<\/p>\n

\u201cThough the current deep-learning models are breakthroughs over traditional methods, they are still far from human performance, particularly on the challenging detection task,\u201d He says. \u201cWe will continue to improve the quality of our methods.\u201d<\/p>\n

The ability to access ever-larger data sets will help advance the research.<\/p>\n

\u201cOne of the important next steps,\u201d Sun says, \u201cis to obtain much larger and richer training data. That will significantly impact the research in this direction.\u201d<\/p>\n

That said, it\u2019s hard for He to disguise his pride in what he and his colleagues have achieved.<\/p>\n

\u201cOur work is the fastest deep-learning system for accurate object detection,\u201d he says. \u201cThe speed is getting very close to the requirement for consumer usage.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"

In recent months, we\u2019ve heard a lot about deep neural networks and deep learning\u2014take Project Adam, for example\u2014and the sometimes eye-popping results they can have in addressing longstanding computing problems. The field of image recognition also is benefiting rapidly from the use of such networks, along with the availability of prodigious data sets. In this […]<\/p>\n","protected":false},"author":39507,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"categories":[194471,194480,194485],"tags":[195155,210674,186925,201273,187011,208719,210683,210677,210680],"research-area":[13561,13562,13551,13547],"msr-region":[197903],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-282233","post","type-post","status-publish","format-standard","hentry","category-computer-vision","category-graphics-and-multimedia","category-networking","tag-computing-in-the-21st-century-conference","tag-deep-convolutional-neural-networks-cnns","tag-deep-learning","tag-deep-neural-networks","tag-image-classification","tag-image-recognition","tag-imagenet-large-scale-visual-recognition-challenge-2014","tag-object-detection","tag-spatial-pyramid-pooling-spp","msr-research-area-algorithms","msr-research-area-computer-vision","msr-research-area-graphics-and-multimedia","msr-research-area-systems-and-networking","msr-region-asia-pacific","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199560],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researcher
s":[],"msr_type":"Post","byline":"","formattedDate":"October 28, 2014","formattedExcerpt":"In recent months, we\u2019ve heard a lot about deep neural networks and deep learning\u2014take Project Adam, for example\u2014and the sometimes eye-popping results they can have in addressing longstanding computing problems. The field of image recognition also is benefiting rapidly from the use of such networks,…","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/282233"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/39507"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=282233"}],"version-history":[{"count":7,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/282233\/revisions"}],"predecessor-version":[{"id":282854,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/282233\/revisions\/282854"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=282233"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=282233"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=282233"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=282233"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=282233"},{"taxonomy":"msr-event-type
","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=282233"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=282233"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=282233"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=282233"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=282233"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=282233"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}