<h1>Image Recognition: Current Challenges and Emerging Opportunities</h1>
<p><em>November 1, 2018</em></p>
<p><em>By Jifeng Dai and Steve Lin, Microsoft Research Asia</em></p>
<p>Over the decades that we've spent as researchers and technical leaders in computer vision, few developments have been as astounding as the rapid progress in image recognition. In the past several years, we have seen object detection performance on the PASCAL VOC benchmark skyrocket from approximately 30 percent mean average precision to more than 90 percent today. For image classification on the challenging ImageNet dataset, state-of-the-art algorithms now exceed human performance. These improvements in image understanding have begun to impact a wide range of high-value applications, including video surveillance, autonomous driving, and intelligent healthcare.</p>
<p>The driving force behind the recent advances in image recognition is deep learning, whose success is powered by large-scale datasets, powerful models, and vast computational resources. For a variety of image recognition tasks, carefully designed deep neural networks have greatly surpassed previous methods based on hand-crafted image features. Yet despite the great success of deep learning in image recognition so far, numerous challenges remain to be overcome before it can be employed for broader use.</p>
<h3>Improving model generalization</h3>
<p>One of these challenges is how to train models that generalize well to real-world settings not seen during training. In current practice, a model is trained and evaluated on a dataset that is randomly split into training and test sets. The test set therefore has the same data distribution as the training set, since both are sampled from the same range of scene content and imaging conditions present in that data. In real-world applications, however, the test images may come from data distributions different from those used in training.</p>
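This evaluation gap can be sketched with a toy experiment. Everything below is illustrative and not from the post: synthetic two-class Gaussian "images" stand in for real data, a nearest-centroid classifier stands in for a deep network, and a mean shift stands in for a change in imaging conditions. The model holds up on an i.i.d. test split but degrades on the shifted one.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, shift=0.0):
    """Two Gaussian classes; `shift` moves both class means,
    mimicking a change in imaging conditions at deployment time."""
    x0 = rng.normal(loc=-1.0 + shift, scale=1.0, size=(n, 2))
    x1 = rng.normal(loc=+1.0 + shift, scale=1.0, size=(n, 2))
    X = np.vstack([x0, x1])
    y = np.array([0] * n + [1] * n)
    return X, y

# Fit a nearest-centroid classifier on in-distribution training data.
X_train, y_train = make_data(500)
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def accuracy(X, y):
    pred = np.argmin(((X[:, None, :] - centroids) ** 2).sum(-1), axis=1)
    return (pred == y).mean()

# An i.i.d. test split matches the training distribution...
X_iid, y_iid = make_data(500)
# ...while a shifted test set mimics unseen real-world conditions.
X_shift, y_shift = make_data(500, shift=1.5)

print(f"i.i.d. test accuracy:   {accuracy(X_iid, y_iid):.2f}")
print(f"shifted test accuracy:  {accuracy(X_shift, y_shift):.2f}")
```

The point is not the particular classifier but the protocol: a random split can only measure performance under the training distribution, so it systematically overstates accuracy on deployment data that drifts from it.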
<p>For example, the unseen data may differ in viewing angles, object scales, scene configurations, and camera properties. A recent study shows that such gaps in data distribution can lead to significant drops in accuracy across a wide variety of deep network architectures [1]. The susceptibility of current models to natural variations in the data distribution can be a severe drawback in critical applications such as autonomous vehicle navigation.</p>
<h3>Exploiting small and ultra-large-scale data</h3>
<p>Another existing challenge is how to better exploit small-scale training data. While deep learning has shown great success on tasks with large amounts of labeled data, current techniques generally break down when only a few labeled examples are available. This condition, often referred to as few-shot learning, demands careful consideration in practical applications. For example, a household robot is expected to recognize a new object after being shown it just once. A human can naturally do so even if the object's appearance changes, such as a blanket that has been folded up. How to endow deep networks with such generalization ability is an open research problem.</p>
<p>At the other extreme is the question of how the performance of recognition algorithms can be effectively scaled with ultra-large-scale data. For critical applications such as autonomous driving, the cost of recognition errors is very high, so enormous datasets, containing hundreds of millions of images with rich annotations, are built in the hope of dramatically improving the accuracy of the trained models. However, a recent study suggests that current algorithms cannot exploit such ultra-large-scale data as effectively [2]. On the JFT dataset, which contains 300 million annotated images, the performance of a variety of deep networks increases only logarithmically with the amount of training data (Image 1).</p>
<p>The diminishing benefits of greater training data at large scales present a significant issue to address.</p>
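To make the logarithmic trend concrete, here is a purely synthetic illustration of the functional form reported in [2]. The intercept and slope below are invented numbers, not measurements from the JFT study; the shape is what matters: every tenfold increase in data buys only a constant number of accuracy points.

```python
import numpy as np

# Hypothetical fit of a logarithmic scaling curve (all numbers are
# illustrative, not taken from [2]): accuracy = a + b * log10(n).
a, b = 40.0, 5.0  # assumed intercept and slope, in accuracy points

def accuracy(n_images):
    return a + b * np.log10(n_images)

for n in [10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]:
    print(f"{n:>11,d} images -> {accuracy(n):5.1f} points (assumed curve)")
```

Under this curve, going from 10 thousand to 100 thousand images yields the same absolute gain as going from 1 million to 10 million, even though the latter costs a hundred times more data to collect and annotate, which is the diminishing-returns problem described above.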
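The few-shot setting described above can be sketched in the spirit of metric-learning approaches (e.g., prototypical networks): each novel class is represented by the embedding of its single labeled example, and a query is assigned to the nearest prototype. The random-projection "embedding", the class names, and the noise model here are all hypothetical stand-ins; a real system would embed images with a trained deep network.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained embedding network: a fixed random linear
# projection followed by L2 normalization. A real system would use
# features from a deep model trained on base classes.
W = rng.normal(size=(64, 16))

def embed(x):
    v = x @ W
    return v / np.linalg.norm(v)

# One labeled example ("shot") per novel class becomes that class's prototype.
support = {"blanket": rng.normal(size=64), "mug": rng.normal(size=64)}
prototypes = {name: embed(x) for name, x in support.items()}

def classify(x):
    """Assign a query to the class whose prototype has the highest
    cosine similarity in embedding space."""
    e = embed(x)
    return max(prototypes, key=lambda name: float(e @ prototypes[name]))

# A "folded blanket": the same object under a mild perturbation.
query = support["blanket"] + 0.1 * rng.normal(size=64)
print(classify(query))
```

Because no network weights are updated at recognition time, a single example per class suffices; the open problem the post raises is learning embeddings robust enough that this kind of comparison survives real-world variation such as deformation.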