{"id":307622,"date":"2007-05-10T11:00:21","date_gmt":"2007-05-10T18:00:21","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=307622"},"modified":"2016-10-18T23:16:41","modified_gmt":"2016-10-19T06:16:41","slug":"text-search-tricks-speak-volumes-image-search","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/text-search-tricks-speak-volumes-image-search\/","title":{"rendered":"Text-Search Tricks Speak Volumes in Image Search"},"content":{"rendered":"

By Rob Knies, Managing Editor, Microsoft Research

It’s fair to assume that anyone who knows anything about the Web—anyone reading these words—is comfortably familiar with text search. It has become perhaps the pre-eminent way to extract information from the Internet, and it is extremely simple to invoke: Type a few characters into a search box, hit Go, and see what results.

Simple for text, that is. It’s not quite so easy for images, which are proliferating on the Web these days. Sure, you can search for images, but what you get is based on the words associated with the image. The problem is that not all images are annotated, or adequately so.

Pure image search—based on the image itself, not words appended to it—is difficult. But it would be extremely valuable were it to be harnessed. Imagine yourself lost in an unfamiliar part of town. You pull out your camera phone, shoot a photo of your surroundings, send it to a database of images from your city, and, voilà, get a match that not only tells you where you are, but also provides a map that shows you how to get to where you’re going.

Sound too good to be true? Well, maybe. But don’t tell that to Xing Xie and Menglei Jia.

Xie and Jia work for Microsoft Research Asia, a lab known for its computer-vision acumen, and lately, they have been conducting experiments as part of a project called Photo2Search. The fruits of that research someday may revolutionize the way people interact with their surroundings.

The project uses a data set of more than a million photos of Seattle from Virtual Earth™. And the researchers have applied a novel idea that leverages the advances achieved in text search, letting people bring familiar search concepts to images, as well.

“This project is about how to search millions of photos using a mobile phone,” says Xie, lead researcher for the Web Search & Mining group at Microsoft Research’s Beijing-based lab. “We could build an index for these photos, use the photo to find the most similar ones in the database, and have the location and other information returned to us.”

Such results, conceivably, could be achieved today, but at a prohibitively high computational cost. It’s just not as easy for a computer to search for images or parts of images as it is to search for less-complex words.

If you can’t beat ’em, join ’em. Xie and Jia have decided to pursue the path of generating “image words,” which would be collected into an “image vocabulary.” That parallel with text retrieval, they think, could help lead to similar search success.

The concept itself isn’t new. What is new, though, is applying it to a data set of a million images. Such a vast collection enables the creation of a vocabulary rich enough to be extended to other data sets, of different sizes or even different types.

“Some researchers,” says Jia, who also works at Microsoft Research Asia, “claimed that a common vocabulary is not available. But their vocabularies were based on thousands of images. They claimed that those vocabularies cannot be shared between different databases. Our contribution is that we use a million photos.”

Xie agrees.

“Our main contribution is that we use a very large-scale database to study this problem,” he says. “This is the first study on a very large-scale database, so we believe the conclusions from this study are more reliable.”

To construct an “image vocabulary,” you first must create “image words.” And just as a text word is composed of a collection of elements—letters—so is an “image word.”

“For each image,” Xie explains, “we first detect the image features, by a feature-point detector. That means we can find some special points, some interest points, in the image.”

A photo of a building, for example, may have image features such as windows or doors. Each feature is represented by a vector, describing its characteristics such as orientation and intensity, and a collection of vectors becomes a set of elements that, combined, form an “image word.”
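The article does not name the detector Photo2Search used, so purely as an illustration of the step Xie describes, here is a minimal sketch using OpenCV’s SIFT detector as a stand-in; the input filename is hypothetical.

```python
# Minimal sketch of feature-point detection, with OpenCV's SIFT as a
# stand-in assumption; the article does not say which detector the
# project actually used.
import cv2

# "building.jpg" is a hypothetical input photo.
img = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()

# keypoints are the "interest points" (e.g., corners of windows and
# doors); each descriptor is a 128-dimensional vector summarizing the
# local orientation and intensity structure around its point.
keypoints, descriptors = sift.detectAndCompute(img, None)
print(f"{len(keypoints)} interest points, descriptors shape {descriptors.shape}")
```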

The “image word”—in actuality a cluster of image features—itself has no semantic meaning; it’s just a collection of data. But the specifics of those data, taken in aggregate, can be compiled and compared across a data set to locate similar collections that can represent the same image, even if viewed from a different angle, under different lighting conditions, or with different objects present or absent in an image.

Take two photos of a house. One has a car in front, one does not. But the number of windows, and their visual characteristics, remains constant. Identify the similarity of those constants, and you have a match.

“If we can map each image as a document,” Xie says, “then we can also build a similar search index for these image documents. And if we build a vocabulary for image features, we can map each image feature to an image word.”
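To make that feature-to-word mapping concrete, here is a hedged sketch assuming k-means clustering over pooled descriptors, a common way to build a visual vocabulary; the article does not specify Photo2Search’s clustering method, and the descriptors and vocabulary size below are toy placeholders (the project’s experiments pointed to roughly 300,000 words for the real data set).

```python
# Hedged sketch: build an "image vocabulary" by clustering feature
# descriptors with k-means; each cluster center acts as one "image
# word." Random arrays stand in for real descriptors, and n_clusters
# is kept tiny so the toy example runs quickly.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
all_descriptors = rng.random((10_000, 128), dtype=np.float32)  # pooled from many photos

vocab = MiniBatchKMeans(n_clusters=1_000, batch_size=2_048)
vocab.fit(all_descriptors)

# Map one photo's features to words: each descriptor becomes the id of
# its nearest cluster center, turning the photo into a "document" of
# word ids that a text-style engine can index.
photo_descriptors = rng.random((350, 128), dtype=np.float32)
word_ids = vocab.predict(photo_descriptors)
```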

Those image words have the advantage of shedding the complex, high-dimensional nature of the image itself, making the search fast enough to become feasible. Instead of looking for a white house with three windows on a sunny day with a car parked in front, a search engine can search for a house with three windows that are, relatively speaking, equally far apart, oriented in a similar direction, and shaped alike.
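The speed comes from the same machinery text engines use: an inverted index mapping words to the documents that contain them. A minimal, self-contained sketch of that idea applied to image words follows; it is illustrative only, since the article does not describe Photo2Search’s actual index structure.

```python
# Illustrative inverted index over image words: each word id points to
# the photos containing it, so a query only touches photos that share
# at least one word with it, instead of comparing raw pixels.
from collections import Counter, defaultdict

index = defaultdict(set)  # word id -> set of photo ids

def add_photo(photo_id, word_ids):
    for w in set(word_ids):
        index[w].add(photo_id)

def search(query_word_ids, top_k=5):
    # Photos sharing more words with the query rank higher.
    votes = Counter()
    for w in set(query_word_ids):
        for photo_id in index[w]:
            votes[photo_id] += 1
    return votes.most_common(top_k)

add_photo("house_with_car", [3, 17, 17, 42])
add_photo("house_no_car", [3, 17, 42, 99])
print(search([3, 17, 42]))  # both houses match despite the car
```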

The project started last summer, with Jia serving as the main developer for the effort. And with the huge data set at their disposal, he and Xie began to examine the ramifications of what they had available to them.

“Our experiment,” Xie says, “showed that a vocabulary generated from a million-size data set can achieve satisfactory accuracy when applied to either larger data sets or different types of data sets.”

One interesting result of their work is the determination that, when it comes to image vocabularies, bigger is not necessarily better.

“Actually, a larger vocabulary size can lead to a larger computational cost,” Xie says. “We’re better off using a smaller vocabulary size when it’s possible.”

The work seemed to indicate that an image vocabulary of about 300,000 words is perhaps the optimal size, minimizing analysis time while maximizing accuracy.

The project grew out of earlier work Xie had performed on searching databases via camera phones. But that work used data sets in the tens of thousands, not large enough to support vocabulary generation and the advantages an image vocabulary could offer.

“Photo-sharing Web sites like Flickr,” he notes, “get a million new photos each day. So the most critical thing is that we want to support very large-scale image databases.”

The project has identified further work that needs to be done before such techniques can be released to the world at large.

“There are two criteria,” Xie says. “Speed and accuracy. Speedwise, this is already very quick. Each query takes just 0.1 seconds or 0.2 seconds.

“But for accuracy, it’s still not good enough for real applications. We are trying to improve the performance while keeping the speed fast enough. That’s one of our current challenges.”

Again, text retrieval can serve as a model. One example would be proximity. In text search, a word that appears close to another can help flag a potential match. So, too, could image words that appear adjacent to each other.
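The article does not say how such proximity would be implemented; purely as an illustration of the idea, one could reward candidate photos whose matched words are also spatially close in both images, much as a text engine rewards query terms that appear near one another in a document.

```python
# Illustrative only: a possible spatial-proximity bonus. Matched words
# whose neighbors in the query photo are also neighbors in the
# candidate photo support each other, echoing term proximity in text.
import numpy as np

def proximity_bonus(query_pts, match_pts, radius=50.0):
    """query_pts, match_pts: (N, 2) pixel coordinates of the N matched
    image words in the query and candidate photos, in match order."""
    bonus = 0
    for i in range(len(query_pts)):
        q_near = np.linalg.norm(query_pts - query_pts[i], axis=1) < radius
        m_near = np.linalg.norm(match_pts - match_pts[i], axis=1) < radius
        bonus += int(np.sum(q_near & m_near)) - 1  # exclude self-match
    return bonus
```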

“That,” Xie says, “is one direction we want to go.”

Another seemingly promising path appears less likely to pay off than originally thought. Xie and Jia had hoped to use term weighting, a text-search concept in which the relative frequency of words provides clues to the documents in which they reside.

But as it turns out, while some text words are quite common and others are extremely rare, when it comes to image words, the distribution is much more even.

“Term weighting is still useful,” Xie says about the image-vocabulary approach, “but it’s not as useful as it is with text.”
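In text retrieval, term weighting usually means tf-idf: rare words count for more than common ones. Here is a sketch of that weighting carried over to image words, with Xie’s caveat built in; the function names and inputs are illustrative, not the project’s implementation.

```python
# Sketch of tf-idf term weighting applied to image words. Because image
# words are distributed more evenly than text words, the idf factor
# varies less across the vocabulary, which is why it helps less here.
import math
from collections import Counter

def tfidf_weights(photo_words, num_photos, photos_containing):
    """photo_words: word ids in one photo; photos_containing: word id ->
    number of photos in the database containing that word."""
    counts = Counter(photo_words)
    n = len(photo_words)
    return {
        w: (c / n) * math.log(num_photos / photos_containing[w])
        for w, c in counts.items()
    }
```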

That’s the way research goes. Not everything you try works, so you try something different instead. But, as Jia notes, he and Xie think they’re onto something.

“In the last 20 or 30 years,” Jia says, “researchers in computer vision have wanted to focus on how to understand the content of an image. We have viewed image retrieval as a search topic, which means we don’t care about the exact content of the image. We focus on the keywords.

“In a text-search engine, people don’t care about what the document says. They care about the keyword. We can retrieve a document very efficiently by generating a vocabulary for our image database. That’s the biggest contribution of our project.”

For Xie, intensely interested in mobile Web search, the opportunity to work with millions of images has its own rewards.

“The proudest thing for me,” he says, “is we have done very large-scale experiments on this image-vocabulary problem. We don’t see any other researchers doing similar things. And from this kind of study, we get some very interesting conclusions.”
