Text-Search Tricks Speak Volumes in Image Search

By Rob Knies, Managing Editor, Microsoft Research

It’s fair to assume that anyone who knows anything about the Web—anyone reading these words—is comfortably familiar with text search. It has become perhaps the pre-eminent way to extract information from the Internet, and it is extremely simple to invoke: Type a few characters into a search box, hit Go, and see what results.

Simple for text, that is. It’s not quite so easy for images, which are proliferating on the Web these days. Sure, you can search for images, but what you get is based on the words associated with the image. The problem is that not all images are annotated, or adequately so.

Pure image search—based on the image itself, not words appended to it—is difficult. But it would be extremely valuable were it to be harnessed. Imagine yourself lost in an unfamiliar part of town. You pull out your camera phone, shoot a photo of your surroundings, send it to a database of images from your city, and, voilà, get a match that not only tells you where you’re at, but also provides a map that shows you how to get to where you’re going.

Sound too good to be true? Well, maybe. But don’t tell that to Xing Xie and Menglei Jia.

Xie and Jia work for Microsoft Research Asia, a lab known for its computer-vision acumen, and lately, they have been conducting experiments as part of a project called Photo2Search. The fruits of that research someday may revolutionize the way people interact with their surroundings.

The project uses a data set of more than a million photos of Seattle from Virtual Earth™. And the researchers have applied a novel idea to leverage the advances achieved in text search to enable people to apply familiar search concepts to images, as well.

“This project is about how to search millions of photos using a mobile phone,” says Xie, lead researcher for the Web Search & Mining group for Microsoft Research’s Beijing-based lab. “We could build an index for these photos, use the photo to find the most similar ones in the database, and have the location and other information returned to us.”

Such results, conceivably, could be achieved today, but at a prohibitively high computational cost. It’s just not as easy for a computer to search for images or parts of images as it is to search for less-complex words.

If you can’t beat ‘em, join ‘em. Xie and Jia have decided to pursue the path of generating “image words,” which would be collected into an “image vocabulary.” That parallel with text retrieval, they think, could help lead to similar search success.

The concept itself isn’t new. What is new, though, is applying it to a data set of a million images. Such a vast collection enables the creation of a vocabulary rich enough to be extended to other data sets, of different sizes or even different types.

“Some researchers,” says Jia, who also works at Microsoft Research Asia, “claimed that a common vocabulary is not available. But their vocabularies were based on thousands of images. They claimed that those vocabularies cannot be shared between different databases. Our contribution is that we use a million photos.”

Xie agrees.

“Our main contribution is that we use a very large-scale database to study this problem,” he says. “This is the first study on a very large-scale database, so we believe the conclusions from this study are more reliable.”

To construct an “image vocabulary,” you first must create “image words.” And just as a text word is composed of a collection of elements—letters—so is an “image word.”

“For each image,” Xie explains, “we first detect the image features, by feature-point detector. That means we can find some special points, some interest points, in the image.”

A photo of a building, for example, may have image features such as windows or doors. Each feature is represented by a vector, describing its characteristics such as orientation and intensity, and a collection of vectors becomes a set of elements that, combined, form an “image word.”
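To make that concrete, here is a rough sketch of the detection step in Python. OpenCV and its SIFT detector are stand-ins chosen for the example; the article does not say which feature-point detector the project actually uses.

```python
import cv2

# Hypothetical sketch: find interest points in one photo and describe each one
# as a vector. SIFT/OpenCV are assumptions; the project's detector is not named.
img = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file name

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint carries a location, scale, and orientation; each row of
# 'descriptors' is a 128-dimensional vector describing the patch around it.
print(f"{len(keypoints)} interest points, descriptor array shape {descriptors.shape}")
```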

The “image word”—in actuality a cluster of image features—itself has no semantic meaning; it’s just a collection of data. But the specifics of those data, taken in aggregate, can be compiled and compared across a data set to locate similar collections that represent the same scene, even when it is viewed from a different angle, under different lighting conditions, or with different objects present or absent.
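Forming those clusters amounts to pooling feature vectors from many photos and grouping similar ones, with each cluster center standing in for one “image word.” The sketch below uses synthetic vectors and scikit-learn’s MiniBatchKMeans purely as an illustrative assumption about how such a vocabulary might be built; the project’s actual clustering method isn’t described in the article.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in data: pretend these are 128-dimensional feature vectors pooled
# from many photos (real systems would pool millions of descriptors).
rng = np.random.default_rng(0)
pooled_descriptors = rng.normal(size=(50_000, 128)).astype(np.float32)

# Cluster the pooled features; each cluster center acts as one "image word".
# The project reportedly found ~300,000 words useful; 1,000 keeps this toy fast.
vocab = MiniBatchKMeans(n_clusters=1_000, batch_size=10_000, random_state=0)
vocab.fit(pooled_descriptors)

# Quantize a new photo's features: each descriptor maps to its nearest image word.
new_photo_descriptors = rng.normal(size=(400, 128)).astype(np.float32)
word_ids = vocab.predict(new_photo_descriptors)  # one word id per feature
```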

Take two photos of a house. One has a car in front, one does not. But the number of windows, and their visual characteristics, remains constant. Identify the similarity of those constants, and you have a match.
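Under this view, each photo boils down to a histogram counting how often each image word appears, and the two house photos should produce similar histograms despite the car. A toy comparison, with made-up word ids:

```python
import numpy as np

VOCAB_SIZE = 1_000  # assumed toy vocabulary size


def word_histogram(word_ids, vocab_size=VOCAB_SIZE):
    """Count how often each image word occurs in one photo."""
    return np.bincount(word_ids, minlength=vocab_size).astype(np.float32)


def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


rng = np.random.default_rng(1)
house_words = rng.integers(0, VOCAB_SIZE, size=300)
# Same house, plus extra words contributed by the parked car.
house_with_car = np.concatenate([house_words, rng.integers(0, VOCAB_SIZE, size=60)])

score = cosine_similarity(word_histogram(house_words), word_histogram(house_with_car))
print(f"similarity despite the car: {score:.2f}")  # stays high: the shared words dominate
```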

“If we can map each image as a document,” Xie says, “then we can also build a similar search index for these image documents. And if we build a vocabulary for image features, we can map each image feature to an image word.”

Those image words have the advantage of shedding the complex, high-dimensional nature of the image itself, making search fast enough to be feasible. Instead of looking for a white house with three windows on a sunny day with a car parked in front, a search engine can look for a house with three windows that are, relatively speaking, equally far apart, oriented in a similar direction, and shaped alike.
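That is where the borrowed text-search machinery pays off: once every photo is a “document” of image words, an inverted index maps each word back to the photos containing it, so a query only touches photos that share its words rather than scanning the whole million-photo set. A minimal sketch with hypothetical photo ids (whether the project uses exactly this inverted-index structure is an assumption carried over from text retrieval):

```python
from collections import Counter, defaultdict

# Hypothetical index input: photo id -> image-word ids found in that photo.
photo_words = {
    "photo_001": [5, 17, 17, 42],
    "photo_002": [17, 99, 104],
    "photo_003": [5, 42, 42, 300],
}

# Inverted index: image word -> photos containing it, as in a text search engine.
inverted = defaultdict(set)
for photo_id, words in photo_words.items():
    for w in words:
        inverted[w].add(photo_id)


def search(query_words, top_k=2):
    """Rank photos by how many query words they share (a crude relevance score)."""
    votes = Counter()
    for w in query_words:
        for photo_id in inverted.get(w, ()):
            votes[photo_id] += 1
    return votes.most_common(top_k)


print(search([5, 42, 17]))  # e.g. [('photo_001', 3), ('photo_003', 2)]
```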

The project started last summer, with Jia serving as the main developer for the effort. And with the huge data set at their disposal, he and Xie began to examine the ramifications of what they had available to them.

“Our experiment,” Xie says, “showed that a vocabulary generated from a million-size data set can achieve satisfactory accuracy when applied to either larger data sets or different types of data sets.”

One interesting result of their work is the determination that, when it comes to image vocabularies, bigger is not necessarily better.

“Actually, a larger vocabulary size can lead to a larger computational cost,” Xie says. “We’re better off to use a smaller vocabulary size when it’s possible.”

The work seemed to indicate that an image vocabulary of 300,000 words is perhaps the optimal size, minimizing analysis time while maximizing accuracy.

The project grew out of earlier work Xie had performed that involved searching databases via camera phones. But that work used data sets in the tens of thousands, not enough to lend themselves to the idea of vocabulary generation and the advantages an image vocabulary could offer.

“Photo-sharing Web sites like Flickr,” he notes, “get a million new photos each day. So the most critical thing is that we want to support very large-scale image databases.”

The project has identified further work that needs to be done before such techniques can be released to the world at large.

“There are two criteria,” Xie says. “Speed and accuracy. Speedwise, this is already very quick. Each query takes just 0.1 seconds or 0.2 seconds.

“But for accuracy, it’s still not good enough for real applications. We are trying to improve the performance while keeping the speed fast enough. That’s one of our current challenges.”

Again, text retrieval can serve as a model. One example would be proximity. In text search, a word that appears close to another can help flag a potential match. So, too, could image words that appear adjacent to each other.

“That,” Xie says, “is one direction we want to go.”
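One way to picture that direction: after candidates are found through shared image words, check whether words that sit close together in the query photo also sit close together in the candidate, and promote matches where they do. The sketch below is only an assumption about how such a proximity check might look, not the project’s method:

```python
import numpy as np


def adjacent_pairs(word_positions, radius=50.0):
    """Return pairs of image words whose keypoints lie close together in one photo.

    word_positions: dict of word id -> (x, y) keypoint location in that photo.
    """
    words = list(word_positions.items())
    pairs = set()
    for i, (w1, p1) in enumerate(words):
        for w2, p2 in words[i + 1:]:
            if np.hypot(p1[0] - p2[0], p1[1] - p2[1]) <= radius:
                pairs.add(frozenset((w1, w2)))
    return pairs


def proximity_score(query_positions, candidate_positions):
    """Count word pairs adjacent in both photos, echoing text-search proximity."""
    return len(adjacent_pairs(query_positions) & adjacent_pairs(candidate_positions))


# Made-up keypoint positions (word id -> (x, y)) for a query and a candidate photo.
query = {5: (100, 40), 42: (130, 60), 17: (400, 300)}
candidate = {5: (310, 220), 42: (335, 240), 17: (40, 30)}
print(proximity_score(query, candidate))  # 1: words 5 and 42 are neighbors in both photos
```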

Another, seemingly promising path appears less likely to pay off than originally thought. Xie and Jia had hoped that they could use term weighting, a text-search concept in which the relative frequency of words provides clues to the documents in which they reside.

But as it turns out, while some text words are quite common and others are extremely rare, when it comes to image words, the distribution is much more even.

“Term weighting is still useful,” Xie says about the image-vocabulary approach, “but it’s not as useful as it is with text.”
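The text-retrieval analog in question is TF-IDF-style term weighting: down-weight image words that appear in nearly every photo and boost the rare ones. Here is a small sketch of that weighting over word histograms; the smoothed formula is an assumption for illustration, and, per the quote, the payoff is smaller for images because word frequencies are much flatter than in text:

```python
import numpy as np

# Toy corpus: each row is one photo's histogram over a 5-word vocabulary.
histograms = np.array([
    [4, 1, 0, 0, 2],
    [3, 0, 1, 0, 2],
    [5, 0, 0, 1, 3],
], dtype=np.float32)

# Inverse document frequency: image words seen in fewer photos get larger weights.
n_photos = histograms.shape[0]
doc_freq = np.count_nonzero(histograms, axis=0)
idf = np.log((n_photos + 1) / (doc_freq + 1)) + 1.0  # smoothed variant, an assumption

# Term frequency times IDF, then L2-normalize each photo's vector for cosine matching.
tfidf = histograms * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
print(np.round(tfidf, 2))
```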

That’s the way research goes. Not everything you try works, so you try something different instead. But, as Jia notes, he and Xie think they’re onto something.

“In the last 20 or 30 years,” Jia says, “researchers in computer vision have wanted to focus on how to understand the content of an image. We have viewed image retrieval as a search topic, which means we don’t care about the exact content of the image. We focus on the keywords.

“In a text-search engine, people don’t care about what the document says. They care about the keyword. We can retrieve a document very efficiently by generating a vocabulary for our image database. That’s the biggest contribution of our project.”

For Xie, intensely interested in mobile Web search, the opportunity to work with millions of images has its own rewards.

“The proudest thing for me,” he says, “is we have done very large-scale experiments on this image-vocabulary problem. We don’t see any other researchers doing similar things. And from this kind of study, we get some very interesting conclusions.”
