Photo2Search: Explore the Real World via Camera Phone

By Rob Knies, Managing Editor, Microsoft Research

There’s a new restaurant in town. Wonder what people are saying about it?

Take a photo.

That handy gadget you’ve been coveting is on sale at the mall. How does its price compare to those offered elsewhere?

Snap a picture.

A new, blockbuster movie arrives at your local theater. Thumbs up, or thumbs down?

Point and shoot.

It can’t be that easy, can it? Xing Xie says that, yes, it can.

Xie, a researcher for the Web Search and Mining group within Microsoft Research Asia, is working on technology called Photo2Search, which is designed to provide information on the go for users of camera phones.

“As the old saying goes,” Xie says, “a picture is worth a thousand words.”

Maybe more, actually. Photo2Search gives users a way to search a Web-based database by using nothing more than an image captured by a cellphone equipped with a digital camera.

“This technology,” Xie says, “aims to solve the problem of mapping a physical-world object to a digital-world object. You see an object in the physical world, and you want to know the corresponding information in the digital world—for example, its price on the Web, user comments, or Web sites. There are many different solutions. You can use a bar code or radio-frequency identification. But using a picture of the object is very convenient and very easy to deploy.”

The easy part is the key. Camera phones are simple to use, but text-based search on them is not. That realization provided the late-2004 genesis for Photo2Search.

“At that time,” Xie recalls, “the idea was very simple: Use a camera phone to do a Web search. This is very interesting, because inputting images is much more convenient than inputting text queries on a small device.”

Photo2Search works like this: Seeking information about something seen, a user takes a photo of the object and sends the photo, via e-mail or Multimedia Messaging Service, to a Web-based server, which searches an image database for matches. The server then delivers database information—whether it be a Web page featuring the object in the photo or information associated with the object—to the user, who can act on the information received: read a menu, enter a gallery, book a hotel room, make a purchase.
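
For the technically curious, that round trip can be sketched in a few lines of Python. Everything here is illustrative: the endpoint name, the match() stand-in, and the toy database are assumptions, since the article doesn’t describe the server’s actual interface.

```python
# A hypothetical sketch of the Photo2Search round trip: the phone sends a
# photo to a Web-based server, the server looks it up in an image database,
# and the associated information comes back. The endpoint, the match()
# stand-in, and the toy database are assumptions, not details from the article.
from flask import Flask, request, jsonify

app = Flask(__name__)

# Toy "image database": precomputed entries mapped to Web information.
IMAGE_DB = {
    "restaurant_front": {"url": "http://example.com/reviews", "info": "user comments"},
    "gadget_box":       {"url": "http://example.com/prices",  "info": "price comparison"},
}

def match(photo_bytes):
    """Stand-in for the real matcher: a production system would extract
    visual features here and query a high-dimensional index."""
    return "restaurant_front"

@app.route("/search", methods=["POST"])
def search():
    photo = request.files["photo"].read()   # arrives via e-mail or MMS gateway
    key = match(photo)                      # find the closest database image
    return jsonify(IMAGE_DB.get(key, {"info": "no match found"}))

if __name__ == "__main__":
    app.run()
```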

Sounds simple, right? The devil, as always, is in the details.

When Xie, who collaborated on the Photo2Search technology with Microsoft Research Asia colleagues and a handful of visiting students, first developed his concept of image-based Web search via camera-phone photos, most prior work in the area was based on content-based image retrieval (CBIR), which indexes images by features such as color, texture, shape, object layout, and edge direction. Such approaches, though computationally demanding, can identify photos that share some visual features, but they don’t necessarily excel at locating ones that picture the same prominent object or scene as a query image.
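
A toy example makes the idea, and its limits, concrete. The sketch below computes one classic CBIR feature, a global color histogram, and compares two of them; the bin count and the use of Pillow and NumPy are assumptions for illustration, not details of the systems Xie’s team evaluated.

```python
# One classic CBIR feature: a global color histogram. Pillow/NumPy and the
# 8-bins-per-channel quantization are illustrative assumptions.
import numpy as np
from PIL import Image

def color_histogram(path, bins=8):
    """Return a normalized 512-dimensional RGB color histogram."""
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def similarity(h1, h2):
    """Histogram intersection: 1.0 means identical color distributions."""
    return float(np.minimum(h1, h2).sum())
```

Two photos with similar overall color can score highly under such a measure even when they picture entirely different objects, which hints at why CBIR struggles with the same-object test.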

“We found,” Xie says, “the precision of CBIR is not sufficient for practical use.”

Then he turned to computer-vision techniques, but again, challenges arose.

“We found speed is a very big concern,” he says. “Most computer-vision algorithms are slow.”

In the second half of 2005, Xie and colleagues rebuilt the system, basing image matching on well-known computer-vision algorithms that extract distinctive features from images. That choice proved productive, yielding an efficient high-dimensional index that can search a large image database and return results quickly: it combs through a collection of 6,000 images and delivers matches in a mere three seconds on an ordinary laptop. At that point, the process began to enter the realm of the practical.
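
The article doesn’t name the algorithms, so take the following Python sketch as a plausible reconstruction rather than the team’s actual pipeline: it extracts local invariant features with SIFT, a representative well-known algorithm, and pools every database descriptor into one nearest-neighbor index that query images can vote against.

```python
# A hedged reconstruction of the rebuilt pipeline: SIFT and the brute-force
# matcher are representative stand-ins, not the algorithms named by the team.
import cv2
import numpy as np

sift = cv2.SIFT_create()

def descriptors(path):
    """Extract 128-dimensional SIFT descriptors from an image on disk."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(gray, None)
    return desc

# Build the index: stack every database image's descriptors and remember
# which image each descriptor came from.
db_paths = ["img0.jpg", "img1.jpg"]          # placeholder database images
all_desc, owners = [], []
for i, path in enumerate(db_paths):
    desc = descriptors(path)
    all_desc.append(desc)
    owners.extend([i] * len(desc))
matcher = cv2.BFMatcher(cv2.NORM_L2)         # brute-force stand-in for the
matcher.add([np.vstack(all_desc)])           # team's faster high-dim index

def query(photo_path):
    """Each query descriptor votes for the database image it matches best."""
    votes = np.zeros(len(db_paths))
    for m in matcher.match(descriptors(photo_path)):
        votes[owners[m.trainIdx]] += 1
    return db_paths[int(votes.argmax())]
```

A brute-force matcher stands in here for the efficient high-dimensional index the team built; an approximate-nearest-neighbor structure in its place is what would make the three-second figure plausible at larger scales.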

The searchable database still needs to be a predefined collection of images, but they can be harvested from the Web. Manual annotation and organization are then employed to enhance performance.

The promise of Xie’s technology is significant. The burgeoning consumer adoption of camera phones opens the door to much richer Web queries, built not just on text but on images, voice, even video.

In a paper entitled “Photo-to-Search: Using Camera Phones to Inquire of the Surrounding World,” to be delivered in Japan in May during the upcoming seventh International Conference on Mobile Data Management, Xie and co-authors Mingjing Li and Wei-Ying Ma, both of Microsoft Research Asia, and Menglei Jia and Xin Fan of the University of Science and Technology of China, underscore how important camera phones could become in searching via mobile devices.

“The value of camera phones on daily information acquisition has not been sufficiently recognized by the wireless industry and researchers,” the authors state. “With necessary technologies, they [could] become a powerful tool to acquire … information [about] the surrounding world on the go.”

In an increasingly device-filled world, that could prove a boon to millions.

“The coolest thing,” Xie says, “is that you can use a pure image as a query, with no text. That is a totally new search experience.”

And further refinements are forthcoming.

“We will continue to work on efficiency and to support larger databases,” Xie says. “We hope, in the future, when a user submits a photo to, for example, MSN Spaces®, we can quickly figure out the latitude and the longitude of that photo by using our technology.”
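
As a hedged sketch of that idea: if every image in the database carries a geotag, locating a new photo reduces to returning the coordinates of its best match. The table and helper below are hypothetical stand-ins, not the team’s design.

```python
# Hypothetical geotag table mapping database images to coordinates.
GEOTAGS = {
    "img0.jpg": (47.6062, -122.3321),   # made-up latitude/longitude pairs
    "img1.jpg": (31.2304, 121.4737),
}

def locate(photo_path, best_match):
    """best_match: any function mapping a photo to its closest database image,
    such as the query() sketch above. Returns that image's (lat, lon)."""
    return GEOTAGS[best_match(photo_path)]
```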

It remains to be seen how Photo2Search might be incorporated into an upcoming product, but the possibilities are intriguing.

“There is still a lot to do to make this technology into products,” Xie cautions, adding, “If we can make it practical, then this is a big contribution for both local-search and mobile-search products.”
