Diving in the Deep End of the Web

By Suzanne Ross, Writer, Microsoft Research

The Web is more complex than it seems on the surface. Beneath the pages we see in our daily surfing lies a hidden Web of structured information, dynamically generated by online databases, that isn't easy to access or crawl.

Researchers from Microsoft Research Asia are developing data-mining techniques that they hope will make it easier to search the backend databases of multiple Web sites simultaneously. This helps average Web users by giving them search results that combine information from several sites, and it helps Web site owners by exposing their content to more searchers.

“There are three major technical challenges to searching the deep Web: crawling, semantic understanding, and information integration. We’re working on techniques to solve these three problems,” said Wei-Ying Ma, the research manager for the Web Search and Mining group.

A global schema for a specific domain enables the data-mining engine to perform information extraction and integration. With it, a Web page can be dynamically produced to show you results from multiple sites. Suppose you're looking for a camera or a book. Normally you'd query a search engine, and your results would be a list of sites that might have the item you want, alongside plenty of products that you don't.

A single Web page, automatically produced from a single query, would let you compare the price, location, specs, or any other attributes of similar items without visiting multiple Web pages.
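To make that concrete, here is a minimal sketch in Python of what such a domain-wide global schema and result-merging step might look like. The attribute names, site names, and sample records are illustrative assumptions, not the researchers' actual schema.

```python
# A minimal sketch of a global schema for the "books" domain and a merge
# step that combines records from several sites into one comparison list.
# Attribute names, site names, and records are illustrative assumptions.

GLOBAL_BOOK_SCHEMA = ["title", "author", "price", "publisher"]

# Records as each site might return them, already mapped to the global schema.
site_results = {
    "site-a.example": [
        {"title": "Deep Web Mining", "author": "J. Doe", "price": 29.95, "publisher": "Acme"},
    ],
    "site-b.example": [
        {"title": "Deep Web Mining", "author": "J. Doe", "price": 24.50, "publisher": "Acme"},
    ],
}

def merge_results(results_by_site):
    """Flatten per-site records into a single list for side-by-side comparison."""
    merged = []
    for site, records in results_by_site.items():
        for record in records:
            row = {attr: record.get(attr) for attr in GLOBAL_BOOK_SCHEMA}
            row["source"] = site
            merged.append(row)
    # Cheapest offer first; records with no price sort to the end.
    return sorted(merged, key=lambda r: r["price"] if r["price"] is not None else float("inf"))

for row in merge_results(site_results):
    print(row["source"], row["title"], row["price"])
```

The hard part, of course, is getting each site's results into the global schema in the first place, which is what the schema-discovery work described next addresses.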

To do this, the researchers have studied ways to automatically discover the schema of Web sites, including the query interface schema and the result schema, and to map these site-dependent schemas to the global schema. Others have attempted this with limited success. A common method is to match query interfaces by identifying attribute labels from the text surrounding page elements. However, there aren't always attribute labels to identify, and the surrounding text isn't always descriptive.
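That label-matching approach can be sketched as follows: scrape the text labels beside a search form's fields and map each one to a global attribute through a synonym table. The synonym lists and example labels below are assumptions for illustration; as noted above, the method fails when a label is missing or uninformative.

```python
# A sketch of label-based query-interface matching: map the text labels
# beside a site's search-form fields onto global schema attributes.
# The synonym table and example labels are illustrative assumptions.

SYNONYMS = {
    "title":  {"title", "book title", "name"},
    "author": {"author", "writer", "written by"},
    "price":  {"price", "cost"},
}

def match_label(label):
    """Return the global attribute a form label most likely denotes, or None."""
    normalized = label.strip().lower()
    for attribute, variants in SYNONYMS.items():
        if normalized in variants:
            return attribute
    return None  # no match: the failure case described above

# Labels scraped from a hypothetical site's search form.
for label in ["Book Title", "Writer", "Keywords"]:
    print(label, "->", match_label(label))
# "Keywords" maps to nothing, which is exactly where label matching
# alone breaks down.
```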

The researchers began by categorizing Web sites into domains. Each domain covers the same type of product or the same type of information: sites that sell books are in one domain; sites that carry information on the same topic, such as jobs, are in another.

Each domain has common, identifiable attributes that partially represent the data objects in the backend databases. Sites within a domain differ only slightly, in the names of those attributes and in how many of them appear.

To study the deep Web, the researchers send sample queries to a site, retrieving portions of the underlying database content for analysis. Data records and elements are extracted from the result pages and, through statistical analysis, used to identify the mapping functions between the query interface schema, the result schema, and the global schema.
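One way to picture that probing step, as a simplified sketch: issue sample queries whose attribute values are already known, then count which result column those known values land in. The probe data and vote counting below are illustrative assumptions, not the group's actual statistical method.

```python
# A simplified sketch of schema mapping by query probing: issue sample
# queries with known attribute values, then see which result column the
# known values appear in. Probe data and scoring are illustrative
# assumptions, not the researchers' actual statistical analysis.

from collections import Counter

# Values known to belong to specific global attributes.
PROBES = {"author": ["J. Doe", "A. Smith"], "title": ["Deep Web Mining"]}

def probe_site(query):
    """Stand-in for querying a real site; returns result rows of unlabeled columns."""
    fake_records = [
        ("Deep Web Mining", "J. Doe", "29.95"),
        ("Another Book", "A. Smith", "12.00"),
    ]
    return [row for row in fake_records if query in row]

def infer_column_mapping():
    """Count, per result column, how often each attribute's known values appear there."""
    votes = Counter()
    for attribute, values in PROBES.items():
        for value in values:
            for row in probe_site(value):
                for column, cell in enumerate(row):
                    if cell == value:
                        votes[(column, attribute)] += 1
    # Keep the best-supported attribute for each column.
    mapping = {}
    for (column, attribute), count in votes.items():
        current = mapping.get(column)
        if current is None or count > votes[(column, current)]:
            mapping[column] = attribute
    return mapping

print(infer_column_mapping())  # e.g. {1: 'author', 0: 'title'}
```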

Automatic understanding of the hidden Web would make it possible to bring together disparate parts of a growing source of information.
