{"id":642246,"date":"2020-03-10T22:58:14","date_gmt":"2020-03-11T05:58:14","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=642246"},"modified":"2020-03-20T01:18:20","modified_gmt":"2020-03-20T08:18:20","slug":"web-crawl-scheduling","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/web-crawl-scheduling\/","title":{"rendered":"Web crawl scheduling"},"content":{"rendered":"<p>A web crawler is an essential part of a search engine that visits URLs and downloads their content, which is subsequently processed, indexed, and used by the search engine to decide which results to show the users. As the Web is becoming increasingly dynamic, in addition to discovering new web pages a crawler needs to keep revisiting those already in the search engine\u2019s index, in order to keep the index fresh by picking up the pages\u2019 changed content. The problem of crawl scheduling consists in determining when to (re-)download which pages&#8217; content so as to ensure that the search engine has a complete and up-to-date view of the Web.<\/p>\n<p>In this project, we aim to address the challenges of crawl scheduling as they are manifested in major commercial search engines such as <em>Microsoft Bing<\/em>. One such challenge is partial change observability: for most URLs, a search engine finds out whether they have changed only when it crawls them. To guess when the changes happen, and hence should be downloaded, a crawler needs a predictive model whose parameters are initially unknown. Crawlers need to learn these models while optimizing freshness-related and information-completeness-related objectives when scheduling crawls, facing a phenomenon known as the <em>exploration-exploitation tradeoff<\/em>. 
For some web pages, however, sitemap polling and other means can provide trustworthy near-instantaneous signals that the page has changed in a meaningful way, though not what the change is exactly. Even when these signals are available, crawl scheduling remains highly nontrivial because the crawler cannot react to every individual predicted or actual change. Its own infrastructure as well as hosts where web pages are located impose bandwidth constraints on the average daily number of crawls, which is usually just a fraction of the change event volume. Last but not least, <em>Bing<\/em> and other major search engines track many billions of pages with vastly different importance and change frequency characteristics. The sheer size of this constrained learning and optimization problem makes low-polynomial algorithms for it a must.<\/p>\n<p>This effort is conducted in close collaboration with Bing&#8217;s Web Platform team.<\/p>\n<p>Check out the project&#8217;s code and datasets on <a class=\"button-solid no-margin-bottom\" style=\"margin-top: 10px\" href=\"https:\/\/github.com\/microsoft\/Optimal-Freshness-Crawl-Scheduling\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A web crawler is an essential part of a search engine that visits URLs and downloads their content, which is subsequently processed, indexed, and used by the search engine to decide which results to show the users. 
As the Web is becoming increasingly dynamic, in addition to discovering new web pages a crawler needs to [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13555],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-642246","msr-project","type-msr-project","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-search-information-retrieval","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[590302,596461,642234],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"related-researchers":[{"type":"user_nicename","display_name":"Andrey Kolobov","user_id":30910,"people_section":"Section name 0","alias":"akolobov"},{"type":"guest","display_name":"Cheng Lu","user_id":642306,"people_section":"Section name 0","alias":""},{"type":"user_nicename","display_name":"Eric Horvitz","user_id":32033,"people_section":"Section name 
0","alias":"horvitz"}],"msr_research_lab":[199565],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/642246","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":6,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/642246\/revisions"}],"predecessor-version":[{"id":642318,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/642246\/revisions\/642318"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=642246"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=642246"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=642246"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=642246"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=642246"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}