This page is inactive since the closure of MSR Silicon Valley (MSR-SVC) in September 2014.

The name "multi-armed bandits" comes from a whimsical scenario in which a gambler faces several slot machines, a.k.a. "one-armed bandits", that look identical at first but produce different expected winnings. The crucial issue is the trade-off between acquiring new information (exploration) and capitalizing on the information available so far (exploitation). While MAB problems have been studied extensively in machine learning, operations research, and economics, many exciting questions remain open. One aspect we are particularly interested in is modeling, and efficiently using, the various types of side information that may be available to the algorithm.

Contact: Alex Slivkins.

This is an umbrella project for several related efforts at Microsoft Research Silicon Valley that address various multi-armed bandit (MAB) formulations motivated by web search and ad placement. The MAB problem is a classical paradigm in machine learning in which an online algorithm chooses from a set of strategies in a sequence of trials so as to maximize the total payoff.
Research directions

- MAB problems with similarity information
- MAB problems in a changing environment
- Explore-exploit tradeoff in mechanism design
- Explore-exploit learning with limited resources
- Risk vs. reward trade-off in MAB
External visitors and collaborators<\/h1>\n
Prof. Sébastien Bubeck (Princeton)
Prof. Robert Kleinberg (Cornell)
Filip Radlinski (MSR Cambridge)
Prof. Eli Upfal (Brown)
Former interns

Yogi Sharma (Cornell -> Facebook; intern at MSR-SV in summer 2008)
Umar Syed (Princeton -> Google; intern at MSR-SV in summer 2008)
Shaddin Dughmi (Stanford -> USC; intern at MSR-SV in summer 2010)
Ashwinkumar Badanidiyuru (Cornell -> Google; intern at MSR-SV in summer 2012)

MAB problems with similarity information
- Multi-armed bandits in metric spaces. Robert Kleinberg, Alex Slivkins and Eli Upfal (STOC 2008).
  Abstract: We introduce a version of the stochastic MAB problem, possibly with a very large set of arms, in which the expected payoffs obey a Lipschitz condition with respect to a given metric space. The goal is to minimize regret as a function of time, both in the worst case and for 'nice' problem instances. (A toy discretization sketch follows this list.)
- Sharp dichotomies for regret minimization in metric spaces. Robert Kleinberg and Alex Slivkins (SODA 2010).
  Abstract: We focus on the connections between online learning and metric topology. The main result is that the worst-case regret is either O(log t) or at least sqrt(t), depending on whether the completion of the metric space is compact and countable. We prove a number of other dichotomy-style results, and extend them to the full-feedback (experts) version.
- Learning optimally diverse rankings over large document collections. Alex Slivkins, Filip Radlinski and Sreenivas Gollapudi (ICML 2010).
  Abstract: We present a learning-to-rank framework for web search that incorporates similarity and correlation between documents and thus, unlike prior work, scales to large document collections.
- Contextual bandits with similarity information. Alex Slivkins (COLT 2011).
  Abstract: In the 'contextual bandits' setting, in each round nature reveals a 'context' x, the algorithm chooses an 'arm' y, and the expected payoff is µ(x,y). Similarity information is expressed by a metric space over the (x,y) pairs such that µ is a Lipschitz function. Our algorithms are based on adaptive (rather than uniform) partitions of the metric space, which are adjusted to the popular and high-payoff regions.
- Multi-armed bandits on implicit metric spaces. Alex Slivkins (NIPS 2011).
  Abstract: Suppose an MAB algorithm is given a tree-based classification of arms. This tree implicitly defines a "similarity distance" between arms, but the numeric distances are not revealed to the algorithm. Our algorithm (almost) matches the best known guarantees for the setting (Lipschitz MAB) in which the distances are revealed.
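To make the role of similarity information concrete, here is a minimal sketch of the naive baseline for a Lipschitz bandit on [0,1]: fix a uniform grid of arms and run UCB1 on the grid points. This is only the uniform-discretization baseline, not the adaptive algorithms from the papers above; the payoff function mu, the noise level, the grid size, and the horizon are all hypothetical.

    # Uniform discretization for a Lipschitz bandit on [0,1] (illustrative baseline only).
    import math, random

    def mu(x):                      # hypothetical 1-Lipschitz expected-payoff function
        return 0.8 - abs(x - 0.6)

    def lipschitz_ucb(horizon=10_000, n_grid=20):
        arms = [i / (n_grid - 1) for i in range(n_grid)]   # fixed grid of arms
        counts = [0] * n_grid
        sums = [0.0] * n_grid
        for t in range(1, horizon + 1):
            if t <= n_grid:                                # try each grid point once
                i = t - 1
            else:                                          # UCB1 over the grid points
                i = max(range(n_grid),
                        key=lambda a: sums[a] / counts[a]
                                      + math.sqrt(2 * math.log(t) / counts[a]))
            reward = mu(arms[i]) + random.gauss(0, 0.1)    # noisy payoff observation
            counts[i] += 1
            sums[i] += reward
        best = max(range(n_grid), key=lambda a: sums[a] / counts[a])
        return arms[best]

The papers above improve on this baseline by choosing the partition adaptively, refining it in the popular, high-payoff regions of the space rather than spreading exploration uniformly.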
MAB problems in a changing environment
- Alex Slivkins and Eli Upfal (COLT 2008).
  Abstract: We study a version of the stochastic MAB problem in which the expected reward of each arm evolves stochastically and gradually in time, following an independent Brownian motion or a similar process. Our benchmark is a hypothetical policy that chooses the best arm in each round.
- Umar Syed, Alex Slivkins and Nina Mishra (NIPS 2009).
  Abstract: Query intent may shift over time. A classifier can use the available signals to predict a shift in intent; a bandit algorithm can then be used to find the newly relevant results. We present a meta-algorithm that combines such a classifier with a bandit algorithm in a feedback loop. (A toy 'reset on suspected shift' sketch follows this list.)
- Alex Slivkins (COLT 2011).
  Abstract: Interpreting the current time as part of the contextual information, we obtain a very general bandit framework that (in addition to similarity between arms and contexts) can include slowly changing payoffs and variable sets of arms.
- Sébastien Bubeck and Alex Slivkins (COLT 2012).
  Abstract: We present a new bandit algorithm whose regret is optimal both for adversarial rewards and for stochastic rewards, achieving square-root regret and polylogarithmic regret, respectively.
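As a rough illustration of the 'detect a shift, then re-explore' idea in the NIPS 2009 item above, the toy loop below simply resets a UCB1 learner whenever a shift signal fires. This is a simplification for intuition only, not the paper's meta-algorithm: the shift signal here is assumed to be perfectly informed, and the phase structure, phase length, and payoff probabilities are invented for the example.

    # Toy 'reset on suspected shift' loop (highly simplified; not the paper's meta-algorithm).
    import math, random

    class UCB1:
        def __init__(self, n_arms):
            self.n = [0] * n_arms
            self.s = [0.0] * n_arms
            self.t = 0
        def select(self):
            self.t += 1
            for a, cnt in enumerate(self.n):
                if cnt == 0:
                    return a                   # try each arm once
            return max(range(len(self.n)),
                       key=lambda a: self.s[a] / self.n[a]
                                     + math.sqrt(2 * math.log(self.t) / self.n[a]))
        def update(self, arm, reward):
            self.n[arm] += 1
            self.s[arm] += reward

    def run(reward_probs_by_phase, phase_len=2_000):
        bandit = UCB1(len(reward_probs_by_phase[0]))
        for phase, probs in enumerate(reward_probs_by_phase):
            if phase > 0:                      # hypothetical detector fired: intent shifted,
                bandit = UCB1(len(probs))      # so discard stale estimates and re-explore
            for _ in range(phase_len):
                arm = bandit.select()
                bandit.update(arm, 1.0 if random.random() < probs[arm] else 0.0)

    run([[0.7, 0.3, 0.4], [0.2, 0.8, 0.4]])    # made-up shift in which arm is best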
Explore-exploit tradeoff in mechanism design
- Characterizing truthful multi-armed bandit mechanisms. Moshe Babaioff, Alex Slivkins and Yogi Sharma (EC 2009).
  Abstract: We consider a multi-round auction setting motivated by pay-per-click auctions in Internet advertising, which can be viewed as a strategic version of the MAB problem. We investigate how the design of MAB algorithms is affected by the restriction of truthfulness, and show striking differences in terms of both the structure of an algorithm and its regret. (A toy explore-then-exploit sketch follows this list.)
- Truthful mechanisms with implicit payment computation. Moshe Babaioff, Robert Kleinberg and Alex Slivkins (EC 2010; Best Paper Award).
  Abstract: We show that any single-parameter, monotone allocation function can be truthfully implemented using a single call to this function. We apply this to MAB mechanisms.
- Monotone multi-armed bandit allocations. Alex Slivkins (Open question at COLT 2011).
  Abstract: The reduction in the EC 2010 paper opens up the problem of designing monotone MAB allocation rules, a new twist on the familiar MAB problem.
- Multi-parameter mechanisms with implicit payment computation. Moshe Babaioff, Robert Kleinberg and Alex Slivkins (EC 2013).
  Abstract: We generalize the main result of the EC 2010 paper to the multi-parameter setting, and apply it to a natural multi-parameter extension of MAB mechanisms.
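A stylized way to see the structural point of the EC 2009 item above: a truthful pay-per-click mechanism tends to separate bid-independent exploration from bid-based exploitation. The sketch below shows such an explore-then-exploit allocation; it is written under our own simplifying assumptions (payments are omitted entirely, and the bids, click probabilities, and round counts are made up), and it is not the mechanism from the paper.

    # Toy explore-then-exploit allocation for a pay-per-click auction (illustration only;
    # payment computation is omitted, and all bids and click probabilities are made up).
    import random

    def run_auction(bids, click_probs, explore_rounds=300, total_rounds=3_000):
        n = len(bids)
        clicks = [0] * n
        shows = [0] * n
        # Exploration phase: the schedule ignores bids and observed clicks,
        # which keeps the allocation monotone in each advertiser's bid.
        for t in range(explore_rounds):
            i = t % n
            shows[i] += 1
            clicks[i] += random.random() < click_probs[i]
        ctr_hat = [clicks[i] / shows[i] for i in range(n)]
        # Exploitation phase: allocate the remaining impressions to the advertiser
        # with the highest estimated bid * CTR (raising one's bid weakly helps).
        winner = max(range(n), key=lambda i: bids[i] * ctr_hat[i])
        winner_clicks = sum(random.random() < click_probs[winner]
                            for _ in range(total_rounds - explore_rounds))
        return winner, winner_clicks

    print(run_auction(bids=[1.0, 0.8, 1.2], click_probs=[0.05, 0.10, 0.04]))

The EC 2009 paper makes this intuition precise: truthfulness forces structural restrictions of this kind on the allocation, and those restrictions are what drive the regret differences established in the paper.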
Explore-exploit learning with limited resources
- Dynamic pricing with limited supply. Moshe Babaioff, Shaddin Dughmi, Robert Kleinberg and Alex Slivkins (EC 2012).
  Abstract: We consider dynamic pricing with limited supply and an unknown demand distribution. We extend MAB techniques to the limited-supply setting and obtain optimal regret rates. (A naive pricing sketch follows this list.)
- Bandits with Knapsacks. Ashwinkumar Badanidiyuru, Robert Kleinberg and Alex Slivkins (FOCS 2013).
  Abstract: We define a broad class of explore-exploit problems with knapsack-style resource constraints, which subsumes dynamic pricing, dynamic procurement, pay-per-click ad allocation, and many other problems. Our algorithms achieve optimal regret with respect to the optimal dynamic policy.
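To see how a resource constraint changes the problem, here is a naive baseline for dynamic pricing with limited supply: treat each candidate price as an arm, run a UCB rule on revenue per offer, and simply stop when the inventory runs out. This is only a point of contrast, not the EC 2012 or FOCS 2013 algorithm; the prices, purchase probabilities, supply, and horizon are made-up example values.

    # Naive UCB-over-prices sketch for dynamic pricing with limited supply (baseline for
    # intuition only; the algorithms in the papers above are different).
    import math, random

    def price_until_sold_out(prices, buy_probs, supply=50, horizon=5_000):
        k = len(prices)
        offers = [0] * k
        revenue = [0.0] * k
        total = 0.0
        for t in range(1, horizon + 1):
            if supply == 0:
                break                              # inventory exhausted: stop selling
            if t <= k:
                i = t - 1                          # try each price once
            else:                                  # UCB on revenue per offer,
                i = max(range(k),                  # with the bonus scaled to the price range
                        key=lambda a: revenue[a] / offers[a]
                                      + max(prices) * math.sqrt(2 * math.log(t) / offers[a]))
            sold = random.random() < buy_probs[i]
            offers[i] += 1
            if sold:
                revenue[i] += prices[i]
                total += prices[i]
                supply -= 1
        return total

    print(price_until_sold_out(prices=[2, 4, 6], buy_probs=[0.6, 0.3, 0.1]))

The papers above instead treat the resource constraint as part of the learning problem itself, rather than a mere stopping condition, and compete with the best dynamic policy, which may deliberately spread sales across prices.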
Risk vs. reward trade-off in MAB
- Michael Kapralov and Rina Panigrahy (NIPS 2011).
  Abstract: We show that it is theoretically possible to extract some reward in a bandit prediction game while incurring only an exponentially small downside risk.
- Alex Andoni and Rina Panigrahy.
  Abstract: This paper shows the role of the Hermite differential equation in the optimal risk vs. reward trade-off in prediction games.