Collaborative Exploration and Exploitation in Massively Multi-Player Bandits

  • Hiba Dakdouk,
  • Raphaël Féraud,
  • Nadège Varsier,
  • Patrick Maillé,
  • Romain Laroche


In this paper, we propose an approach to optimize the performance of Internet of Things (IoT) networks. We formulate the optimization problem as a massive multi-player multi-armed bandit problem, where the devices are the players and the radio channels are the arms, with collisions possibly preventing message reception. To handle a realistic IoT network, we assume neither that sensing information is available (i.e., that collisions are observed) nor that the number of players is smaller than the number of arms. As the optimization problem is intractable, we propose two greedy policies: the first focuses on the number of successful communications, while the second also takes fairness between players into account. To implement an approximation of the targeted policies, we propose an explore-then-exploit approach and establish a regret lower bound in Ω(T^(2/3) log T N + K^(3/2)). For estimating the mean rewards of arms, we propose a decentralized exploration algorithm with controlled information exchange between players. We then show that the regret of the estimated target policy is optimal with respect to the time horizon T. Finally, we provide experimental evidence that the proposed algorithms outperform several baselines.
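To make the setting concrete, a greedy allocation that maximizes the expected number of successful communications can be sketched as follows. This is only an illustrative toy, not the paper's algorithm: the collision model (with m players on a channel of mean μ, each succeeds with probability μ·q^(m−1), for a hypothetical collision factor q) and the function name `greedy_assignment` are assumptions introduced here for the example.

```python
def greedy_assignment(means, n_players, q=0.5):
    """Greedily assign n_players to channels to maximize the expected
    number of successful communications.

    Illustrative assumption (not the paper's model): with m players on a
    channel of mean reward mu, each succeeds with probability mu * q**(m-1),
    so the channel's expected number of successes is m * mu * q**(m-1).
    """
    def channel_value(mu, m):
        # expected successes on one channel carrying m players
        return m * mu * q ** (m - 1)

    load = [0] * len(means)  # number of players assigned to each channel
    for _ in range(n_players):
        # add the next player to the channel with the largest marginal gain
        best = max(
            range(len(means)),
            key=lambda k: channel_value(means[k], load[k] + 1)
                        - channel_value(means[k], load[k]),
        )
        load[best] += 1
    return load
```

Note that with more players than channels (the massive regime considered in the paper), the marginal gain of crowding a good channel eventually turns negative under collisions, which is why a greedy allocation spreads players over several channels rather than concentrating them on the best arm.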