Short-Lived High-Volume Bandits

Published Online:https://doi.org/10.1287/opre.2023.0557

Modern platforms leverage randomized experiments to make informed decisions from a given set of items (treatment arms). As a particularly challenging scenario, these items can (i) arrive in a high volume, with thousands of new items being released per hour, and (ii) have a short lifetime due to their transient nature. We study a Bayesian multiple-play bandit problem that encapsulates the key features of this scenario. In each round, a set of arms arrives. Each arm has a lifetime w and an unknown mean reward. The learner selects a multiset of n arms and receives observable rewards for each play. We aim to minimize the loss due to not knowing the reward rates. We show that if at most nρ arms arrive per round, then our policy has a O˜(nmin{ρ,12(1+1w)1}) loss on a sufficiently large class of prior distributions for the mean rewards. We complement this by showing that all policies suffer an Ω(nmin{ρ,12}) loss. We further validate the effectiveness of our policy through a large-scale field experiment on Glance, a content card service platform that faces exactly this challenge. A simple variant of our policy outperformed the current recommender at the time by 4.32% in total duration and 7.48% in total number of click-throughs.

Funding: R. Ravi is supported in part by the U.S. Office of Naval Research [Award Number N00014-21-1-2243] and the Air Force Office of Scientific Research [Award Number FA9550-23-1-0031]. A. Li is in part supported by NSF CAREER [Grant 2238489].

Supplemental Material: All supplemental materials, including the code, data, and files required to reproduce the results, are available at https://doi.org/10.1287/opre.2023.0557.

INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.