Nonstationary Bandits with Habituation and Recovery Dynamics
Abstract
Many settings involve sequential decision making where a set of actions can be chosen at each time step, each action provides a stochastic reward, and the distribution for the reward provided by each action is initially unknown. However, frequent selection of a specific action may reduce the expected reward for that action, whereas abstaining from choosing an action may cause its expected reward to increase. Such nonstationary phenomena are observed in many real-world settings such as personalized healthcare adherence–improving interventions and targeted online advertising. Though finding an optimal policy for general models with nonstationarity is PSPACE-complete, we propose and analyze a new class of models called reducing or gaining unknown efficacy (ROGUE) bandits, which we show in this paper can capture these phenomena and are amenable to the design of policies with provable properties. We first present a consistent maximum likelihood approach to estimate the parameters of these models and conduct a statistical analysis to construct finite sample concentration bounds. Using this analysis, we develop and analyze two different algorithms for optimizing ROGUE models: an upper confidence bound algorithm (ROGUE-UCB) and an ɛ-greedy algorithm (ɛ-ROGUE). Our theoretical analysis shows that under proper conditions, the ROGUE-UCB and ɛ-ROGUE algorithms can achieve logarithmic in time regret, unlike existing algorithms, which result in linear regret. We conclude with a numerical experiment using real-world data from a personalized healthcare adherence–improving intervention to increase physical activity. In this intervention, the goal is to optimize the selection of messages (e.g., confidence increasing versus knowledge increasing) to send to each individual each day to increase adherence and physical activity. Our results show that ROGUE-UCB and ɛ-ROGUE perform better in terms of aggregated regret and average reward when compared with state-of-the-art algorithms, and in the context of this intervention, the use of ROGUE-UCB increases daily step counts by roughly 1,000 steps a day (about a half-mile more of walking) compared with other algorithms in a simulation experiment.

