Nonstationary Bandits with Habituation and Recovery Dynamics

Yonatan Mintz (Corresponding Author)
https://orcid.org/0000-0002-0670-1794
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332

Anil Aswani (Corresponding Author)
Department of Industrial Engineering and Operations Research, University of California, Berkeley, Berkeley, California 94720

Philip Kaminsky (Corresponding Author)
https://orcid.org/0000-0002-3079-0299
Department of Industrial Engineering and Operations Research, University of California, Berkeley, Berkeley, California 94720

Elena Flowers (Corresponding Author)
Department of Physiological Nursing, School of Nursing, University of California, San Francisco, San Francisco, California 94143

Yoshimi Fukuoka (Corresponding Author)
Department of Physiological Nursing & Institute for Health & Aging, School of Nursing, University of California, San Francisco, San Francisco, California 94143

Published Online: 9 Jul 2020
https://doi.org/10.1287/opre.2019.1918

Abstract

Many settings involve sequential decision making where a set of actions can be chosen at each time step, each action provides a stochastic reward, and the distribution for the reward provided by each action is initially unknown. However, frequent selection of a specific action may reduce the expected reward for that action, whereas abstaining from choosing an action may cause its expected reward to increase. Such nonstationary phenomena are observed in many real-world settings such as personalized healthcare adherence–improving interventions and targeted online advertising. Though finding an optimal policy for general models with nonstationarity is PSPACE-complete, we propose and analyze a new class of models called reducing or gaining unknown efficacy (ROGUE) bandits, which we show in this paper can capture these phenomena and are amenable to the design of policies with provable properties. We first present a consistent maximum likelihood approach to estimate the parameters of these models and conduct a statistical analysis to construct finite sample concentration bounds. Using this analysis, we develop and analyze two different algorithms for optimizing ROGUE models: an upper confidence bound algorithm (ROGUE-UCB) and an ɛ-greedy algorithm (ɛ-ROGUE). Our theoretical analysis shows that under proper conditions, the ROGUE-UCB and ɛ-ROGUE algorithms can achieve logarithmic in time regret, unlike existing algorithms, which result in linear regret. We conclude with a numerical experiment using real-world data from a personalized healthcare adherence–improving intervention to increase physical activity. In this intervention, the goal is to optimize the selection of messages (e.g., confidence increasing versus knowledge increasing) to send to each individual each day to increase adherence and physical activity. 
Our results show that ROGUE-UCB and ɛ-ROGUE perform better in terms of aggregated regret and average reward when compared with state-of-the-art algorithms, and in the context of this intervention, the use of ROGUE-UCB increases daily step counts by roughly 1,000 steps a day (about a half-mile more of walking) compared with other algorithms in a simulation experiment.
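
The habituation and recovery effect described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions: the update rule in `step_efficacy`, the parameters `decay`, `recover`, and `base`, and the two fixed policies are hypothetical choices made for illustration, not the ROGUE model or the ROGUE-UCB and ɛ-ROGUE algorithms from the paper. For simplicity it tracks expected rewards only and does not sample stochastic rewards or estimate unknown parameters.

```python
def step_efficacy(x, pulled, decay=0.5, recover=0.2, base=1.0):
    """Toy dynamics (an assumption, not the paper's model).

    If the arm was pulled, its expected reward shrinks (habituation).
    If it rested, its expected reward moves back toward `base` (recovery).
    """
    if pulled:
        return x * decay
    return x + recover * (base - x)


def average_expected_reward(policy, horizon=200, n_arms=2):
    """Run a fixed policy and return the mean expected reward per step."""
    efficacy = [1.0] * n_arms  # every arm starts fully effective
    total = 0.0
    for t in range(horizon):
        arm = policy(t)
        total += efficacy[arm]
        # Update all arms: the chosen one habituates, the others recover.
        efficacy = [step_efficacy(x, i == arm) for i, x in enumerate(efficacy)]
    return total / horizon


def always_first(t):
    """Hypothetical policy: select arm 0 at every step."""
    return 0


def round_robin(t):
    """Hypothetical policy: alternate between arm 0 and arm 1."""
    return t % 2


if __name__ == "__main__":
    print("always arm 0:", average_expected_reward(always_first))
    print("round robin: ", average_expected_reward(round_robin))
```

Under these toy dynamics, repeatedly selecting the same arm drives its expected reward toward zero, while alternating gives each arm time to recover. This is the kind of behavior that a policy assuming stationary rewards cannot exploit.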

Volume 68, Issue 5

September-October 2020

Pages iii-vi, 1285-1624, C2-C3

Article Information

Received: December 20, 2017
Accepted: June 28, 2019
Published Online: July 09, 2020

Cite as

Yonatan Mintz, Anil Aswani, Philip Kaminsky, Elena Flowers, Yoshimi Fukuoka (2020) Nonstationary Bandits with Habituation and Recovery Dynamics. Operations Research 68(5):1493-1516.

https://doi.org/10.1287/opre.2019.1918
