Nonstationary Reinforcement Learning: The Blessing of (More) Optimism

Wang Chi Cheung
Wang Chi Cheung
[email protected]
https://orcid.org/0000-0001-9380-5673
Department of Industrial Systems Engineering and Management, National University of Singapore, 117576 Singapore;
Search for more papers by this author
,
David Simchi-Levi
David Simchi-Levi
[email protected]
https://orcid.org/0000-0002-4650-1519
Institute for Data, Systems, and Society, Department of Civil and Environmental Engineering and Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139;
Search for more papers by this author
,
Ruihao Zhu
Corresponding Author
Ruihao Zhu
[email protected]
https://orcid.org/0000-0003-1463-1308
SC Johnson College of Business, Cornell University, Ithaca, New York 14853
Search for more papers by this author

Department of Industrial Systems Engineering and Management, National University of Singapore, 117576 Singapore;

Institute for Data, Systems, and Society, Department of Civil and Environmental Engineering and Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139;

Search for more papers by this author

Ruihao Zhu

Corresponding Author

Ruihao Zhu

[email protected]

https://orcid.org/0000-0003-1463-1308

SC Johnson College of Business, Cornell University, Ithaca, New York 14853

Search for more papers by this author

Published Online:22 Feb 2023https://doi.org/10.1287/mnsc.2023.4704

References

Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira FCN, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 24: 25th Annual Conf. Neural Inform. Processing Systems (NIPS), 2312–2320.Google Scholar
Abbasi-Yadkori Y, Bartlett PL, Kanade V, Seldin Y, Szepesvári C (2013) Online learning in Markov decision processes with adversarially chosen transition probability distributions. Burges CJC, Bottou L, Ghahramani Z, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 26: 27th Annual Conf. Neural Inform. Processing Systems (NIPS), 2508–2516.Google Scholar
Abernethy J, Amin K, Zhu R (2016) Threshold bandits, with and without censored feedback. Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R, eds. Adv. Neural Inform. Processing Systems 29: Annual Conf. Neural Inform. Processing Systems (NIPS), 4889–4897.Google Scholar
Agrawal S, Jia R (2017) Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems 26: 27th Annual Conf. Neural Inform. Processing Systems (NIPS), 1184–1194.Google Scholar
Agrawal S, Jia R (2022) Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. Oper. Res. 70(3):1646–1664.Google Scholar
Arora R, Dekel O, Tewari A (2012) Deterministic MDPs with adversarial rewards and bandit feedback. de Freitas N, Murphy KP, eds. Proc. 28th Conf. Uncertainty Artificial Intelligence, Catalina Island, CA (AUAI Press), 93–101.Google Scholar
Auer P, Cesa-Bianchi N, Fischer P (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47:235–256.Crossref, Google Scholar
Auer P, Cesa-Bianchi N, Freund Y, Schapire R (2002b) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
Balseiro S, Gur Y (2019) Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Sci. 65(9):3952–3968.Link, Google Scholar
Bartlett PL, Tewari A (2009) REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Bilmes JA, Ng AY, eds. UAI 2009 Proc. Twenty-Fifth Conf. Uncertainty Artificial Intelligence (AUAI Press), 35–42.Google Scholar
Bertsekas D (2017) Dynamic Programming and Optimal Control (Athena Scientific, Nashua, NH).Google Scholar
Besbes O, Gur Y, Zeevi A (2019) Optimal exploration-exploitation in a multi-armed-bandit problem with non-stationary rewards. Stochastic Systems 9(4):319–337.Link, Google Scholar
Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Crossref, Google Scholar
Burnetas AN, Katehakis MN (1997) Optimal adaptive policies for Markov decision processes. Math. Oper. Res. 22:222–255.Link, Google Scholar
Cai H, Ren K, Zhang W, Malialis K, Wang J, Yu Y, Guo D (2017) Real-time bidding by reinforcement learning in display advertising. de Rijke M, Shokouhi M, Tomkins A, Zhang M, eds. Proc. 10th ACM Internat. Conf. Web Search Data Mining (WSDM) (ACM, New York), 661–670.Google Scholar
Cardoso AR, Wang H, Xu H (2019) Large scale Markov decision processes with changing rewards. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Adv. Neural Inform. Processing Systems 32: Annual Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates, Inc., Red Hook, NY), 2337–2347.Google Scholar
Chen W, Shi C, Duenyas I (2020) Optimal learning algorithms for stochastic inventory systems with random capacities. Production Oper. Management 29(7):1624–1649.Crossref, Google Scholar
Chen Y, Lee C-W, Luo H, Wei C-Y (2019) A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. Beygelzimer A, Hsu D, eds. Conf. Learn. Theory, COLT (PMLR), 696–726.Google Scholar
Cheung WC, Simchi-Levi D, Zhu R (2019a) Hedging the drift: Learning to optimize under non-stationarity. Preprint, submitted March 4, https://arxiv.org/abs/1903.01461v1.Google Scholar
Cheung WC, Simchi-Levi D, Zhu R (2019b) Learning to optimize under non-stationarity. Chaudhuri K, Sugiyama M, eds. 22nd Internat. Conf. Artificial Intelligence Statist., AISTATS (PMLR), 1079–1087.Google Scholar
Cheung WC, Simchi-Levi D, Zhu R (2020) Reinforcement learning for non-stationary Markov decision processes: The blessing of (more) optimism. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR), 1843–1854.Google Scholar
Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to Algorithms (MIT Press, Cambridge, MA).Google Scholar
den Boer A, Keskin NB (2020) Dynamic pricing with demand learning and reference effects. Management Sci. 68(10):7112–7130.Google Scholar
Dick T, György A, Szepesvári C (2014) Online learning in Markov decision processes with changing cost sequences. Proc. 31st Internat. Conf. Machine Learn. (ICML) (JMLR), 512–520.Google Scholar
Even-Dar E, Kakade SM, Mansour Y (2005) Experts in a Markov decision process. Adv. Neural Inform. Processing Systems 18 (NIPS) (Curran Associates, Inc., Red Hook, NY), 401–408.Google Scholar
Flajolet A, Jaillet P (2017) Real-time bidding with side information. Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems 30: Annual Conf. Neural Inform. Processing Systems (NeurIPS), 5162–5172.Google Scholar
Fruit R, Pirotta M, Lazaric A (2018a) Near optimal exploration-exploitation in non-communicating Markov decision processes. Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems 31: Annual Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates, Inc., Red Hook, NY), 2998–3008.Google Scholar
Fruit R, Pirotta M, Lazaric A (2019) Improved analysis of UCRL2 with empirical Bernstein inequality. Preprint, submitted July 10, https://arxiv.org/abs/2007.05456.Google Scholar
Fruit R, Pirotta M, Lazaric A, Ortner R (2018b) Efficient bias-span-constrained exploration-exploitation in reinforcement learning. Dy JG, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. Proc. Machine Learn. Res. (ICML), vol. 80 (PMLR), 1573–1581.Google Scholar
Gajane P, Ortner R, Auer P (2018) A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. Preprint, submitted May 25, https://arxiv.org/abs/1805.10066.Google Scholar
Garivier A, Moulines E (2011) On upper-confidence bound policies for switching bandit problems. Kivinen J, Szepesvári C, Ukkonen E, Zeugmann T, eds. Algorithmic Learning Theory 22nd Internat. Conf. ALT 2011 (Springer, Berlin), 174–188.Google Scholar
Google (2011) The arrival of real-time bidding. Accessed February 11, 2023, https://www.rtbchina.com/wp-content/uploads/2012/03/Google-White-Paper-The-Arrival-of-Real-Time-Bidding-July-2011.pdf.Google Scholar
Guo X, Hu A, Xu R, Zhang J (2019) Learning mean-field games. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Adv. Neural Inform. Processing Systems 32: Annual Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates, Inc., Red Hook, NY), 4967–4977.Google Scholar
Gur Y, Zeevi A, Besbes O (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 27: Annual Conf. Neural Inform. Processing Systems (NIPS) (Curran Associates, Inc., Red Hook, NY), 199–207.Google Scholar
Han Y, Zhou Z, Weissman T (2020) Optimal no-regret learning in repeated first-price auctions. Preprint, submitted May 8, https://arxiv.org/abs/2003.09795v4.Google Scholar
Huh WT, Rusmevichientong P (2009) A nonparametric asymptotic analysis of inventory planning with censored demand. Math. Oper. Res. 34(1):103–123.Google Scholar
Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11:1563–1600.Google Scholar
Jin C, Jin T, Luo H, Sra S, Yu T (2020) Learning adversarial Markov decision processes with bandit feedback and unknown transition. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (ICML) (PMLR), 4860–4869.Google Scholar
Karnin Z, Anava O (2016) Multi-armed bandits: Competing with optimal sequences. Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R, eds. Adv. Neural Inform. Processing Systems 29: Annual Conf. Neural Inform. Processing Systems (NIPS) (Curran Associates, Inc., Red Hook, NY), 199–207.Google Scholar
Keskin N, Zeevi A (2016) Chasing demand: Learning and earning in a changing environments. Math. Oper. Res. 42(2):277–307.Link, Google Scholar
Keskin NB, Li M (2022) Selling quality-differentiated products in a Markovian market with unknown transition probabilities. Preprint, submitted November 28, http://dx.doi.org/10.2139/ssrn.3526568.Google Scholar
Kim MJ, Lim AEB (2016) Robust multiarmed bandit problems. Management Sci. 62(1):264–285.Google Scholar
Lattimore T, Szepesvári C (2018) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
Li Y, Zhong A, Qu G, Li N (2019) Online Markov decision processes with time-varying transition probabilities and rewards. ICML Workshop Real-World Sequential Decision Making.Google Scholar
Luo H, Wei C, Agarwal A, Langford J (2018) Efficient contextual bandits in non-stationary worlds. Bubeck S, Perchet V, Rigollet P, eds. Proc. Conf. Learn. Theory (COLT) (PMLR), 1739–1776.Google Scholar
Ma W (2018) Improvements and generalizations of stochastic knapsack and markovian bandits approximation algorithms. Math. Oper. Res. 43(3):789–812.Google Scholar
Neu G, György A, Szepesvári C, Antos A (2010) Online Markov decision processes under Bandit feedback. Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A, eds. Adv. Neural Inform. Processing Systems 23: 24th Annual Conf. Neural Inform. Processing Systems (NIPS) (Curran Associates, Inc., Red Hook, NY), 1804–1812.Google Scholar
Neu G, Gyorgy A, Szepesvari C (2012) The adversarial stochastic shortest path problem with unknown transition probabilities. Lawrence ND, Girolami MA, eds. Proc. 15th Conf. Artificial Intelligence Statist., vol. 22 (PMLR), 805–813.Google Scholar
Ortner R, Gajane P, Auer P (2019) Variational regret bounds for reinforcement learning. Globerson A, Silva R, eds. Proc. 35th Conf. Uncertainty Artificial Intelligence, UAI (AUAI Press), 81–90.Google Scholar
Qin Z, Zhu H, Ye J (2022) Reinforcement learning for ridesharing: An extended survey. Transportation Res. Part C Emerging Tech. 144:103852.Crossref, Google Scholar
Rosenberg A, Mansour Y (2019) Online convex optimization in adversarial Markov decision processes. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR), 5478–5486.Google Scholar
Sidford A, Wang M, Wu X, Ye Y (2018a) Variance reduced value iteration and faster algorithms for solving Markov decision processes. Proc. 29th Annual ACM-SIAM Sympos. Discrete Algorithms (SODA) (SIAM), 770–787.Google Scholar
Sidford A, Wang M, Wu X, Yang LF, Ye Y (2018b) Near-optimal time and sample complexities for solving discounted Markov decision process with a generative model. Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems 31: Annual Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates, Inc., Red Hook, NY), 5192–5202.Google Scholar
Simchi-Levi D, Chen X, Bramel J (2014) The Logic of Logistics: Theory, Algorithms, and Applications for Logistics and Supply Chain Management, 3rd ed. (Springer, New York).Crossref, Google Scholar
Treharne JT, Sox CR (2002) Adaptive inventory control for nonstationary demand and partial information. Management Sci. 48(5):607–624.Google Scholar
Wang M (2019) Randomized linear programming solves the Markov decision problem in nearly-linear (sometimes sublinear) running time. Math. Oper. Res. 45(2):517–546.Google Scholar
Wei C-Y, Jafarnia-Jahromi M, Luo H, Sharma H, Jain R (2020) Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. Proc. 37th Internat. Conf. Machine Learn. (ICML), vol. 119, Virtual Event (PMLR), 10170–10180.Google Scholar
Yu JY, Mannor S (2009) Online learning in Markov decision processes with arbitrarily changing rewards and transitions. Basar T, Özbay H, eds. 1st Internat. Conf. Game Theory Networks, GAMENETS (IEEE, Piscataway, NJ), 314–322.Google Scholar
Yu JY, Mannor S, Shimkin N (2009) Markov decision processes with arbitrary reward processes. Math. Oper. Res. 34:737–757.Link, Google Scholar
Yuan H, Luo Q, Shi C (2021) Marrying stochastic gradient descent with bandits: Learning algorithms for inventory systems with fixed costs. Management Sci. 67(10):6089–6115.Link, Google Scholar
Zhang H, Chao X, Shi C (2018) Closing the gap: A learning algorithm for the lost-sales inventory system with lead times. Management Sci. 66(5):1962–1980.Google Scholar
Zhang Z, Ji X (2019) Regret minimization for reinforcement learning by evaluating the optimal bias function. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Adv. Neural Inform. Processing Systems 32: Annual Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates, Inc., Red Hook, NY), 2823–2832.Google Scholar
Zhou X, Chen N, Gao X, Xiong Y (2020) Regime switching bandits. Preprint, submitted January 26, https://arxiv.org/abs/2001.09390v1.Google Scholar

Volume 69, Issue 10

October 2023

Pages 5695-6415, iii-iv

Article Information

Supplemental Material

Metrics

Information

Received:March 18, 2020
Accepted:January 22, 2023
Published Online:February 22, 2023

Cite as

Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu (2023) Nonstationary Reinforcement Learning: The Blessing of (More) Optimism. Management Science 69(10):5722-5739.

https://doi.org/10.1287/mnsc.2023.4704

Keywords

Acknowledgments

The authors thank the department editor J. George Shanthikumar, the anonymous associate editor, and three anonymous referees whose comments improved the manuscript. The authors express gratitude to Dylan Foster and Mengdi Wang as well as various seminar attendees for helpful discussions and comments. Part of the work was completed while R. Zhu was associated with Purdue Krannert School of Management.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Nonstationary Reinforcement Learning: The Blessing of (More) Optimism

References

Volume 69, Issue 10

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News