Model-Free Nonstationary Reinforcement Learning: Near-Optimal Regret and Applications in Multiagent Reinforcement Learning and Inventory Control

Weichao Mao
Weichao Mao
[email protected]
https://orcid.org/0000-0001-8301-4173
Department of Electrical and Computer Engineering & Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Urbana, Illinois 61801
Search for more papers by this author
,
Kaiqing Zhang
Kaiqing Zhang
[email protected]
https://orcid.org/0000-0002-7446-7581
Department of Electrical and Computer Engineering & Institute for Systems Research, University of Maryland, College Park, Maryland 20740
Search for more papers by this author
,
Ruihao Zhu
Corresponding Author
Ruihao Zhu
[email protected]
https://orcid.org/0000-0003-1463-1308
Cornell SC Johnson College of Business & Nolan School of Hotel Administration, Ithaca, New York 14853
Search for more papers by this author
,
David Simchi-Levi
David Simchi-Levi
[email protected]
https://orcid.org/0000-0002-4650-1519
Institute for Data, Systems, and Society, Department of Civil and Environmental Engineering, Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Search for more papers by this author
,
Tamer Başar
Tamer Başar
[email protected]
https://orcid.org/0000-0003-4406-7875
Department of Electrical and Computer Engineering & Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Urbana, Illinois 61801
Search for more papers by this author

Department of Electrical and Computer Engineering & Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Urbana, Illinois 61801

Search for more papers by this author

Kaiqing Zhang

[email protected]

https://orcid.org/0000-0002-7446-7581

Department of Electrical and Computer Engineering & Institute for Systems Research, University of Maryland, College Park, Maryland 20740

Search for more papers by this author

Ruihao Zhu

Corresponding Author

Ruihao Zhu

[email protected]

https://orcid.org/0000-0003-1463-1308

Cornell SC Johnson College of Business & Nolan School of Hotel Administration, Ithaca, New York 14853

Search for more papers by this author

David Simchi-Levi

[email protected]

https://orcid.org/0000-0002-4650-1519

Institute for Data, Systems, and Society, Department of Civil and Environmental Engineering, Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Search for more papers by this author

Tamer Başar

[email protected]

https://orcid.org/0000-0003-4406-7875

Department of Electrical and Computer Engineering & Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Urbana, Illinois 61801

Search for more papers by this author

Published Online:14 May 2024https://doi.org/10.1287/mnsc.2022.02533

References

Agarwal A, Henaff M, Kakade S, Sun W (2020) PC-PG: Policy cover directed exploration for provable policy gradient learning. Preprint, submitted July 16, https://arxiv.org/abs/2007.08459.Google Scholar
Agrawal S, Jia R (2019) Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. Proc. 2019 ACM Conf. Econom. Comput. (Association for Computing Machinery, New York), 743–744.Google Scholar
Allesiardo R, Féraud R, Maillard O-A (2017) The non-stationary stochastic multi-armed bandit problem. Internat. J. Data Sci. Anal. 3(4):267–283.Crossref, Google Scholar
Arora R, Dekel O, Tewari A (2012) Deterministic MDPs with adversarial rewards and bandit feedback. Proc. Twenty-Eighth Conf. Uncertainty Artificial Intelligence (AUAI Press, Arlington, VA), 93–101.Google Scholar
Arslan G, Yüksel S (2016) Decentralized Q-learning for stochastic teams and games. IEEE Trans. Automat. Control 62(4):1545–1558.Crossref, Google Scholar
Auer P, Gajane P, Ortner R (2019) Adaptively tracking the best bandit arm with an unknown number of distribution changes. Proc. Thirty-Second Conf. Learn. Theory, vol. 99 (PMLR, New York), 138–158.Google Scholar
Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
Bai Y, Xie T, Jiang N, Wang Y-X (2019) Provably efficient Q-learning with low switching cost. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 8004–8013.Google Scholar
Balseiro SR, Gur Y (2019) Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Sci. 65(9):3952–3968.Link, Google Scholar
Besbes O, Gur Y, Zeevi A (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. Proc. 27th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 199–207.Google Scholar
Besbes O, Gur Y, Zeevi A (2019) Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards. Stochastic Systems 9(4):319–337.Link, Google Scholar
Birge JR, Chen H, Keskin NB, Ward A (2024) To interfere or not to interfere: Information revelation and price-setting incentives in a multiagent learning environment. Oper. Res., ePub ahead of print March 27, https://doi.org/10.1287/opre.2023.0363.Link, Google Scholar
Cai H, Ren K, Zhang W, Malialis K, Wang J, Yu Y, Guo D (2017) Real-time bidding by reinforcement learning in display advertising. Proc. Tenth ACM Internat. Conf. Web Search Data Mining (Association for Computing Machinery, New York), 661–670.Google Scholar
Chawla S, Devanur NR, Karlin AR, Sivan B (2016) Simple pricing schemes for consumers with evolving values. Proc. Twenty-Seventh Annual ACM-SIAM Sympos. Discrete Algorithms (Society for Industrial and Applied Mathematics, Philadelphia), 1476–1490.Google Scholar
Chen N, Wang C, Wang L (2020a) Learning and optimization with seasonal patterns. Preprint, submitted May 16, https://arxiv.org/abs/2005.08088.Google Scholar
Chen Y, Lee C-W, Luo H, Wei C-Y (2019) A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. Proc. Thirty-Second Conf. Learn. Theory, vol. 99 (PMLR, New York), 696–726.Google Scholar
Chen C, Wei H, Xu N, Zheng G, Yang M, Xiong Y, Xu K, Li Z (2020b) Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control. AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 3414–3421.Google Scholar
Cheung WC, Simchi-Levi D, Zhu R (2019a) Hedging the drift: Learning to optimize under non-stationarity. Preprint, submitted March 4, https://arxiv.org/abs/1903.01461.Google Scholar
Cheung WC, Simchi-Levi D, Zhu R (2019b) Learning to optimize under non-stationarity. Proc. Twenty-Second Internat. Conf. Artificial Intelligence Statist., vol. 89 (PMLR, New York), 1079–1087.Google Scholar
Cheung WC, Simchi-Levi D, Zhu R (2019c) Non-stationary reinforcement learning: The blessing of (more) optimism. Preprint, submitted May 23, https://dx.doi.org/10.2139/ssrn.3397818.Google Scholar
Cheung WC, Simchi-Levi D, Zhu R (2020) Reinforcement learning for non-stationary Markov decision processes: The blessing of (more) optimism. Preprint, submitted June 24, https://arxiv.org/abs/2006.14389.Google Scholar
Daskalakis C, Foster DJ, Golowich N (2020) Independent policy gradient methods for competitive reinforcement learning. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 5527–5540.Google Scholar
Dick T, Gyorgy A, Szepesvari C (2014) Online learning in Markov decision processes with changing cost sequences. Proc. 31st Internat. Conf. Machine Learn. (PMLR, New York), 512–520.Google Scholar
Domingues OD, Ménard P, Pirotta M, Kaufmann E, Valko M (2021) A kernel-based approach to non-stationary reinforcement learning in metric spaces. Proc. 40th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 3538–3546.Google Scholar
Fei Y, Yang Z, Wang Z, Xie Q (2020) Dynamic regret of policy optimization in non-stationary environments. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 6743–6754.Google Scholar
Gajane P, Ortner R, Auer P (2018) A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. Preprint, submitted May 25, https://arxiv.org/abs/1805.10066.Google Scholar
Gao M, Xie T, Du SS, Yang LF (2021) A provably efficient algorithm for linear Markov decision process with low switching cost. Preprint, submitted January 2, https://arxiv.org/abs/2101.00494.Google Scholar
Garivier A, Moulines E (2011) On upper-confidence bound policies for switching bandit problems. Proc. 22nd Internat. Conf. Algorithmic Learn. Theory (Springer-Verlag, Berlin, Heidelberg), 174–188.Google Scholar
Huh WT, Rusmevichientong P (2009) A nonparametric asymptotic analysis of inventory planning with censored demand. Math. Oper. Res. 34(1):103–123.Link, Google Scholar
Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(51):1563–1600.Google Scholar
Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 4868–4878.Google Scholar
Jin C, Jin T, Luo H, Sra S, Yu T (2019) Learning adversarial MDPs with bandit feedback and unknown transition. Preprint, submitted December 3, https://arxiv.org/abs/1912.01192.Google Scholar
Karnin ZS, Anava O (2016) Multi-armed bandits: Competing with optimal sequences. Proc. 30th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 199–207.Google Scholar
Keskin NB, Zeevi A (2017) Chasing demand: Learning and earning in a changing environment. Math. Oper. Res. 42(2):277–307.Link, Google Scholar
Lee C-W, Luo H, Wei C-Y, Zhang M (2020) Linear last-iterate convergence for matrix games and stochastic games. Preprint, submitted June 16, https://arxiv.org/abs/2006.09517v1.Google Scholar
Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. Proc. Eleventh Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers Inc., San Francisco), 157–163.Google Scholar
Lu J, Yang C, Gao X, Wang L, Li C, Chen G (2019) Reinforcement learning with sequential information clustering in real-time bidding. Proc. 28th ACM Internat. Conf. Inform. Knowledge Management (Association for Computing Machinery, New York), 1633–1641.Google Scholar
Luo H, Wei C-Y, Agarwal A, Langford J (2018) Efficient contextual bandits in non-stationary worlds. Proc. 31st Conf. Learn. Theory (PMLR, New York), 1739–1776.Google Scholar
Lykouris T, Simchowitz M, Slivkins A, Sun W (2019) Corruption robust exploration in episodic reinforcement learning. Preprint, submitted November 20, https://arxiv.org/abs/1911.08689.Google Scholar
Ma W (2018) Improvements and generalizations of stochastic knapsack and Markovian bandits approximation algorithms. Math. Oper. Res. 43(3):789–812.Link, Google Scholar
Mao W, Zhang K, Xie Q, Başar T (2020) POLY-HOOT: Monte-Carlo planning in continuous space MDPs with non-asymptotic analysis. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 4549–4559.Google Scholar
Mao W, Zhang K, Zhu R, Simchi-Levi D, Basar T (2021) Near-optimal model-free reinforcement learning in non-stationary episodic MDPs. Proc. 38th Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 7447–7458.Google Scholar
Misra D, Henaff M, Krishnamurthy A, Langford J (2020) Kinematic state abstraction and provably efficient rich-observation reinforcement learning. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, New York), 6961–6971.Google Scholar
Neu G, Antos A, György A, Szepesvári C (2010) Online Markov decision processes under bandit feedback. Proc. 23rd Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates Inc., Red Hook, NY), 1804–1812.Google Scholar
Ortner R, Gajane P, Auer P (2020) Variational regret bounds for reinforcement learning. Proc. 35th Uncertainty Artificial Intelligence Conf., vol. 115 (PMLR, New York), 81–90.Google Scholar
Osband I, Van Roy B (2016) On lower bounds for regret in reinforcement learning. Preprint, submitted August 9, https://arxiv.org/abs/1608.02732.Google Scholar
Padakandla S (2020) A survey of reinforcement learning algorithms for dynamically varying environments. Preprint, submitted May 19, https://arxiv.org/abs/2005.10619.Google Scholar
Radanovic G, Devidze R, Parkes D, Singla A (2019) Learning to collaborate in Markov decision processes. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 5261–5270.Google Scholar
Roughgarden T (2009) Intrinsic robustness of the price of anarchy. Proc. 41st Annual ACM Sympos. Theory Comput. (Association for Computing Machinery, New York), 513–522.Google Scholar
Shi C, Chen W, Duenyas I (2016) Nonparametric data-driven algorithms for multiproduct inventory systems with censored demand. Oper. Res. 64(2):362–370.Link, Google Scholar
Shortreed SM, Laber E, Lizotte DJ, Stroup TS, Pineau J, Murphy SA (2011) Informing sequential clinical decision-making through reinforcement learning: An empirical study. Machine Learn. 84(1–2):109–136.Crossref, Google Scholar
Syrgkanis V, Agarwal A, Luo H, Schapire RE (2015) Fast convergence of regularized learning in games. Proc. 28th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 2989–2997.Google Scholar
Tekin C, Liu M (2010) Online algorithms for the multi-armed bandit problem with Markovian rewards. Proc. 48th Annual Allerton Conf. Comm. Control Comput. (Allerton) (IEEE, Piscataway, NJ), 1675–1682.Google Scholar
Tian Y, Wang Y, Yu T, Sra S (2021) Online learning in unknown Markov games. Proc. 38th Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 10279–10288.Google Scholar
Touati A, Vincent P (2020) Efficient learning in non-stationary linear Markov decision processes. Preprint, submitted October 24, https://arxiv.org/abs/2010.12870.Google Scholar
Wang J, Liu Y, Li B (2020) Reinforcement learning with perturbed rewards. Proc. 34th AAAI Conf. Artificial Intelligence, vol. 34, no. 04 (AAAI Press, Palo Alto, CA), 6202–6209.Google Scholar
Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, Cambridge, UK.Google Scholar
Wei C-Y, Luo H (2021) Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Preprint, submitted February 10, https://arxiv.org/abs/2102.05406.Google Scholar
Yadkori YA, Bartlett PL, Kanade V, Seldin Y, Szepesvári C (2013) Online learning in Markov decision processes with adversarially chosen transition probability distributions. Proc. 26th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 2508–2516.Google Scholar
Yu JY, Mannor S (2009) Online learning in Markov decision processes with arbitrarily changing rewards and transitions. Proc. First Internat. Conf. Game Theory Networks (IEEE, Piscataway, NJ), 314–322.Google Scholar
Yu Y, Chen X, Zhang F (2015) Dynamic capacity management with general upgrading. Oper. Res. 63(6):1372–1389.Link, Google Scholar
Yuan H, Luo Q, Shi C (2021) Marrying stochastic gradient descent with bandits: Learning algorithms for inventory systems with fixed costs. Management Sci. 67(10):6089–6115.Link, Google Scholar
Zhang H, Chao X, Shi C (2019) Closing the gap: A learning algorithm for the lost-sales inventory system with lead times. Management Sci. 66(5):1962–1980.Link, Google Scholar
Zhang Z, Zhou Y, Ji X (2020) Almost optimal model-free reinforcement learning via reference-advantage decomposition. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 15198–15207.Google Scholar
Zhao P, Zhang L, Jiang Y, Zhou Z-H (2020) A simple approach for non-stationary linear bandits. Proc. Twenty Third Internat. Conf. Artificial Intelligence Statist., vol. 108 (PMLR, New York), 746–755.Google Scholar
Zhou H, Chen J, Varshney LR, Jagmohan A (2020a) Nonstationary reinforcement learning with linear function approximation. Preprint, submitted October 8, https://arxiv.org/abs/2010.04244.Google Scholar
Zhou X, Xiong Y, Chen N, Gao X (2020b) Regime switching bandits. Preprint, submitted January 26, https://arxiv.org/abs/2001.09390.Google Scholar

Volume 71, Issue 2

February 2025

Pages iv-vi, 955

Article Information

Supplemental Material

Metrics

Information

Received:August 18, 2022
Accepted:September 24, 2023
Published Online:May 14, 2024

Cite as

Weichao Mao; , Kaiqing Zhang; , Ruihao Zhu; , David Simchi-Levi, Tamer Başar; (2024) Model-Free Nonstationary Reinforcement Learning: Near-Optimal Regret and Applications in Multiagent Reinforcement Learning and Inventory Control. Management Science 71(2):1564-1580.

https://doi.org/10.1287/mnsc.2022.02533

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Model-Free Nonstationary Reinforcement Learning: Near-Optimal Regret and Applications in Multiagent Reinforcement Learning and Inventory Control

References

Volume 71, Issue 2

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News