Model-Free Nonstationary Reinforcement Learning: Near-Optimal Regret and Applications in Multiagent Reinforcement Learning and Inventory Control

Published Online:https://doi.org/10.1287/mnsc.2022.02533

References

  • Agarwal A, Henaff M, Kakade S, Sun W (2020) PC-PG: Policy cover directed exploration for provable policy gradient learning. Preprint, submitted July 16, https://arxiv.org/abs/2007.08459.Google Scholar
  • Agrawal S, Jia R (2019) Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. Proc. 2019 ACM Conf. Econom. Comput. (Association for Computing Machinery, New York), 743–744.Google Scholar
  • Allesiardo R, Féraud R, Maillard O-A (2017) The non-stationary stochastic multi-armed bandit problem. Internat. J. Data Sci. Anal. 3(4):267–283.CrossrefGoogle Scholar
  • Arora R, Dekel O, Tewari A (2012) Deterministic MDPs with adversarial rewards and bandit feedback. Proc. Twenty-Eighth Conf. Uncertainty Artificial Intelligence (AUAI Press, Arlington, VA), 93–101.Google Scholar
  • Arslan G, Yüksel S (2016) Decentralized Q-learning for stochastic teams and games. IEEE Trans. Automat. Control 62(4):1545–1558.CrossrefGoogle Scholar
  • Auer P, Gajane P, Ortner R (2019) Adaptively tracking the best bandit arm with an unknown number of distribution changes. Proc. Thirty-Second Conf. Learn. Theory, vol. 99 (PMLR, New York), 138–158.Google Scholar
  • Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.CrossrefGoogle Scholar
  • Bai Y, Xie T, Jiang N, Wang Y-X (2019) Provably efficient Q-learning with low switching cost. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 8004–8013.Google Scholar
  • Balseiro SR, Gur Y (2019) Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Sci. 65(9):3952–3968.LinkGoogle Scholar
  • Besbes O, Gur Y, Zeevi A (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. Proc. 27th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 199–207.Google Scholar
  • Besbes O, Gur Y, Zeevi A (2019) Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards. Stochastic Systems 9(4):319–337.LinkGoogle Scholar
  • Birge JR, Chen H, Keskin NB, Ward A (2024) To interfere or not to interfere: Information revelation and price-setting incentives in a multiagent learning environment. Oper. Res., ePub ahead of print March 27, https://doi.org/10.1287/opre.2023.0363.LinkGoogle Scholar
  • Cai H, Ren K, Zhang W, Malialis K, Wang J, Yu Y, Guo D (2017) Real-time bidding by reinforcement learning in display advertising. Proc. Tenth ACM Internat. Conf. Web Search Data Mining (Association for Computing Machinery, New York), 661–670.Google Scholar
  • Chawla S, Devanur NR, Karlin AR, Sivan B (2016) Simple pricing schemes for consumers with evolving values. Proc. Twenty-Seventh Annual ACM-SIAM Sympos. Discrete Algorithms (Society for Industrial and Applied Mathematics, Philadelphia), 1476–1490.Google Scholar
  • Chen N, Wang C, Wang L (2020a) Learning and optimization with seasonal patterns. Preprint, submitted May 16, https://arxiv.org/abs/2005.08088.Google Scholar
  • Chen Y, Lee C-W, Luo H, Wei C-Y (2019) A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. Proc. Thirty-Second Conf. Learn. Theory, vol. 99 (PMLR, New York), 696–726.Google Scholar
  • Chen C, Wei H, Xu N, Zheng G, Yang M, Xiong Y, Xu K, Li Z (2020b) Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control. AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 3414–3421.Google Scholar
  • Cheung WC, Simchi-Levi D, Zhu R (2019a) Hedging the drift: Learning to optimize under non-stationarity. Preprint, submitted March 4, https://arxiv.org/abs/1903.01461.Google Scholar
  • Cheung WC, Simchi-Levi D, Zhu R (2019b) Learning to optimize under non-stationarity. Proc. Twenty-Second Internat. Conf. Artificial Intelligence Statist., vol. 89 (PMLR, New York), 1079–1087.Google Scholar
  • Cheung WC, Simchi-Levi D, Zhu R (2019c) Non-stationary reinforcement learning: The blessing of (more) optimism. Preprint, submitted May 23, https://dx.doi.org/10.2139/ssrn.3397818.Google Scholar
  • Cheung WC, Simchi-Levi D, Zhu R (2020) Reinforcement learning for non-stationary Markov decision processes: The blessing of (more) optimism. Preprint, submitted June 24, https://arxiv.org/abs/2006.14389.Google Scholar
  • Daskalakis C, Foster DJ, Golowich N (2020) Independent policy gradient methods for competitive reinforcement learning. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 5527–5540.Google Scholar
  • Dick T, Gyorgy A, Szepesvari C (2014) Online learning in Markov decision processes with changing cost sequences. Proc. 31st Internat. Conf. Machine Learn. (PMLR, New York), 512–520.Google Scholar
  • Domingues OD, Ménard P, Pirotta M, Kaufmann E, Valko M (2021) A kernel-based approach to non-stationary reinforcement learning in metric spaces. Proc. 40th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 3538–3546.Google Scholar
  • Fei Y, Yang Z, Wang Z, Xie Q (2020) Dynamic regret of policy optimization in non-stationary environments. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 6743–6754.Google Scholar
  • Gajane P, Ortner R, Auer P (2018) A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. Preprint, submitted May 25, https://arxiv.org/abs/1805.10066.Google Scholar
  • Gao M, Xie T, Du SS, Yang LF (2021) A provably efficient algorithm for linear Markov decision process with low switching cost. Preprint, submitted January 2, https://arxiv.org/abs/2101.00494.Google Scholar
  • Garivier A, Moulines E (2011) On upper-confidence bound policies for switching bandit problems. Proc. 22nd Internat. Conf. Algorithmic Learn. Theory (Springer-Verlag, Berlin, Heidelberg), 174–188.Google Scholar
  • Huh WT, Rusmevichientong P (2009) A nonparametric asymptotic analysis of inventory planning with censored demand. Math. Oper. Res. 34(1):103–123.LinkGoogle Scholar
  • Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(51):1563–1600.Google Scholar
  • Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 4868–4878.Google Scholar
  • Jin C, Jin T, Luo H, Sra S, Yu T (2019) Learning adversarial MDPs with bandit feedback and unknown transition. Preprint, submitted December 3, https://arxiv.org/abs/1912.01192.Google Scholar
  • Karnin ZS, Anava O (2016) Multi-armed bandits: Competing with optimal sequences. Proc. 30th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 199–207.Google Scholar
  • Keskin NB, Zeevi A (2017) Chasing demand: Learning and earning in a changing environment. Math. Oper. Res. 42(2):277–307.LinkGoogle Scholar
  • Lee C-W, Luo H, Wei C-Y, Zhang M (2020) Linear last-iterate convergence for matrix games and stochastic games. Preprint, submitted June 16, https://arxiv.org/abs/2006.09517v1.Google Scholar
  • Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. Proc. Eleventh Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers Inc., San Francisco), 157–163.Google Scholar
  • Lu J, Yang C, Gao X, Wang L, Li C, Chen G (2019) Reinforcement learning with sequential information clustering in real-time bidding. Proc. 28th ACM Internat. Conf. Inform. Knowledge Management (Association for Computing Machinery, New York), 1633–1641.Google Scholar
  • Luo H, Wei C-Y, Agarwal A, Langford J (2018) Efficient contextual bandits in non-stationary worlds. Proc. 31st Conf. Learn. Theory (PMLR, New York), 1739–1776.Google Scholar
  • Lykouris T, Simchowitz M, Slivkins A, Sun W (2019) Corruption robust exploration in episodic reinforcement learning. Preprint, submitted November 20, https://arxiv.org/abs/1911.08689.Google Scholar
  • Ma W (2018) Improvements and generalizations of stochastic knapsack and Markovian bandits approximation algorithms. Math. Oper. Res. 43(3):789–812.LinkGoogle Scholar
  • Mao W, Zhang K, Xie Q, Başar T (2020) POLY-HOOT: Monte-Carlo planning in continuous space MDPs with non-asymptotic analysis. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 4549–4559.Google Scholar
  • Mao W, Zhang K, Zhu R, Simchi-Levi D, Basar T (2021) Near-optimal model-free reinforcement learning in non-stationary episodic MDPs. Proc. 38th Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 7447–7458.Google Scholar
  • Misra D, Henaff M, Krishnamurthy A, Langford J (2020) Kinematic state abstraction and provably efficient rich-observation reinforcement learning. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, New York), 6961–6971.Google Scholar
  • Neu G, Antos A, György A, Szepesvári C (2010) Online Markov decision processes under bandit feedback. Proc. 23rd Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates Inc., Red Hook, NY), 1804–1812.Google Scholar
  • Ortner R, Gajane P, Auer P (2020) Variational regret bounds for reinforcement learning. Proc. 35th Uncertainty Artificial Intelligence Conf., vol. 115 (PMLR, New York), 81–90.Google Scholar
  • Osband I, Van Roy B (2016) On lower bounds for regret in reinforcement learning. Preprint, submitted August 9, https://arxiv.org/abs/1608.02732.Google Scholar
  • Padakandla S (2020) A survey of reinforcement learning algorithms for dynamically varying environments. Preprint, submitted May 19, https://arxiv.org/abs/2005.10619.Google Scholar
  • Radanovic G, Devidze R, Parkes D, Singla A (2019) Learning to collaborate in Markov decision processes. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 5261–5270.Google Scholar
  • Roughgarden T (2009) Intrinsic robustness of the price of anarchy. Proc. 41st Annual ACM Sympos. Theory Comput. (Association for Computing Machinery, New York), 513–522.Google Scholar
  • Shi C, Chen W, Duenyas I (2016) Nonparametric data-driven algorithms for multiproduct inventory systems with censored demand. Oper. Res. 64(2):362–370.LinkGoogle Scholar
  • Shortreed SM, Laber E, Lizotte DJ, Stroup TS, Pineau J, Murphy SA (2011) Informing sequential clinical decision-making through reinforcement learning: An empirical study. Machine Learn. 84(1–2):109–136.CrossrefGoogle Scholar
  • Syrgkanis V, Agarwal A, Luo H, Schapire RE (2015) Fast convergence of regularized learning in games. Proc. 28th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 2989–2997.Google Scholar
  • Tekin C, Liu M (2010) Online algorithms for the multi-armed bandit problem with Markovian rewards. Proc. 48th Annual Allerton Conf. Comm. Control Comput. (Allerton) (IEEE, Piscataway, NJ), 1675–1682.Google Scholar
  • Tian Y, Wang Y, Yu T, Sra S (2021) Online learning in unknown Markov games. Proc. 38th Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 10279–10288.Google Scholar
  • Touati A, Vincent P (2020) Efficient learning in non-stationary linear Markov decision processes. Preprint, submitted October 24, https://arxiv.org/abs/2010.12870.Google Scholar
  • Wang J, Liu Y, Li B (2020) Reinforcement learning with perturbed rewards. Proc. 34th AAAI Conf. Artificial Intelligence, vol. 34, no. 04 (AAAI Press, Palo Alto, CA), 6202–6209.Google Scholar
  • Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, Cambridge, UK.Google Scholar
  • Wei C-Y, Luo H (2021) Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Preprint, submitted February 10, https://arxiv.org/abs/2102.05406.Google Scholar
  • Yadkori YA, Bartlett PL, Kanade V, Seldin Y, Szepesvári C (2013) Online learning in Markov decision processes with adversarially chosen transition probability distributions. Proc. 26th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 2508–2516.Google Scholar
  • Yu JY, Mannor S (2009) Online learning in Markov decision processes with arbitrarily changing rewards and transitions. Proc. First Internat. Conf. Game Theory Networks (IEEE, Piscataway, NJ), 314–322.Google Scholar
  • Yu Y, Chen X, Zhang F (2015) Dynamic capacity management with general upgrading. Oper. Res. 63(6):1372–1389.LinkGoogle Scholar
  • Yuan H, Luo Q, Shi C (2021) Marrying stochastic gradient descent with bandits: Learning algorithms for inventory systems with fixed costs. Management Sci. 67(10):6089–6115.LinkGoogle Scholar
  • Zhang H, Chao X, Shi C (2019) Closing the gap: A learning algorithm for the lost-sales inventory system with lead times. Management Sci. 66(5):1962–1980.LinkGoogle Scholar
  • Zhang Z, Zhou Y, Ji X (2020) Almost optimal model-free reinforcement learning via reference-advantage decomposition. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 15198–15207.Google Scholar
  • Zhao P, Zhang L, Jiang Y, Zhou Z-H (2020) A simple approach for non-stationary linear bandits. Proc. Twenty Third Internat. Conf. Artificial Intelligence Statist., vol. 108 (PMLR, New York), 746–755.Google Scholar
  • Zhou H, Chen J, Varshney LR, Jagmohan A (2020a) Nonstationary reinforcement learning with linear function approximation. Preprint, submitted October 8, https://arxiv.org/abs/2010.04244.Google Scholar
  • Zhou X, Xiong Y, Chen N, Gao X (2020b) Regime switching bandits. Preprint, submitted January 26, https://arxiv.org/abs/2001.09390.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.