Model-Free Nonstationary Reinforcement Learning: Near-Optimal Regret and Applications in Multiagent Reinforcement Learning and Inventory Control
Published Online:14 May 2024https://doi.org/10.1287/mnsc.2022.02533
References
- (2020) PC-PG: Policy cover directed exploration for provable policy gradient learning. Preprint, submitted July 16, https://arxiv.org/abs/2007.08459.Google Scholar
- (2019) Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. Proc. 2019 ACM Conf. Econom. Comput. (Association for Computing Machinery, New York), 743–744.Google Scholar
- (2017) The non-stationary stochastic multi-armed bandit problem. Internat. J. Data Sci. Anal. 3(4):267–283.Crossref, Google Scholar
- (2012) Deterministic MDPs with adversarial rewards and bandit feedback. Proc. Twenty-Eighth Conf. Uncertainty Artificial Intelligence (AUAI Press, Arlington, VA), 93–101.Google Scholar
- (2016) Decentralized Q-learning for stochastic teams and games. IEEE Trans. Automat. Control 62(4):1545–1558.Crossref, Google Scholar
- (2019) Adaptively tracking the best bandit arm with an unknown number of distribution changes. Proc. Thirty-Second Conf. Learn. Theory, vol. 99 (PMLR, New York), 138–158.Google Scholar
- (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
- (2019) Provably efficient Q-learning with low switching cost. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 8004–8013.Google Scholar
- (2019) Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Sci. 65(9):3952–3968.Link, Google Scholar
- (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. Proc. 27th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 199–207.Google Scholar
- (2019) Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards. Stochastic Systems 9(4):319–337.Link, Google Scholar
- (2024) To interfere or not to interfere: Information revelation and price-setting incentives in a multiagent learning environment. Oper. Res., ePub ahead of print March 27, https://doi.org/10.1287/opre.2023.0363.Link, Google Scholar
- (2017) Real-time bidding by reinforcement learning in display advertising. Proc. Tenth ACM Internat. Conf. Web Search Data Mining (Association for Computing Machinery, New York), 661–670.Google Scholar
- (2016) Simple pricing schemes for consumers with evolving values. Proc. Twenty-Seventh Annual ACM-SIAM Sympos. Discrete Algorithms (Society for Industrial and Applied Mathematics, Philadelphia), 1476–1490.Google Scholar
- (2020a) Learning and optimization with seasonal patterns. Preprint, submitted May 16, https://arxiv.org/abs/2005.08088.Google Scholar
- (2019) A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. Proc. Thirty-Second Conf. Learn. Theory, vol. 99 (PMLR, New York), 696–726.Google Scholar
- (2020b) Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control. AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 3414–3421.Google Scholar
- (2019a) Hedging the drift: Learning to optimize under non-stationarity. Preprint, submitted March 4, https://arxiv.org/abs/1903.01461.Google Scholar
- (2019b) Learning to optimize under non-stationarity. Proc. Twenty-Second Internat. Conf. Artificial Intelligence Statist., vol. 89 (PMLR, New York), 1079–1087.Google Scholar
- (2019c) Non-stationary reinforcement learning: The blessing of (more) optimism. Preprint, submitted May 23, https://dx.doi.org/10.2139/ssrn.3397818.Google Scholar
- (2020) Reinforcement learning for non-stationary Markov decision processes: The blessing of (more) optimism. Preprint, submitted June 24, https://arxiv.org/abs/2006.14389.Google Scholar
- (2020) Independent policy gradient methods for competitive reinforcement learning. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 5527–5540.Google Scholar
- (2014) Online learning in Markov decision processes with changing cost sequences. Proc. 31st Internat. Conf. Machine Learn. (PMLR, New York), 512–520.Google Scholar
- (2021) A kernel-based approach to non-stationary reinforcement learning in metric spaces. Proc. 40th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 3538–3546.Google Scholar
- (2020) Dynamic regret of policy optimization in non-stationary environments. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 6743–6754.Google Scholar
- (2018) A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. Preprint, submitted May 25, https://arxiv.org/abs/1805.10066.Google Scholar
- (2021) A provably efficient algorithm for linear Markov decision process with low switching cost. Preprint, submitted January 2, https://arxiv.org/abs/2101.00494.Google Scholar
- (2011) On upper-confidence bound policies for switching bandit problems. Proc. 22nd Internat. Conf. Algorithmic Learn. Theory (Springer-Verlag, Berlin, Heidelberg), 174–188.Google Scholar
- (2009) A nonparametric asymptotic analysis of inventory planning with censored demand. Math. Oper. Res. 34(1):103–123.Link, Google Scholar
- (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(51):1563–1600.Google Scholar
- (2018) Is Q-learning provably efficient? Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 4868–4878.Google Scholar
- (2019) Learning adversarial MDPs with bandit feedback and unknown transition. Preprint, submitted December 3, https://arxiv.org/abs/1912.01192.Google Scholar
- (2016) Multi-armed bandits: Competing with optimal sequences. Proc. 30th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 199–207.Google Scholar
- (2017) Chasing demand: Learning and earning in a changing environment. Math. Oper. Res. 42(2):277–307.Link, Google Scholar
- (2020) Linear last-iterate convergence for matrix games and stochastic games. Preprint, submitted June 16, https://arxiv.org/abs/2006.09517v1.Google Scholar
- (1994) Markov games as a framework for multi-agent reinforcement learning. Proc. Eleventh Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers Inc., San Francisco), 157–163.Google Scholar
- (2019) Reinforcement learning with sequential information clustering in real-time bidding. Proc. 28th ACM Internat. Conf. Inform. Knowledge Management (Association for Computing Machinery, New York), 1633–1641.Google Scholar
- (2018) Efficient contextual bandits in non-stationary worlds. Proc. 31st Conf. Learn. Theory (PMLR, New York), 1739–1776.Google Scholar
- (2019) Corruption robust exploration in episodic reinforcement learning. Preprint, submitted November 20, https://arxiv.org/abs/1911.08689.Google Scholar
- (2018) Improvements and generalizations of stochastic knapsack and Markovian bandits approximation algorithms. Math. Oper. Res. 43(3):789–812.Link, Google Scholar
- (2020) POLY-HOOT: Monte-Carlo planning in continuous space MDPs with non-asymptotic analysis. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 4549–4559.Google Scholar
- (2021) Near-optimal model-free reinforcement learning in non-stationary episodic MDPs. Proc. 38th Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 7447–7458.Google Scholar
- (2020) Kinematic state abstraction and provably efficient rich-observation reinforcement learning. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, New York), 6961–6971.Google Scholar
- (2010) Online Markov decision processes under bandit feedback. Proc. 23rd Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates Inc., Red Hook, NY), 1804–1812.Google Scholar
- (2020) Variational regret bounds for reinforcement learning. Proc. 35th Uncertainty Artificial Intelligence Conf., vol. 115 (PMLR, New York), 81–90.Google Scholar
- (2016) On lower bounds for regret in reinforcement learning. Preprint, submitted August 9, https://arxiv.org/abs/1608.02732.Google Scholar
- (2020) A survey of reinforcement learning algorithms for dynamically varying environments. Preprint, submitted May 19, https://arxiv.org/abs/2005.10619.Google Scholar
- (2019) Learning to collaborate in Markov decision processes. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 5261–5270.Google Scholar
- (2009) Intrinsic robustness of the price of anarchy. Proc. 41st Annual ACM Sympos. Theory Comput. (Association for Computing Machinery, New York), 513–522.Google Scholar
- (2016) Nonparametric data-driven algorithms for multiproduct inventory systems with censored demand. Oper. Res. 64(2):362–370.Link, Google Scholar
- (2011) Informing sequential clinical decision-making through reinforcement learning: An empirical study. Machine Learn. 84(1–2):109–136.Crossref, Google Scholar
- (2015) Fast convergence of regularized learning in games. Proc. 28th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 2989–2997.Google Scholar
- (2010) Online algorithms for the multi-armed bandit problem with Markovian rewards. Proc. 48th Annual Allerton Conf. Comm. Control Comput. (Allerton) (IEEE, Piscataway, NJ), 1675–1682.Google Scholar
- (2021) Online learning in unknown Markov games. Proc. 38th Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 10279–10288.Google Scholar
- (2020) Efficient learning in non-stationary linear Markov decision processes. Preprint, submitted October 24, https://arxiv.org/abs/2010.12870.Google Scholar
- (2020) Reinforcement learning with perturbed rewards. Proc. 34th AAAI Conf. Artificial Intelligence, vol. 34, no. 04 (AAAI Press, Palo Alto, CA), 6202–6209.Google Scholar
- (1989) Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, Cambridge, UK.Google Scholar
- (2021) Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Preprint, submitted February 10, https://arxiv.org/abs/2102.05406.Google Scholar
- (2013) Online learning in Markov decision processes with adversarially chosen transition probability distributions. Proc. 26th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 2508–2516.Google Scholar
- (2009) Online learning in Markov decision processes with arbitrarily changing rewards and transitions. Proc. First Internat. Conf. Game Theory Networks (IEEE, Piscataway, NJ), 314–322.Google Scholar
- (2015) Dynamic capacity management with general upgrading. Oper. Res. 63(6):1372–1389.Link, Google Scholar
- (2021) Marrying stochastic gradient descent with bandits: Learning algorithms for inventory systems with fixed costs. Management Sci. 67(10):6089–6115.Link, Google Scholar
- (2019) Closing the gap: A learning algorithm for the lost-sales inventory system with lead times. Management Sci. 66(5):1962–1980.Link, Google Scholar
- (2020) Almost optimal model-free reinforcement learning via reference-advantage decomposition. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 15198–15207.Google Scholar
- (2020) A simple approach for non-stationary linear bandits. Proc. Twenty Third Internat. Conf. Artificial Intelligence Statist., vol. 108 (PMLR, New York), 746–755.Google Scholar
- (2020a) Nonstationary reinforcement learning with linear function approximation. Preprint, submitted October 8, https://arxiv.org/abs/2010.04244.Google Scholar
- (2020b) Regime switching bandits. Preprint, submitted January 26, https://arxiv.org/abs/2001.09390.Google Scholar

