Nonstationary Reinforcement Learning: The Blessing of (More) Optimism

Published Online:https://doi.org/10.1287/mnsc.2023.4704

References

  • Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira FCN, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 24: 25th Annual Conf. Neural Inform. Processing Systems (NIPS), 2312–2320.Google Scholar
  • Abbasi-Yadkori Y, Bartlett PL, Kanade V, Seldin Y, Szepesvári C (2013) Online learning in Markov decision processes with adversarially chosen transition probability distributions. Burges CJC, Bottou L, Ghahramani Z, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 26: 27th Annual Conf. Neural Inform. Processing Systems (NIPS), 2508–2516.Google Scholar
  • Abernethy J, Amin K, Zhu R (2016) Threshold bandits, with and without censored feedback. Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R, eds. Adv. Neural Inform. Processing Systems 29: Annual Conf. Neural Inform. Processing Systems (NIPS), 4889–4897.Google Scholar
  • Agrawal S, Jia R (2017) Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems 26: 27th Annual Conf. Neural Inform. Processing Systems (NIPS), 1184–1194.Google Scholar
  • Agrawal S, Jia R (2022) Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. Oper. Res. 70(3):1646–1664.Google Scholar
  • Arora R, Dekel O, Tewari A (2012) Deterministic MDPs with adversarial rewards and bandit feedback. de Freitas N, Murphy KP, eds. Proc. 28th Conf. Uncertainty Artificial Intelligence, Catalina Island, CA (AUAI Press), 93–101.Google Scholar
  • Auer P, Cesa-Bianchi N, Fischer P (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47:235–256.CrossrefGoogle Scholar
  • Auer P, Cesa-Bianchi N, Freund Y, Schapire R (2002b) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.CrossrefGoogle Scholar
  • Balseiro S, Gur Y (2019) Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Sci. 65(9):3952–3968.LinkGoogle Scholar
  • Bartlett PL, Tewari A (2009) REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Bilmes JA, Ng AY, eds. UAI 2009 Proc. Twenty-Fifth Conf. Uncertainty Artificial Intelligence (AUAI Press), 35–42.Google Scholar
  • Bertsekas D (2017) Dynamic Programming and Optimal Control (Athena Scientific, Nashua, NH).Google Scholar
  • Besbes O, Gur Y, Zeevi A (2019) Optimal exploration-exploitation in a multi-armed-bandit problem with non-stationary rewards. Stochastic Systems 9(4):319–337.LinkGoogle Scholar
  • Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.CrossrefGoogle Scholar
  • Burnetas AN, Katehakis MN (1997) Optimal adaptive policies for Markov decision processes. Math. Oper. Res. 22:222–255.LinkGoogle Scholar
  • Cai H, Ren K, Zhang W, Malialis K, Wang J, Yu Y, Guo D (2017) Real-time bidding by reinforcement learning in display advertising. de Rijke M, Shokouhi M, Tomkins A, Zhang M, eds. Proc. 10th ACM Internat. Conf. Web Search Data Mining (WSDM) (ACM, New York), 661–670.Google Scholar
  • Cardoso AR, Wang H, Xu H (2019) Large scale Markov decision processes with changing rewards. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Adv. Neural Inform. Processing Systems 32: Annual Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates, Inc., Red Hook, NY), 2337–2347.Google Scholar
  • Chen W, Shi C, Duenyas I (2020) Optimal learning algorithms for stochastic inventory systems with random capacities. Production Oper. Management 29(7):1624–1649.CrossrefGoogle Scholar
  • Chen Y, Lee C-W, Luo H, Wei C-Y (2019) A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. Beygelzimer A, Hsu D, eds. Conf. Learn. Theory, COLT (PMLR), 696–726.Google Scholar
  • Cheung WC, Simchi-Levi D, Zhu R (2019a) Hedging the drift: Learning to optimize under non-stationarity. Preprint, submitted March 4, https://arxiv.org/abs/1903.01461v1.Google Scholar
  • Cheung WC, Simchi-Levi D, Zhu R (2019b) Learning to optimize under non-stationarity. Chaudhuri K, Sugiyama M, eds. 22nd Internat. Conf. Artificial Intelligence Statist., AISTATS (PMLR), 1079–1087.Google Scholar
  • Cheung WC, Simchi-Levi D, Zhu R (2020) Reinforcement learning for non-stationary Markov decision processes: The blessing of (more) optimism. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR), 1843–1854.Google Scholar
  • Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to Algorithms (MIT Press, Cambridge, MA).Google Scholar
  • den Boer A, Keskin NB (2020) Dynamic pricing with demand learning and reference effects. Management Sci. 68(10):7112–7130.Google Scholar
  • Dick T, György A, Szepesvári C (2014) Online learning in Markov decision processes with changing cost sequences. Proc. 31st Internat. Conf. Machine Learn. (ICML) (JMLR), 512–520.Google Scholar
  • Even-Dar E, Kakade SM, Mansour Y (2005) Experts in a Markov decision process. Adv. Neural Inform. Processing Systems 18 (NIPS) (Curran Associates, Inc., Red Hook, NY), 401–408.Google Scholar
  • Flajolet A, Jaillet P (2017) Real-time bidding with side information. Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems 30: Annual Conf. Neural Inform. Processing Systems (NeurIPS), 5162–5172.Google Scholar
  • Fruit R, Pirotta M, Lazaric A (2018a) Near optimal exploration-exploitation in non-communicating Markov decision processes. Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems 31: Annual Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates, Inc., Red Hook, NY), 2998–3008.Google Scholar
  • Fruit R, Pirotta M, Lazaric A (2019) Improved analysis of UCRL2 with empirical Bernstein inequality. Preprint, submitted July 10, https://arxiv.org/abs/2007.05456.Google Scholar
  • Fruit R, Pirotta M, Lazaric A, Ortner R (2018b) Efficient bias-span-constrained exploration-exploitation in reinforcement learning. Dy JG, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. Proc. Machine Learn. Res. (ICML), vol. 80 (PMLR), 1573–1581.Google Scholar
  • Gajane P, Ortner R, Auer P (2018) A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. Preprint, submitted May 25, https://arxiv.org/abs/1805.10066.Google Scholar
  • Garivier A, Moulines E (2011) On upper-confidence bound policies for switching bandit problems. Kivinen J, Szepesvári C, Ukkonen E, Zeugmann T, eds. Algorithmic Learning Theory 22nd Internat. Conf. ALT 2011 (Springer, Berlin), 174–188.Google Scholar
  • Google (2011) The arrival of real-time bidding. Accessed February 11, 2023, https://www.rtbchina.com/wp-content/uploads/2012/03/Google-White-Paper-The-Arrival-of-Real-Time-Bidding-July-2011.pdf.Google Scholar
  • Guo X, Hu A, Xu R, Zhang J (2019) Learning mean-field games. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Adv. Neural Inform. Processing Systems 32: Annual Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates, Inc., Red Hook, NY), 4967–4977.Google Scholar
  • Gur Y, Zeevi A, Besbes O (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 27: Annual Conf. Neural Inform. Processing Systems (NIPS) (Curran Associates, Inc., Red Hook, NY), 199–207.Google Scholar
  • Han Y, Zhou Z, Weissman T (2020) Optimal no-regret learning in repeated first-price auctions. Preprint, submitted May 8, https://arxiv.org/abs/2003.09795v4.Google Scholar
  • Huh WT, Rusmevichientong P (2009) A nonparametric asymptotic analysis of inventory planning with censored demand. Math. Oper. Res. 34(1):103–123.Google Scholar
  • Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11:1563–1600.Google Scholar
  • Jin C, Jin T, Luo H, Sra S, Yu T (2020) Learning adversarial Markov decision processes with bandit feedback and unknown transition. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (ICML) (PMLR), 4860–4869.Google Scholar
  • Karnin Z, Anava O (2016) Multi-armed bandits: Competing with optimal sequences. Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R, eds. Adv. Neural Inform. Processing Systems 29: Annual Conf. Neural Inform. Processing Systems (NIPS) (Curran Associates, Inc., Red Hook, NY), 199–207.Google Scholar
  • Keskin N, Zeevi A (2016) Chasing demand: Learning and earning in a changing environments. Math. Oper. Res. 42(2):277–307.LinkGoogle Scholar
  • Keskin NB, Li M (2022) Selling quality-differentiated products in a Markovian market with unknown transition probabilities. Preprint, submitted November 28, http://dx.doi.org/10.2139/ssrn.3526568.Google Scholar
  • Kim MJ, Lim AEB (2016) Robust multiarmed bandit problems. Management Sci. 62(1):264–285.Google Scholar
  • Lattimore T, Szepesvári C (2018) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
  • Li Y, Zhong A, Qu G, Li N (2019) Online Markov decision processes with time-varying transition probabilities and rewards. ICML Workshop Real-World Sequential Decision Making.Google Scholar
  • Luo H, Wei C, Agarwal A, Langford J (2018) Efficient contextual bandits in non-stationary worlds. Bubeck S, Perchet V, Rigollet P, eds. Proc. Conf. Learn. Theory (COLT) (PMLR), 1739–1776.Google Scholar
  • Ma W (2018) Improvements and generalizations of stochastic knapsack and markovian bandits approximation algorithms. Math. Oper. Res. 43(3):789–812.Google Scholar
  • Neu G, György A, Szepesvári C, Antos A (2010) Online Markov decision processes under Bandit feedback. Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A, eds. Adv. Neural Inform. Processing Systems 23: 24th Annual Conf. Neural Inform. Processing Systems (NIPS) (Curran Associates, Inc., Red Hook, NY), 1804–1812.Google Scholar
  • Neu G, Gyorgy A, Szepesvari C (2012) The adversarial stochastic shortest path problem with unknown transition probabilities. Lawrence ND, Girolami MA, eds. Proc. 15th Conf. Artificial Intelligence Statist., vol. 22 (PMLR), 805–813.Google Scholar
  • Ortner R, Gajane P, Auer P (2019) Variational regret bounds for reinforcement learning. Globerson A, Silva R, eds. Proc. 35th Conf. Uncertainty Artificial Intelligence, UAI (AUAI Press), 81–90.Google Scholar
  • Qin Z, Zhu H, Ye J (2022) Reinforcement learning for ridesharing: An extended survey. Transportation Res. Part C Emerging Tech. 144:103852.CrossrefGoogle Scholar
  • Rosenberg A, Mansour Y (2019) Online convex optimization in adversarial Markov decision processes. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR), 5478–5486.Google Scholar
  • Sidford A, Wang M, Wu X, Ye Y (2018a) Variance reduced value iteration and faster algorithms for solving Markov decision processes. Proc. 29th Annual ACM-SIAM Sympos. Discrete Algorithms (SODA) (SIAM), 770–787.Google Scholar
  • Sidford A, Wang M, Wu X, Yang LF, Ye Y (2018b) Near-optimal time and sample complexities for solving discounted Markov decision process with a generative model. Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems 31: Annual Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates, Inc., Red Hook, NY), 5192–5202.Google Scholar
  • Simchi-Levi D, Chen X, Bramel J (2014) The Logic of Logistics: Theory, Algorithms, and Applications for Logistics and Supply Chain Management, 3rd ed. (Springer, New York).CrossrefGoogle Scholar
  • Treharne JT, Sox CR (2002) Adaptive inventory control for nonstationary demand and partial information. Management Sci. 48(5):607–624.Google Scholar
  • Wang M (2019) Randomized linear programming solves the Markov decision problem in nearly-linear (sometimes sublinear) running time. Math. Oper. Res. 45(2):517–546.Google Scholar
  • Wei C-Y, Jafarnia-Jahromi M, Luo H, Sharma H, Jain R (2020) Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. Proc. 37th Internat. Conf. Machine Learn. (ICML), vol. 119, Virtual Event (PMLR), 10170–10180.Google Scholar
  • Yu JY, Mannor S (2009) Online learning in Markov decision processes with arbitrarily changing rewards and transitions. Basar T, Özbay H, eds. 1st Internat. Conf. Game Theory Networks, GAMENETS (IEEE, Piscataway, NJ), 314–322.Google Scholar
  • Yu JY, Mannor S, Shimkin N (2009) Markov decision processes with arbitrary reward processes. Math. Oper. Res. 34:737–757.LinkGoogle Scholar
  • Yuan H, Luo Q, Shi C (2021) Marrying stochastic gradient descent with bandits: Learning algorithms for inventory systems with fixed costs. Management Sci. 67(10):6089–6115.LinkGoogle Scholar
  • Zhang H, Chao X, Shi C (2018) Closing the gap: A learning algorithm for the lost-sales inventory system with lead times. Management Sci. 66(5):1962–1980.Google Scholar
  • Zhang Z, Ji X (2019) Regret minimization for reinforcement learning by evaluating the optimal bias function. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Adv. Neural Inform. Processing Systems 32: Annual Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates, Inc., Red Hook, NY), 2823–2832.Google Scholar
  • Zhou X, Chen N, Gao X, Xiong Y (2020) Regime switching bandits. Preprint, submitted January 26, https://arxiv.org/abs/2001.09390v1.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.