Square-Root Regret Bounds for Continuous-Time Episodic Markov Decision Processes

Published Online:https://doi.org/10.1287/moor.2022.0283

References

  • [1] Agarwal A, Jiang N, Kakade SM, Sun W (2021) Reinforcement learning: Theory and algorithms. Accessed June 2022, https://rltheorybook.github.io/rltheorybook_AJKS.pdf.Google Scholar
  • [2] Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 263–272.Google Scholar
  • [3] Basei M, Guo X, Hu A, Zhang Y (2021) Logarithmic regret for episodic continuous-time linear-quadratic reinforcement learning over a finite-time horizon. Preprint, submitted May 23, https://dx.doi.org/10.2139/ssrn.3848428.Google Scholar
  • [4] Bäuerle N, Rieder U (2011) Markov Decision Processes with Applications to Finance (Springer Science & Business Media).CrossrefGoogle Scholar
  • [5] Boucheron S, Lugosi G, Massart P (2013) Concentration Inequalities: A Nonasymptotic Theory of Independence (Oxford University Press, Oxford, UK).CrossrefGoogle Scholar
  • [6] Bradtke SJ, Duff MO (1995) Reinforcement learning methods for continuous-time Markov decision. Tesauro G, Touretzky D, Leen T, eds. Advances in Neural Information Processing Systems, vol. 7 (MIT Press, Cambridge, MA).Google Scholar
  • [7] Canonne CL (2017) A short note on Poisson tail bounds. http://www.cs.columbia.edu/∼ccanonne/files/misc/2017-poissonconcentration.pdf.Google Scholar
  • [8] Dann C, Li L, Wei W, Brunskill E (2019) Policy certificates: Towards accountable reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 1507–1516.Google Scholar
  • [9] Das TK, Gosavi A, Mahadevan S, Marchalleck N (1999) Solving semi-Markov decision problems using average reward reinforcement learning. Management Sci. 45(4):560–574.LinkGoogle Scholar
  • [10] Denardo EV (1967) Contraction mappings in the theory underlying dynamic programming. SIAM Rev. 9(2):165–177.CrossrefGoogle Scholar
  • [11] Domingues OD, Ménard P, Kaufmann E, Valko M (2021) Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. Algorithmic Learn. Theory (PMLR, New York), 578–598.Google Scholar
  • [12] Fruit R, Lazaric A (2017) Exploration-exploitation in MDPs with options. Artificial Intelligence Statist. (PMLR, New York), 576–584.Google Scholar
  • [13] Gallego G, Van Ryzin G (1994) Optimal dynamic pricing of inventories with stochastic demand over finite horizons. Management Sci. 40(8):999–1020.LinkGoogle Scholar
  • [14] Gao X, Zhou XY (2022) Logarithmic regret bounds for continuous-time average-reward Markov decision processes. Preprint, submitted May 23, https://arxiv.org/abs/2205.11168.Google Scholar
  • [15] Garivier A, Ménard P, Stoltz G (2019) Explore first, exploit next: The true shape of regret in bandit problems. Math. Oper. Res. 44(2):377–399.LinkGoogle Scholar
  • [16] Guo X, Hernández-Lerma O (2009) Continuous Time Markov Decision Processes: Theory and Applications (Springer-Verlag, Berlin, Heidelberg).CrossrefGoogle Scholar
  • [17] Guo X, Hu A, Zhang Y (2021) Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls. Preprint, submitted April 19, https://arxiv.org/abs/2104.09311.Google Scholar
  • [18] Guo X, Huang X, Huang Y (2015) Finite-horizon optimality for continuous-time Markov decision processes with unbounded transition rates. Adv. Appl. Probab. 47(4):1064–1087.CrossrefGoogle Scholar
  • [19] Huang Y, Guo X (2011) Finite horizon semi-Markov decision processes with application to maintenance systems. Eur. J. Oper. Res. 212(1):131–140.CrossrefGoogle Scholar
  • [20] Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(51):1563–1600.Google Scholar
  • [21] Jia Y, Zhou XY (2022) Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. J. Machine Learn. Res. 23(154):1–55.Google Scholar
  • [22] Jia Y, Zhou XY (2022) Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. J. Machine Learn. Res. 23(275):1–50.Google Scholar
  • [23] Jia Y, Zhou XY (2023) q-Learning in continuous time. J. Machine Learn. Res. 24(161):1–61.Google Scholar
  • [24] Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., Red Hook, NY).Google Scholar
  • [25] Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [26] Lippman SA (1976) Countable-state, continuous-time dynamic programming with structure. Oper. Res. 24(3):477–490.LinkGoogle Scholar
  • [27] Mamer JW (1986) Successive approximations for finite horizon, semi-Markov decision processes with application to asset liquidation. Oper. Res. 34(4):638–644.LinkGoogle Scholar
  • [28] Miller BL (1968) Finite state continuous time Markov decision processes with a finite planning horizon. SIAM J. Control 6(2):266–280.CrossrefGoogle Scholar
  • [29] Osband I, Van Roy B (2017) Why is posterior sampling better than optimism for reinforcement learning? Internat. Conf. Machine Learn. (PMLR, New York), 2701–2710.Google Scholar
  • [30] Piunovskiy A, Zhang Y (2020) Continuous-time Markov decision processes. Probability Theory and Stochastic Modelling (Springer, Berlin, Heidelberg).Google Scholar
  • [31] Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • [32] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • [33] Szpruch L, Treetanthiploet T, Zhang Y (2021) Exploration-exploitation trade-off for continuous-time episodic reinforcement learning with linear-convex models. Preprint, submitted December 19, https://arxiv.org/abs/2112.10264.Google Scholar
  • [34] Tallec C, Blier L, Ollivier Y (2019) Making deep q-learning methods robust to time discretization. Internat. Conf. Machine Learn. (PMLR, New York), 6096–6104.Google Scholar
  • [35] Wang H, Zariphopoulou T, Zhou XY (2020) Reinforcement learning in continuous time and space: A stochastic control approach. J. Machine Learn. Res. 21(198):1–34.Google Scholar
  • [36] Zanette A, Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Internat. Conf. Machine Learn. (PMLR, New York), 7304–7312.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.