Square-Root Regret Bounds for Continuous-Time Episodic Markov Decision Processes

Xuefeng Gao
Corresponding Author
Xuefeng Gao
[email protected]
https://orcid.org/0000-0003-2424-8257
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region, China
Search for more papers by this author
,
Xunyu Zhou
Xunyu Zhou
[email protected]
Department of Industrial Engineering and Operations Research and The Data Science Institute, Columbia University, New York, New York 10027
Search for more papers by this author

Corresponding Author

Xuefeng Gao

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region, China

Search for more papers by this author

Xunyu Zhou

[email protected]

Department of Industrial Engineering and Operations Research and The Data Science Institute, Columbia University, New York, New York 10027

Search for more papers by this author

Published Online:12 Feb 2025https://doi.org/10.1287/moor.2022.0283

References

[1] Agarwal A, Jiang N, Kakade SM, Sun W (2021) Reinforcement learning: Theory and algorithms. Accessed June 2022, https://rltheorybook.github.io/rltheorybook_AJKS.pdf.Google Scholar
[2] Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 263–272.Google Scholar
[3] Basei M, Guo X, Hu A, Zhang Y (2021) Logarithmic regret for episodic continuous-time linear-quadratic reinforcement learning over a finite-time horizon. Preprint, submitted May 23, https://dx.doi.org/10.2139/ssrn.3848428.Google Scholar
[4] Bäuerle N, Rieder U (2011) Markov Decision Processes with Applications to Finance (Springer Science & Business Media).Crossref, Google Scholar
[5] Boucheron S, Lugosi G, Massart P (2013) Concentration Inequalities: A Nonasymptotic Theory of Independence (Oxford University Press, Oxford, UK).Crossref, Google Scholar
[6] Bradtke SJ, Duff MO (1995) Reinforcement learning methods for continuous-time Markov decision. Tesauro G, Touretzky D, Leen T, eds. Advances in Neural Information Processing Systems, vol. 7 (MIT Press, Cambridge, MA).Google Scholar
[7] Canonne CL (2017) A short note on Poisson tail bounds. http://www.cs.columbia.edu/∼ccanonne/files/misc/2017-poissonconcentration.pdf.Google Scholar
[8] Dann C, Li L, Wei W, Brunskill E (2019) Policy certificates: Towards accountable reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 1507–1516.Google Scholar
[9] Das TK, Gosavi A, Mahadevan S, Marchalleck N (1999) Solving semi-Markov decision problems using average reward reinforcement learning. Management Sci. 45(4):560–574.Link, Google Scholar
[10] Denardo EV (1967) Contraction mappings in the theory underlying dynamic programming. SIAM Rev. 9(2):165–177.Crossref, Google Scholar
[11] Domingues OD, Ménard P, Kaufmann E, Valko M (2021) Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. Algorithmic Learn. Theory (PMLR, New York), 578–598.Google Scholar
[12] Fruit R, Lazaric A (2017) Exploration-exploitation in MDPs with options. Artificial Intelligence Statist. (PMLR, New York), 576–584.Google Scholar
[13] Gallego G, Van Ryzin G (1994) Optimal dynamic pricing of inventories with stochastic demand over finite horizons. Management Sci. 40(8):999–1020.Link, Google Scholar
[14] Gao X, Zhou XY (2022) Logarithmic regret bounds for continuous-time average-reward Markov decision processes. Preprint, submitted May 23, https://arxiv.org/abs/2205.11168.Google Scholar
[15] Garivier A, Ménard P, Stoltz G (2019) Explore first, exploit next: The true shape of regret in bandit problems. Math. Oper. Res. 44(2):377–399.Link, Google Scholar
[16] Guo X, Hernández-Lerma O (2009) Continuous Time Markov Decision Processes: Theory and Applications (Springer-Verlag, Berlin, Heidelberg).Crossref, Google Scholar
[17] Guo X, Hu A, Zhang Y (2021) Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls. Preprint, submitted April 19, https://arxiv.org/abs/2104.09311.Google Scholar
[18] Guo X, Huang X, Huang Y (2015) Finite-horizon optimality for continuous-time Markov decision processes with unbounded transition rates. Adv. Appl. Probab. 47(4):1064–1087.Crossref, Google Scholar
[19] Huang Y, Guo X (2011) Finite horizon semi-Markov decision processes with application to maintenance systems. Eur. J. Oper. Res. 212(1):131–140.Crossref, Google Scholar
[20] Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(51):1563–1600.Google Scholar
[21] Jia Y, Zhou XY (2022) Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. J. Machine Learn. Res. 23(154):1–55.Google Scholar
[22] Jia Y, Zhou XY (2022) Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. J. Machine Learn. Res. 23(275):1–50.Google Scholar
[23] Jia Y, Zhou XY (2023) q-Learning in continuous time. J. Machine Learn. Res. 24(161):1–61.Google Scholar
[24] Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., Red Hook, NY).Google Scholar
[25] Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[26] Lippman SA (1976) Countable-state, continuous-time dynamic programming with structure. Oper. Res. 24(3):477–490.Link, Google Scholar
[27] Mamer JW (1986) Successive approximations for finite horizon, semi-Markov decision processes with application to asset liquidation. Oper. Res. 34(4):638–644.Link, Google Scholar
[28] Miller BL (1968) Finite state continuous time Markov decision processes with a finite planning horizon. SIAM J. Control 6(2):266–280.Crossref, Google Scholar
[29] Osband I, Van Roy B (2017) Why is posterior sampling better than optimism for reinforcement learning? Internat. Conf. Machine Learn. (PMLR, New York), 2701–2710.Google Scholar
[30] Piunovskiy A, Zhang Y (2020) Continuous-time Markov decision processes. Probability Theory and Stochastic Modelling (Springer, Berlin, Heidelberg).Google Scholar
[31] Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
[32] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
[33] Szpruch L, Treetanthiploet T, Zhang Y (2021) Exploration-exploitation trade-off for continuous-time episodic reinforcement learning with linear-convex models. Preprint, submitted December 19, https://arxiv.org/abs/2112.10264.Google Scholar
[34] Tallec C, Blier L, Ollivier Y (2019) Making deep q-learning methods robust to time discretization. Internat. Conf. Machine Learn. (PMLR, New York), 6096–6104.Google Scholar
[35] Wang H, Zariphopoulou T, Zhou XY (2020) Reinforcement learning in continuous time and space: A stochastic control approach. J. Machine Learn. Res. 21(198):1–34.Google Scholar
[36] Zanette A, Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Internat. Conf. Machine Learn. (PMLR, New York), 7304–7312.Google Scholar

cover image Mathematics of Operations Research

Volume 51, Issue 1

February 2026

Pages iv-viii, 1-851

Article Information

Metrics

Information

Received:October 03, 2022
Accepted:November 09, 2024
Published Online:February 12, 2025

Cite as

Xuefeng Gao, Xunyu Zhou (2025) Square-Root Regret Bounds for Continuous-Time Episodic Markov Decision Processes. Mathematics of Operations Research 51(1):333-357.

https://doi.org/10.1287/moor.2022.0283

Keywords

Acknowledgments

The authors thank the area editor, the associate editor, and three anonymous referees for many constructive comments and suggestions, which have led to a substantial improvement of the paper. In particular, one of the referees pointed out a (subtle) gap in the proof of a result in the previous version of the paper, which motivated the authors to devise a substantial new argument to fill the gap. The authors also thank Wenhao Xu for help with the simulation experiment.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Square-Root Regret Bounds for Continuous-Time Episodic Markov Decision Processes

References

Volume 51, Issue 1

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News