Square-Root Regret Bounds for Continuous-Time Episodic Markov Decision Processes

Xuefeng Gao (Corresponding Author)
[email protected]
https://orcid.org/0000-0003-2424-8257
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region, China

Xunyu Zhou
[email protected]
Department of Industrial Engineering and Operations Research and The Data Science Institute, Columbia University, New York, New York 10027

Published Online: 12 Feb 2025
https://doi.org/10.1287/moor.2022.0283

Abstract

We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the intertransition times of a continuous-time MDP are exponentially distributed with rate parameters depending on the state–action pair at each transition. We present a learning algorithm based on the methods of value iteration and upper confidence bound. We derive an upper bound on the worst-case expected regret for the proposed algorithm and establish a worst-case lower bound; both bounds are of the order of the square root of the number of episodes. Finally, we conduct simulation experiments to illustrate the performance of our algorithm.
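To make the setting concrete, the following is a minimal Python sketch of how one episode of a finite-horizon continuous-time MDP could be simulated. It illustrates only the exponential intertransition times described above and is not the paper's learning algorithm. The function name `simulate_episode`, the dictionary-based model layout, the reward-rate formulation, and the choice to pick actions only at transition epochs are all assumptions made for illustration.

```python
import random


def simulate_episode(rates, trans, reward_rate, policy, s0, horizon, rng):
    """Simulate one episode on [0, horizon] (hypothetical helper).

    rates[(s, a)]       : rate of the exponential holding time (assumed layout)
    trans[(s, a)]       : dict mapping next state -> transition probability
    reward_rate[(s, a)] : reward earned per unit time (assumed formulation)
    policy(s, t)        : action chosen in state s at time t
    Returns (total_reward, path), where path lists (time, state, action, dwell).
    """
    t, s, total = 0.0, s0, 0.0
    path = []
    while True:
        a = policy(s, t)
        # The holding time is exponential with a rate depending on (s, a).
        tau = rng.expovariate(rates[(s, a)])
        if t + tau >= horizon:
            # The episode ends before the next transition occurs.
            dwell = horizon - t
            total += reward_rate[(s, a)] * dwell
            path.append((t, s, a, dwell))
            return total, path
        total += reward_rate[(s, a)] * tau
        path.append((t, s, a, tau))
        t += tau
        # Draw the next state from the transition distribution of (s, a).
        states = list(trans[(s, a)].keys())
        probs = list(trans[(s, a)].values())
        s = rng.choices(states, weights=probs)[0]


# Toy two-state, two-action model (all numbers are made up).
toy_rates = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.0, (1, 1): 0.5}
toy_trans = {key: {0: 0.5, 1: 0.5} for key in toy_rates}
toy_reward = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.0, (1, 1): 0.7}
total, path = simulate_episode(
    toy_rates, toy_trans, toy_reward,
    policy=lambda s, t: 0 if t < 2.5 else 1,
    s0=0, horizon=5.0, rng=random.Random(0),
)
```

Unlike a discrete-time episode, the number of decision epochs in `path` is random, since it depends on how many exponential holding times fit into the horizon.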

Funding: X. Gao is supported by the Hong Kong Research Grant Council [Grants 14201421, 14212522, 14200123]. X. Zhou gratefully acknowledges financial support through the Nie Center for Intelligent Asset Management at Columbia.

Mathematics of Operations Research

Volume 51, Issue 1

February 2026

Pages iv-viii, 1-851

Article Information


Received: October 03, 2022
Accepted: November 09, 2024
Published Online: February 12, 2025

Cite as

Xuefeng Gao, Xunyu Zhou (2025) Square-Root Regret Bounds for Continuous-Time Episodic Markov Decision Processes. Mathematics of Operations Research 51(1):333-357.

https://doi.org/10.1287/moor.2022.0283

Acknowledgments

The authors thank the area editor, the associate editor, and three anonymous referees for many constructive comments and suggestions, which have led to a substantial improvement of the paper. In particular, one of the referees pointed out a (subtle) gap in the proof of a result in the previous version of the paper, which motivated the authors to devise a substantial new argument to fill the gap. The authors also thank Wenhao Xu for help with the simulation experiment.
