Square-Root Regret Bounds for Continuous-Time Episodic Markov Decision Processes
References
- [1] (2021) Reinforcement learning: Theory and algorithms. Accessed June 2022, https://rltheorybook.github.io/rltheorybook_AJKS.pdf.Google Scholar
- [2] (2017) Minimax regret bounds for reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 263–272.Google Scholar
- [3] (2021) Logarithmic regret for episodic continuous-time linear-quadratic reinforcement learning over a finite-time horizon. Preprint, submitted May 23, https://dx.doi.org/10.2139/ssrn.3848428.Google Scholar
- [4] (2011) Markov Decision Processes with Applications to Finance (Springer Science & Business Media).Crossref, Google Scholar
- [5] (2013) Concentration Inequalities: A Nonasymptotic Theory of Independence (Oxford University Press, Oxford, UK).Crossref, Google Scholar
- [6] (1995) Reinforcement learning methods for continuous-time Markov decision. Tesauro G, Touretzky D, Leen T, eds. Advances in Neural Information Processing Systems, vol. 7 (MIT Press, Cambridge, MA).Google Scholar
- [7] (2017) A short note on Poisson tail bounds. http://www.cs.columbia.edu/∼ccanonne/files/misc/2017-poissonconcentration.pdf.Google Scholar
- [8] (2019) Policy certificates: Towards accountable reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 1507–1516.Google Scholar
- [9] (1999) Solving semi-Markov decision problems using average reward reinforcement learning. Management Sci. 45(4):560–574.Link, Google Scholar
- [10] (1967) Contraction mappings in the theory underlying dynamic programming. SIAM Rev. 9(2):165–177.Crossref, Google Scholar
- [11] (2021) Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. Algorithmic Learn. Theory (PMLR, New York), 578–598.Google Scholar
- [12] (2017) Exploration-exploitation in MDPs with options. Artificial Intelligence Statist. (PMLR, New York), 576–584.Google Scholar
- [13] (1994) Optimal dynamic pricing of inventories with stochastic demand over finite horizons. Management Sci. 40(8):999–1020.Link, Google Scholar
- [14] (2022) Logarithmic regret bounds for continuous-time average-reward Markov decision processes. Preprint, submitted May 23, https://arxiv.org/abs/2205.11168.Google Scholar
- [15] (2019) Explore first, exploit next: The true shape of regret in bandit problems. Math. Oper. Res. 44(2):377–399.Link, Google Scholar
- [16] (2009) Continuous Time Markov Decision Processes: Theory and Applications (Springer-Verlag, Berlin, Heidelberg).Crossref, Google Scholar
- [17] (2021) Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls. Preprint, submitted April 19, https://arxiv.org/abs/2104.09311.Google Scholar
- [18] (2015) Finite-horizon optimality for continuous-time Markov decision processes with unbounded transition rates. Adv. Appl. Probab. 47(4):1064–1087.Crossref, Google Scholar
- [19] (2011) Finite horizon semi-Markov decision processes with application to maintenance systems. Eur. J. Oper. Res. 212(1):131–140.Crossref, Google Scholar
- [20] (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(51):1563–1600.Google Scholar
- [21] (2022) Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. J. Machine Learn. Res. 23(154):1–55.Google Scholar
- [22] (2022) Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. J. Machine Learn. Res. 23(275):1–50.Google Scholar
- [23] (2023) q-Learning in continuous time. J. Machine Learn. Res. 24(161):1–61.Google Scholar
- [24] (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., Red Hook, NY).Google Scholar
- [25] (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- [26] (1976) Countable-state, continuous-time dynamic programming with structure. Oper. Res. 24(3):477–490.Link, Google Scholar
- [27] (1986) Successive approximations for finite horizon, semi-Markov decision processes with application to asset liquidation. Oper. Res. 34(4):638–644.Link, Google Scholar
- [28] (1968) Finite state continuous time Markov decision processes with a finite planning horizon. SIAM J. Control 6(2):266–280.Crossref, Google Scholar
- [29] (2017) Why is posterior sampling better than optimism for reinforcement learning? Internat. Conf. Machine Learn. (PMLR, New York), 2701–2710.Google Scholar
- [30] (2020) Continuous-time Markov decision processes. Probability Theory and Stochastic Modelling (Springer, Berlin, Heidelberg).Google Scholar
- [31] (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
- [32] (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
- [33] (2021) Exploration-exploitation trade-off for continuous-time episodic reinforcement learning with linear-convex models. Preprint, submitted December 19, https://arxiv.org/abs/2112.10264.Google Scholar
- [34] (2019) Making deep q-learning methods robust to time discretization. Internat. Conf. Machine Learn. (PMLR, New York), 6096–6104.Google Scholar
- [35] (2020) Reinforcement learning in continuous time and space: A stochastic control approach. J. Machine Learn. Res. 21(198):1–34.Google Scholar
- [36] (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Internat. Conf. Machine Learn. (PMLR, New York), 7304–7312.Google Scholar

