Kernel-Based Distributed Q-Learning: A Scalable Reinforcement Learning Approach for Dynamic Treatment Regimes

Published Online: https://doi.org/10.1287/ijoc.2025.1183

References

  • Almirall D, Chronis-Tuscano A (2016) Adaptive interventions in child and adolescent mental health. J. Clinical Child Adolescent Psych. 45(4):383–395.
  • Caponnetto A, De Vito E (2007) Optimal rates for the regularized least-squares algorithm. Foundations Comput. Math. 7(3):331–368.
  • Chakraborty B, Moodie EEM (2013) Statistical Reinforcement Learning (Springer, Berlin).
  • Duan Y, Jin C, Li Z (2021) Risk bounds and Rademacher complexity in batch reinforcement learning. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (PMLR, New York), 2892–2902.
  • Even-Dar E, Mansour Y (2003) Learning rates for q-learning. J. Machine Learn. Res. 5:1–25.
  • Fan J, Wang Z, Xie Y, Yang Z (2019) A theoretical analysis of deep q-learning. Preprint, submitted January 1, https://arxiv.org/pdf/1901.00137.
  • Figueiredo Prudencio R, Maximo MROA, Colombini EL (2024) A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Trans. Neural Networks Learn. Systems 35(8):10237–10257.
  • François-Lavet V, Henderson P, Islam R, Bellemare MG, Pineau J (2018) An introduction to deep reinforcement learning. Foundations Trends Machine Learn. 11(3–4):219–354.
  • Fujimoto S, van Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 80 (PMLR, New York), 1587–1596.
  • Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. Teh YW, Titterington M, eds. Proc. 13th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 9 (PMLR, New York), 249–256.
  • Goldberg Y, Kosorok MR (2012) Q-learning with censored data. Ann. Statist. 40(1):529–560.
  • Goodfellow I, Bengio Y, Courville A (2016) Deep Learning (MIT Press, Cambridge, MA).
  • Gosavi A (2009) Reinforcement learning: A tutorial survey and recent advances. INFORMS J. Comput. 21(2):178–192.
  • Györfi L, Kohler M, Krzyzak A, Walk H (2006) A Distribution-Free Theory of Nonparametric Regression (Springer Science & Business Media, Boston).
  • Haarnoja T, Zhou A, Hartikainen K, Tucker G, Ha S, Tan J, Kumar V, et al. (2018) Soft actor-critic algorithms and applications. Preprint, submitted December 13, https://arxiv.org/abs/1812.05905.
  • He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. Proc. IEEE Internat. Conf. Comput. Vision (IEEE Computer Society, Washington, DC), 1026–1034.
  • Humphrey K (2017) Using reinforcement learning to personalize dosing strategies in a simulated cancer trial with high dimensional data. MS thesis, University of Arizona, Tucson.
  • Ishigooka J, Murasaki M, Miura S, Group T (2000) Olanzapine optimal dose: Results of an open-label multicenter study in schizophrenic patients. Psychiatry Clin. Neurosci. 54(4):467–478.
  • Janner M, Fu J, Zhang M, Levine S (2019) When to trust your model: Model-based policy optimization. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems, vol. 32 (Curran Associates, Inc., Red Hook, NY), 12519–12530.
  • Kearns MJ, Singh SP (1998) Finite-sample convergence rates for q-learning and indirect algorithms. Kearns MJ, Solla SA, Cohn DA, eds. Proc. 11th Internat. Conf. Neural Inform. Processing Systems, vol. 11 (MIT Press, Cambridge, MA), 996–1002.
  • Levine S, Kumar A, Tucker G, Fu J (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems. Preprint, submitted May 4, https://arxiv.org/abs/2005.01643.
  • Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, et al. (2016) Continuous control with deep reinforcement learning. Bengio Y, LeCun Y, eds. Proc. 4th Internat. Conf. Learn. Representations (OpenReview.net).
  • Lin SB, Zhou DX (2018) Distributed kernel-based gradient descent algorithms. Constructive Approximation 47(2):249–276.
  • Lin SB, Guo X, Zhou DX (2017) Distributed learning with regularized least squares. J. Machine Learn. Res. 18(92):1–31.
  • Lin SB, Wang D, Zhou DX (2020) Distributed kernel ridge regression with communications. J. Machine Learn. Res. 21(93):1–38.
  • Liu S, Su H (2022) Provably efficient kernelized q-learning. Preprint, submitted April 21, https://arxiv.org/abs/2204.10349.
  • Meister M, Steinwart I (2016) Optimal learning rates for localized SVMs. J. Machine Learn. Res. 17(194):1–44.
  • Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 48 (PMLR, New York), 1928–1937.
  • Murphy SA (2005a) An experimental design for the development of adaptive treatment strategies. Statist. Medicine 24(10):1455–1481.
  • Murphy SA (2005b) A generalization error for q-learning. J. Machine Learn. Res. 6(37):1073–1097.
  • Oh EJ, Qian M, Cheung YK (2022) Generalization error bounds of dynamic treatment regimes in penalized regression-based learning. Ann. Statist. 50(4):2047–2071.
  • Ong HY, Chavez K, Hong A (2015) Distributed deep q-learning. Preprint, submitted August 18, https://arxiv.org/abs/1508.04186.
  • Oroojlooyjadid A, Nazari M, Snyder LV, Takáč M (2022) A deep q-network for the beer game: Deep reinforcement learning for inventory optimization. Manufacturing Service Oper. Management 24(1):285–304.
  • Padmanabhan R, Meskin N, Haddad WM (2017) Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment. Math. Biosci. 293(3):11–20.
  • Pinelis I (1994) Optimum bounds for the distributions of martingales in Banach spaces. Ann. Probability 22(4):1679–1706.
  • Raghu A, Komorowski M, Celi LA, Szolovits P, Ghassemi M (2017) Continuous state-space models for optimal sepsis treatment: A deep reinforcement learning approach. Doshi-Velez F, Fackler J, Kale D, Ranganath R, Wallace B, Wiens J, eds. Proc. Machine Learn. for Healthcare Conf., Proceedings of Machine Learning Research, vol. 68 (PMLR, New York), 147–163.
  • Rudi A, Camoriano R, Rosasco L (2015) Less is more: Nyström computational regularization. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Proc. 28th Internat. Conf. Neural Inform. Processing Systems, vol. 28 (Curran Associates, Inc., Red Hook, NY), 1657–1665.
  • Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. Preprint, submitted July 20, https://arxiv.org/abs/1707.06347.
  • Socinski MA, Stinchcombe TE (2007) Duration of first-line chemotherapy in advanced non small-cell lung cancer: Less is more in the era of effective subsequent therapies. J. Clinical Oncology 25(33):5155–5157.
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).
  • Sutton RS, McAllester D, Singh S, Mansour Y (1999) Policy gradient methods for reinforcement learning with function approximation. Solla SA, Leen TK, Müller KR, eds. Proc. 12th Internat. Conf. Neural Inform. Processing Systems, vol. 12 (MIT Press, Cambridge, MA), 1057–1063.
  • Tseng HH, Luo Y, Cui S, Chien JT, Ten Haken RK, Naqa IE (2017) Deep reinforcement learning for automated radiation adaptation in lung cancer. Medical Phys. 44(12):6690–6705.
  • Tsiatis AA, Davidian M, Holloway ST, Laber EB (2019) Dynamic Treatment Regimes: Statistical Methods for Precision Medicine (Chapman and Hall/CRC, Boca Raton, FL).
  • Wainwright MJ (2019) Variance-reduced q-learning is minimax optimal. Preprint, submitted June 11, https://arxiv.org/abs/1906.04697.
  • Wang R, Foster DP, Kakade SM (2020) What are the statistical limits of offline RL with linear function approximation? Preprint, submitted October 22, https://arxiv.org/abs/2010.11895.
  • Watkins CJ, Dayan P (1992) Q-learning. Machine Learn. 8(3–4):279–292.
  • Xu X, Hu D, Lu X (2007) Kernel-based least squares policy iteration for reinforcement learning. IEEE Trans. Neural Networks 18(4):973–992.
  • Yu C, Liu J, Nemati S, Yin G (2021) Reinforcement learning in healthcare: A survey. ACM Comput. Survey 55(1):1–36.
  • Zhang Y, Duchi J, Wainwright M (2015) Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Machine Learn. Res. 16(102):3299–3340.
  • Zhao Y, Kosorok MR, Zeng D (2009) Reinforcement learning design for cancer clinical trials. Statist. Medicine 28(26):3294–3315.