Kernel-Based Distributed Q-Learning: A Scalable Reinforcement Learning Approach for Dynamic Treatment Regimes

Di Wang
[email protected]
https://orcid.org/0000-0003-0435-0609
Center for Intelligent Decision-Making and Machine Learning, School of Management, Xi’an Jiaotong University, Xi’an 710049, China
Yao Wang
[email protected]
https://orcid.org/0000-0003-4207-5273
Center for Intelligent Decision-Making and Machine Learning, School of Management, Xi’an Jiaotong University, Xi’an 710049, China
Shao-Bo Lin
Corresponding Author
[email protected]
https://orcid.org/0000-0001-5122-9153
Center for Intelligent Decision-Making and Machine Learning, School of Management, Xi’an Jiaotong University, Xi’an 710049, China

Published Online: 2 Jun 2026
https://doi.org/10.1287/ijoc.2025.1183

INFORMS Journal on Computing

Articles In Advance

Information

Received: February 20, 2025
Accepted: April 02, 2026
Published Online: June 02, 2026

Cite as

Di Wang, Yao Wang, Shao-Bo Lin (2026) Kernel-Based Distributed Q-Learning: A Scalable Reinforcement Learning Approach for Dynamic Treatment Regimes. INFORMS Journal on Computing 0(0).

https://doi.org/10.1287/ijoc.2025.1183

Acknowledgments

The authors thank the associate editor and two anonymous referees for invaluable comments and suggestions and Dr. Shaojie Tang for insightful suggestions and significant contributions to this work.
