Learning the Minimal Representation of a Continuous State-Space Markov Decision Process from Transition Data

Amine Bennouna
https://orcid.org/0000-0002-9123-8588
Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Dessislava Pachamanova
https://orcid.org/0000-0002-1373-1553
Mathematics, Analytics, Science and Technology Division, Babson College, Wellesley, Massachusetts 02457

Georgia Perakis (Corresponding Author)
https://orcid.org/0000-0002-0888-9030
Sloan School of Management, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Omar Skali Lami
Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Published Online: 26 Sep 2024
https://doi.org/10.1287/mnsc.2022.01652

References

Alagoz O, Maillart LM, Schaefer AJ, Roberts MS (2004) The optimal timing of living-donor liver transplantation. Management Sci. 50(10):1420–1430.
Azizzadenesheli K, Lazaric A, Anandkumar A (2016) Reinforcement learning in rich-observation MDPs using spectral methods. Preprint, submitted November 11, https://arxiv.org/abs/1611.03907v4.
Baird L (1995) Residual algorithms: Reinforcement learning with function approximation. Prieditis A, Russell S, eds. Machine Learn. Proc. (Morgan Kaufmann, San Francisco), 30–37.
Bennouna A, Joseph J, Nze-Ndong D, Perakis G, Singhvi D, Lami OS, Spantidakis Y, Thayaparan L, Tsiourvas A (2023) Covid-19: Prediction, prevalence, and the operations of vaccine allocation. Manufacturing Service Oper. Management 25(3):1013–1032.
Bertsekas DP (1995) Dynamic Programming and Optimal Control, vol. 1 (Athena Scientific, Belmont, MA).
Blumer A, Ehrenfeucht A, Haussler D, Warmuth MK (1989) Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36(4):929–965.
Brafman RI, Tennenholtz M (2002) R-max - A general polynomial time algorithm for near-optimal reinforcement learning. J. Machine Learn. Res. 3(October):213–231.
Coronato A, Naeem M, De Pietro G, Paragliola G (2020) Reinforcement learning for intelligent healthcare applications: A survey. Artificial Intelligence Medicine 109:101964.
Dann C, Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2018) On oracle-efficient PAC RL with rich observations. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 1422–1432.
Du S, Krishnamurthy A, Jiang N, Agarwal A, Dudik M, Langford J (2019) Provably efficient RL with rich observations via latent state decoding. Internat. Conf. Machine Learn. (PMLR, New York), 1665–1674.
Dua D, Graff C (2017) UCI machine learning repository. Accessed September 1, 2024, http://archive.ics.uci.edu/ml.
Eckardt JN, Wendt K, Bornhäuser M, Middeke JM (2021) Reinforcement learning for precision oncology. Cancers (Basel) 13(18):4624.
Ernst D, Stan GB, Goncalves J, Wehenkel L (2006) Clinical data based optimal STI strategies for HIV: A reinforcement learning approach. Proc. 45th IEEE Conf. Decision Control (IEEE, Piscataway, NJ), 667–672.
Feng F, Wang R, Yin W, Du SS, Yang LF (2020) Provably efficient exploration for reinforcement learning using unsupervised learning. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY).
Givan R, Dean T, Greig M (2003) Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147(1–2):163–223.
Hanneke S (2016) The optimal sample complexity of PAC learning. J. Machine Learn. Res. 17(1):1319–1333.
Hennessy M, Milner R (1985) Algebraic laws for nondeterminism and concurrency. J. ACM 32(1):137–161.
Hirano K, Imbens GW, Ridder G (2003) Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4):1161–1189.
Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(4):1563–1600.
Jedra Y, Lee J, Proutiere A, Yun SY (2023) Nearly optimal latent state decoding in block MDPs. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 2805–2904.
Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Conf. Learn. Theory (PMLR, New York), 2137–2143.
Johnson M, Hofmann K, Hutton T, Bignell D (2016) The Malmo platform for artificial intelligence experimentation. IJCAI’16 Proc. 25th Internat. Joint Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 4246–4247.
Kearns M, Singh S (2002) Near-optimal reinforcement learning in polynomial time. Machine Learn. 49(2):209–232.
Krishnamurthy A, Agarwal A, Langford J (2016) PAC reinforcement learning with rich observations. Preprint, submitted February 8, https://arxiv.org/abs/1602.02722.
Le L, Lin A, Pachamanova D, Perakis G, Skali Lami O (2023) An interpretable robust framework for sepsis treatment with limited resources. MSOM Conf.
Lee I (2023) Is separately modeling subpopulations beneficial for sequential decision-making? Oper. Res., ePub ahead of print May 18, https://doi.org/10.1287/opre.2023.2474.
Levine S, Kumar A, Tucker G, Fu J (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems. Preprint, submitted May 4, https://arxiv.org/abs/2005.01643.
Li L, Chu W, Langford J, Wang X (2011) Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. Proc. Fourth ACM Internat. Conf. Web Search Data Mining (Association for Computing Machinery, New York), 297–306.
Mandel T, Liu YE, Levine S, Brunskill E, Popovic Z (2014) Offline policy evaluation across representations with applications to educational games. Proc. 2014 Internat. Conf. Autonomous Agents Multi-Agent Systems (Paris), 1077–1084.
Misra D, Henaff M, Krishnamurthy A, Langford J (2020) Kinematic state abstraction and provably efficient rich-observation reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 6961–6971.
Nemati S, Ghassemi MM, Clifford GD (2016) Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. 2016 38th Annual Internat. Conf. IEEE Engrg. Medicine Biol. Soc. (EMBC) (IEEE, Piscataway, NJ), 2978–2981.
Peng X, Ding Y, Wihl D, Gottesman O, Komorowski M, Li-wei HL, Ross A, Faisal A, Doshi-Velez F (2018) Improving sepsis treatment strategies by combining deep and kernel-based reinforcement learning. AMIA Annual Sympos. Proc. 2018:887–896.
Raghu A, Komorowski M, Celi LA, Szolovits P, Ghassemi M (2017) Continuous state-space models for optimal sepsis treatment-a deep reinforcement learning approach. Doshi-Velez F, Fackler J, Kale D, Ranganath R, Wallace B, Wiens J, eds. Proc. 2nd Machine Learn. Healthcare Conf., vol. 68 (PMLR, New York), 147–163.
Riachi E, Mamdani M, Fralick M, Rudzicz F (2021) Challenges for reinforcement learning in healthcare. Preprint, submitted March 9, https://arxiv.org/abs/2103.05612.
Rokach L, Maimon O (2005) Clustering methods. Maimon O, Rokach L, eds. Data Mining and Knowledge Discovery Handbook (Springer, Boston), 321–352.
Russo D (2020) Approximation benefits of policy gradient methods with aggregated states. Management Sci. 69(11):6898–6911.
Sinclair SR, Banerjee S, Yu CL (2019) Adaptive discretization for episodic reinforcement learning in metric spaces. Proc. ACM Measurement Anal. Comput. Systems (Association for Computing Machinery, New York), 1–44.
Sinclair SR, Banerjee S, Yu CL (2023) Adaptive discretization in online reinforcement learning. Oper. Res. 71(5):1636–1652.
Singal R, Besbes O, Desir A, Goyal V, Iyengar G (2022) Shapley meets uniform: An axiomatic framework for attribution in online advertising. Management Sci. 68(10):7457–7479.
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).
Van Roy B (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.
Vapnik V (1998) Statistical Learning Theory (John Wiley & Sons, New York).
Vapnik VN (2019) Complete statistical theory of learning. Automation Remote Control 80(11):1949–1975.
Wen Z, Van Roy B (2013) Efficient exploration and value function generalization in deterministic systems. Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Inc., Red Hook, NY), 3021–3029.
Yang CY, Shiranthika C, Wang CY, Chen KW, Sumathipala S (2022) Reinforcement learning strategies in cancer chemotherapy treatment: A review. Comput. Methods Programs Biomedicine 229:107280.
Zhang Y, Steimle L, Denton BT (2017) Robust Markov decision processes for medical treatment decisions. Optimization Online (September 21), https://optimization-online.org/?p=13654.
Zhang A, Sodhani S, Khetarpal K, Pineau J (2021) Learning robust state abstractions for hidden-parameter block MDPs. Internat. Conf. Learn. Representations (OpenReview.net).

Volume 71, Issue 6

June 2025

Pages iv-vi, 4533-5418

Article Information

Received: June 05, 2022
Accepted: July 22, 2024
Published Online: September 26, 2024

Cite as

Amine Bennouna, Dessislava Pachamanova, Georgia Perakis, Omar Skali Lami (2024) Learning the Minimal Representation of a Continuous State-Space Markov Decision Process from Transition Data. Management Science 71(6):5162-5184.

https://doi.org/10.1287/mnsc.2022.01652

Acknowledgments

The authors thank Janice Yang, Lowell Hensge, Albert Luo, and William Zhao for their excellent research assistance and help with code development. The authors also extend their heartfelt gratitude to the late Rositsa Milyankova for insightful discussions on diabetes treatment. Finally, the authors thank the department editor, the associate editor, and three anonymous referees for their constructive and helpful feedback on previous versions of this manuscript.
