Learning the Minimal Representation of a Continuous State-Space Markov Decision Process from Transition Data

Published online: https://doi.org/10.1287/mnsc.2022.01652

References

  • Alagoz O, Maillart LM, Schaefer AJ, Roberts MS (2004) The optimal timing of living-donor liver transplantation. Management Sci. 50(10):1420–1430.
  • Azizzadenesheli K, Lazaric A, Anandkumar A (2016) Reinforcement learning in rich-observation MDPs using spectral methods. Preprint, submitted November 11, https://arxiv.org/abs/1611.03907v4.
  • Baird L (1995) Residual algorithms: Reinforcement learning with function approximation. Prieditis A, Russell S, eds. Machine Learn. Proc. (Morgan Kaufmann, San Francisco), 30–37.
  • Bennouna A, Joseph J, Nze-Ndong D, Perakis G, Singhvi D, Lami OS, Spantidakis Y, Thayaparan L, Tsiourvas A (2023) Covid-19: Prediction, prevalence, and the operations of vaccine allocation. Manufacturing Service Oper. Management 25(3):1013–1032.
  • Bertsekas DP (1995) Dynamic Programming and Optimal Control, vol. 1 (Athena Scientific, Belmont, MA).
  • Blumer A, Ehrenfeucht A, Haussler D, Warmuth MK (1989) Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36(4):929–965.
  • Brafman RI, Tennenholtz M (2002) R-max - A general polynomial time algorithm for near-optimal reinforcement learning. J. Machine Learn. Res. 3(October):213–231.
  • Coronato A, Naeem M, De Pietro G, Paragliola G (2020) Reinforcement learning for intelligent healthcare applications: A survey. Artificial Intelligence Medicine 109:101964.
  • Dann C, Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2018) On oracle-efficient PAC RL with rich observations. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 1422–1432.
  • Du S, Krishnamurthy A, Jiang N, Agarwal A, Dudik M, Langford J (2019) Provably efficient RL with rich observations via latent state decoding. Internat. Conf. Machine Learn. (PMLR, New York), 1665–1674.
  • Dua D, Graff C (2017) UCI machine learning repository. Accessed September 1, 2024, http://archive.ics.uci.edu/ml.
  • Eckardt JN, Wendt K, Bornhäuser M, Middeke JM (2021) Reinforcement learning for precision oncology. Cancers (Basel) 13(18):4624.
  • Ernst D, Stan GB, Goncalves J, Wehenkel L (2006) Clinical data based optimal STI strategies for HIV: A reinforcement learning approach. Proc. 45th IEEE Conf. Decision Control (IEEE, Piscataway, NJ), 667–672.
  • Feng F, Wang R, Yin W, Du SS, Yang LF (2020) Provably efficient exploration for reinforcement learning using unsupervised learning. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY).
  • Givan R, Dean T, Greig M (2003) Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147(1–2):163–223.
  • Hanneke S (2016) The optimal sample complexity of PAC learning. J. Machine Learn. Res. 17(1):1319–1333.
  • Hennessy M, Milner R (1985) Algebraic laws for nondeterminism and concurrency. J. ACM 32(1):137–161.
  • Hirano K, Imbens GW, Ridder G (2003) Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4):1161–1189.
  • Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(4):1563–1600.
  • Jedra Y, Lee J, Proutiere A, Yun SY (2023) Nearly optimal latent state decoding in block MDPs. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 2805–2904.
  • Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Conf. Learn. Theory (PMLR, New York), 2137–2143.
  • Johnson M, Hofmann K, Hutton T, Bignell D (2016) The Malmo platform for artificial intelligence experimentation. IJCAI’16 Proc. 25th Internat. Joint Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 4246–4247.
  • Kearns M, Singh S (2002) Near-optimal reinforcement learning in polynomial time. Machine Learn. 49(2):209–232.
  • Krishnamurthy A, Agarwal A, Langford J (2016) PAC reinforcement learning with rich observations. Preprint, submitted February 8, https://arxiv.org/abs/1602.02722.
  • Le L, Lin A, Pachamanova D, Perakis G, Skali Lami O (2023) An interpretable robust framework for sepsis treatment with limited resources. MSOM Conf.
  • Lee I (2023) Is separately modeling subpopulations beneficial for sequential decision-making? Oper. Res., ePub ahead of print May 18, https://doi.org/10.1287/opre.2023.2474.
  • Levine S, Kumar A, Tucker G, Fu J (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems. Preprint, submitted May 4, https://arxiv.org/abs/2005.01643.
  • Li L, Chu W, Langford J, Wang X (2011) Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. Proc. Fourth ACM Internat. Conf. Web Search Data Mining (Association for Computing Machinery, New York), 297–306.
  • Mandel T, Liu YE, Levine S, Brunskill E, Popovic Z (2014) Offline policy evaluation across representations with applications to educational games. Proc. 2014 Internat. Conf. Autonomous Agents Multi-Agent Systems (Paris), 1077–1084.
  • Misra D, Henaff M, Krishnamurthy A, Langford J (2020) Kinematic state abstraction and provably efficient rich-observation reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 6961–6971.
  • Nemati S, Ghassemi MM, Clifford GD (2016) Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. 2016 38th Annual Internat. Conf. IEEE Engrg. Medicine Biol. Soc. (EMBC) (IEEE, Piscataway, NJ), 2978–2981.
  • Peng X, Ding Y, Wihl D, Gottesman O, Komorowski M, Li-wei HL, Ross A, Faisal A, Doshi-Velez F (2018) Improving sepsis treatment strategies by combining deep and kernel-based reinforcement learning. AMIA Annual Sympos. Proc. 2018:887–896.
  • Raghu A, Komorowski M, Celi LA, Szolovits P, Ghassemi M (2017) Continuous state-space models for optimal sepsis treatment - a deep reinforcement learning approach. Doshi-Velez F, Fackler J, Kale D, Ranganath R, Wallace B, Wiens J, eds. Proc. 2nd Machine Learn. Healthcare Conf., vol. 68 (PMLR, New York), 147–163.
  • Riachi E, Mamdani M, Fralick M, Rudzicz F (2021) Challenges for reinforcement learning in healthcare. Preprint, submitted March 9, https://arxiv.org/abs/2103.05612.
  • Rokach L, Maimon O (2005) Clustering methods. Maimon O, Rokach L, eds. Data Mining and Knowledge Discovery Handbook (Springer, Boston), 321–352.
  • Russo D (2020) Approximation benefits of policy gradient methods with aggregated states. Management Sci. 69(11):6898–6911.
  • Sinclair SR, Banerjee S, Yu CL (2019) Adaptive discretization for episodic reinforcement learning in metric spaces. Proc. ACM Measurement Anal. Comput. Systems (Association for Computing Machinery, New York), 1–44.
  • Sinclair SR, Banerjee S, Yu CL (2023) Adaptive discretization in online reinforcement learning. Oper. Res. 71(5):1636–1652.
  • Singal R, Besbes O, Desir A, Goyal V, Iyengar G (2022) Shapley meets uniform: An axiomatic framework for attribution in online advertising. Management Sci. 68(10):7457–7479.
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).
  • Van Roy B (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.
  • Vapnik V (1998) Statistical Learning Theory (John Wiley & Sons, New York).
  • Vapnik VN (2019) Complete statistical theory of learning. Automation Remote Control 80(11):1949–1975.
  • Wen Z, Van Roy B (2013) Efficient exploration and value function generalization in deterministic systems. Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Inc., Red Hook, NY), 3021–3029.
  • Yang CY, Shiranthika C, Wang CY, Chen KW, Sumathipala S (2022) Reinforcement learning strategies in cancer chemotherapy treatment: A review. Comput. Methods Programs Biomedicine 229:107280.
  • Zhang Y, Steimle L, Denton BT (2017) Robust Markov decision processes for medical treatment decisions. Optimization Online (September 21), https://optimization-online.org/?p=13654.
  • Zhang A, Sodhani S, Khetarpal K, Pineau J (2021) Learning robust state abstractions for hidden-parameter block MDPs. Internat. Conf. Learn. Representations (OpenReview.net).