Ambiguous Dynamic Treatment Regimes: A Reinforcement Learning Approach

Soroush Saghafian
https://orcid.org/0000-0002-9781-6561
Harvard Kennedy School, Harvard University, Cambridge, Massachusetts 02138

Published Online: 4 Oct 2023
https://doi.org/10.1287/mnsc.2022.00883

Volume 70, Issue 9

September 2024

Pages 5627-6482, iii-v

Information

Received: March 21, 2022
Accepted: May 26, 2023
Published Online: October 04, 2023

Cite as

Soroush Saghafian (2023) Ambiguous Dynamic Treatment Regimes: A Reinforcement Learning Approach. Management Science 70(9):5667-5690.

https://doi.org/10.1287/mnsc.2022.00883

Acknowledgments

The author is grateful to Susan Murphy (Harvard), Richard Zeckhauser (Harvard), and Guido Imbens (Stanford) for their valuable suggestions and comments.
