Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes

Andrew Bennett
[email protected]
Cornell Tech, Cornell University, New York, New York 10044
Nathan Kallus
Corresponding Author
[email protected]
https://orcid.org/0000-0003-1672-0507
Cornell Tech, Cornell University, New York, New York 10044

Published Online: 26 Sep 2023
https://doi.org/10.1287/opre.2021.0781

Abstract

In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived under the assumption of a perfect Markov decision process (MDP) model. Here we tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, we consider estimating the value of a given target policy in an unknown POMDP given observations of trajectories with only partial state observations and generated by a different and unknown policy that may depend on the unobserved state. We tackle two questions: what conditions allow us to identify the target policy value from the observed data and, given identification, how to best estimate it. To answer these, we extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible by the existence of so-called bridge functions. We term the resulting framework proximal reinforcement learning (PRL). We then show how to construct estimators in these settings and prove they are semiparametrically efficient. We demonstrate the benefits of PRL in an extensive simulation study and on the problem of sepsis management.

Funding: This work was supported by the National Science Foundation [Grant 1846210].

Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2021.0781.

Volume 72, Issue 3

May-June 2024

Pages iii-vi, 871-1316, C2-C3

Information

Received: December 14, 2021
Accepted: July 27, 2023
Published Online: September 26, 2023

Cite as

Andrew Bennett, Nathan Kallus (2023) Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes. Operations Research 72(3):1071-1086.

https://doi.org/10.1287/opre.2021.0781
