The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model

Published Online: https://doi.org/10.1287/opre.2025.2240

This paper investigates model robustness in reinforcement learning (RL) to reduce the sim-to-real gap in practice. We adopt the framework of distributionally robust Markov decision processes (RMDPs), aimed at learning a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal Markov decision process (MDP). Despite recent efforts, the sample complexity of RMDPs remained mostly unsettled regardless of the uncertainty set in use. It was unclear whether distributional robustness bears any statistical consequences when benchmarked against standard RL. Assuming access to a generative model that draws samples based on the nominal MDP, we provide a near-optimal characterization of the sample complexity of RMDPs when the uncertainty set is specified via either the total variation (TV) distance or the χ² divergence. The algorithm studied here is a model-based method called distributionally robust value iteration, which is shown to be near-optimal for the full range of uncertainty levels. Somewhat surprisingly, our results uncover that RMDPs are not necessarily easier or harder to learn than standard MDPs. The statistical consequence incurred by the robustness requirement depends heavily on the size and shape of the uncertainty set: for the TV distance, the minimax sample complexity of RMDPs is always smaller than that of standard MDPs; for the χ² divergence, the sample complexity of RMDPs far exceeds the standard MDP counterpart.
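To make the idea of distributionally robust value iteration concrete, the following is a minimal sketch for the TV case only. It is not the authors' implementation. It assumes a small tabular MDP, an uncertainty set defined independently for each state-action pair as all distributions q with (1/2)·||q − p||₁ ≤ σ around the nominal row p, and rewards in [0, 1]. The function names `worst_case_expectation_tv` and `robust_value_iteration`, and the layouts of `P` and `r`, are hypothetical choices made for illustration. In a generative-model setting, `P` would be the empirical kernel estimated from samples; here it is simply passed in.

```python
def worst_case_expectation_tv(p, v, sigma):
    """Smallest value of sum_s q[s] * v[s] over distributions q with
    0.5 * ||q - p||_1 <= sigma (assumed form of the TV uncertainty set).

    The minimizer moves up to sigma probability mass from the
    highest-value states onto the single lowest-value state.
    """
    q = list(p)
    lowest = min(range(len(v)), key=lambda s: v[s])
    budget = sigma
    # Visit states from highest to lowest value and drain their mass.
    for s in sorted(range(len(v)), key=lambda s: v[s], reverse=True):
        if s == lowest or budget <= 0:
            continue
        moved = min(q[s], budget)
        q[s] -= moved
        q[lowest] += moved
        budget -= moved
    return sum(qs * vs for qs, vs in zip(q, v))


def robust_value_iteration(P, r, gamma, sigma, iters=500):
    """Iterate the robust Bellman operator on a tabular MDP.

    P[s][a] is a list of next-state probabilities (nominal or empirical),
    r[s][a] is the immediate reward, gamma is the discount factor.
    Returns the approximate robust value of each state.
    """
    n_states = len(P)
    v = [0.0] * n_states
    for _ in range(iters):
        v = [
            max(
                r[s][a] + gamma * worst_case_expectation_tv(P[s][a], v, sigma)
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
    return v
```

Setting `sigma = 0` collapses the uncertainty set to the nominal kernel, so the routine reduces to standard value iteration; increasing `sigma` can only lower the returned values, since the adversary gains more freedom.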

Funding: The work of L. Shi and Y. Chi is supported in part by [Grant ONR N00014-19-1-2404], the National Science Foundation [Grant CCF-2106778], [Grant DMS-2134080], and [Grant CNS-2148212]. L. Shi is supported by the Leo Finzi Memorial Fellowship, Wei Shen and Xuehong Zhang Presidential Fellowship, Liang Ji-Dian Graduate Fellowship at Carnegie Mellon University, the Resnick Institute and the California Institute of Technology [Computing, Data, and Society Postdoctoral Fellowship]. G. Li is supported in part by the Chinese University of Hong Kong [Direct Grant for Research] and the Hong Kong Research Grants Council ECS 24305724 and GRF 14307525. The work of Y. Wei is supported in part by the National Science Foundation [Grants DMS-2147546/2015447, CAREER award DMS-2143215, and CCF-2106778] and the Google Research Scholar Award. The work of Y. Chen is supported in part by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the Air Force Office of Scientific Research [Grant FA9550-22-1-0198], the Office of Naval Research [Grant N00014-22-1-2354], and the National Science Foundation [Grants CCF-2221009 and CCF-1907661].

Supplemental Material: All supplemental materials, including the code, data, and files required to reproduce the results, are available at https://doi.org/10.1287/opre.2025.2240.
