Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games
Abstract
This paper makes progress toward learning Nash equilibria in two-player, zero-sum Markov games from offline data. Specifically, consider a γ-discounted, infinite-horizon Markov game with S states, in which the max-player has A actions and the min-player has B actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds—called the value iteration with lower confidence bounds for zero-sum Markov games—that provably finds an ε-approximate Nash equilibrium with a sample complexity no larger than $\frac{C^{\star}_{\mathsf{clipped}} S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C^{\star}_{\mathsf{clipped}}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy ε can be any value within $\big(0, \frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A, B\}$, achieving minimax optimality for a broad regime of interest. An appealing feature of our result lies in its algorithmic simplicity, which reveals that neither variance reduction nor sample splitting is needed to achieve sample optimality.
Funding: Y. Yan is supported in part by the Charlotte Elizabeth Procter Honorific Fellowship from Princeton University and the Norbert Wiener Postdoctoral Fellowship from MIT. Y. Chen is supported in part by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the Air Force Office of Scientific Research [Grant FA9550-22-1-0198], the Office of Naval Research [Grant N00014-22-1-2354], and the National Science Foundation [Grants CCF-2221009, CCF-1907661, IIS-2218713, DMS-2014279, and IIS-2218773]. J. Fan is supported in part by the National Science Foundation [Grants DMS-1712591, DMS-2052926, DMS-2053832, and DMS-2210833] and Office of Naval Research [Grant N00014-22-1-2340].
Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.0342.
1. Introduction
Multiagent reinforcement learning (MARL), a subfield of reinforcement learning (RL) that involves multiple individuals interacting/competing with each other in a shared environment, has garnered widespread recent interest, partly sparked by its capability of achieving superhuman performance in game playing and autonomous driving (Shalev-Shwartz et al. 2016, Baker et al. 2019, Berner et al. 2019, Brown and Sandholm 2019, Jaderberg et al. 2019, Vinyals et al. 2019). The coexistence of multiple players—whose own welfare might come at the expense of other parties involved—makes MARL inherently more intricate than the single-agent counterpart.
A standard framework to describe the environment and dynamics in competitive MARL is Markov games (MGs), which are generally attributed to Shapley (1953) (originally referred to as stochastic games). Given the conflicting needs of the players, a standard goal in Markov games is to seek some sort of steady-state solutions with the Nash equilibrium (NE) being arguably the most prominent one. Whereas computational intractability has been observed when calculating NEs in general-sum MGs and/or MGs with more than two players (Daskalakis et al. 2009, Daskalakis 2013), an assortment of tractable algorithms have been put forward to solve two-player, zero-sum Markov games. On this front, a large strand of recent works revolves around developing sample- and computation-efficient paradigms (Bai et al. 2020, Xie et al. 2020, Zhang et al. 2020a, Liu et al. 2021, Tian et al. 2021). What is particularly noteworthy here is the recent progress in overcoming the so-called curse of multiple agents (Bai et al. 2020, Jin et al. 2021a, Li et al. 2022b); that is, although the total number of joint actions exhibits exponential scaling in the number of agents, learnability of Nash equilibria becomes plausible even when the sample size scales only linearly with the maximum cardinality of the individual action spaces. See also Jin et al. (2021a), Song et al. (2021), Mao and Başar (2022), and Daskalakis et al. (2023) for similar accomplishments in learning coarse correlated equilibria in multiplayer general-sum MGs.
The aforementioned works permit online data collection either via active exploration of the environment or through sampling access to a simulator. Nevertheless, the fact that real-time data acquisition might be unaffordable or unavailable—for example, it could be time-consuming, costly, and/or unsafe in healthcare and autonomous driving—constitutes a major hurdle for widespread adoption of these online algorithms. This practical consideration inspires a recent flurry of studies collectively referred to as offline RL or batch RL (Kumar et al. 2020, Levine et al. 2020) with the aim of learning based on a historical data set of logged interactions.
1.1. Data Coverage for Offline Markov Games
The feasibility and efficiency of offline RL are largely governed by the coverage of the offline data in hand. On one hand, if the available data set covers all state–action pairs adequately, then there is sufficient information to guarantee learnability; on the other hand, full data coverage imposes an overly stringent requirement that is rarely fulfilled in practice and is oftentimes wasteful in terms of data efficiency. Consequently, a recurring theme in offline RL gravitates around the quest for algorithms that work under minimal data coverage. Encouragingly, the recent advancement on this frontier (e.g., Rashidinejad et al. 2021, Xie et al. 2021) uncovers the sufficiency of single-policy data coverage in single-agent RL; namely, offline RL becomes information-theoretically feasible as soon as the historical data covers the part of the state–action space reachable by a single target policy.
Unfortunately, single-policy coverage is provably insufficient when it comes to Markov games, as evidenced by the negative examples constructed in Cui and Du (2022b). Instead, a sort of unilateral coverage—that is, a condition that requires the data to cover not only the target policy pair but also any unilateral deviation from it—seems necessary to ensure learnability of Nash equilibria in two-player, zero-sum MGs. Employing the so-called unilateral concentrability coefficient to quantify such unilateral coverage as well as the degree of distribution shift (which we define shortly in Assumption 1), Cui and Du (2022b) demonstrate how to find ε-Nash solutions in a finite-horizon, two-player, zero-sum MG once the number of sample rollouts exceeds
Here, S is the number of shared states; A and B represent, respectively, the number of actions of the max-player and the min-player; H stands for the horizon length; and the notation $\widetilde{O}(\cdot)$ denotes the order-wise scaling with all logarithmic dependency hidden.
Despite being an intriguing polynomial sample complexity bound, a shortfall of (1) lies in its unfavorable scaling with AB (i.e., the total number of joint actions), which is substantially larger than the total number of individual actions A + B. Whether it is possible to alleviate this curse of multiple agents in a two-player, zero-sum Markov game—and, if so, how to accomplish it—is the key question to be investigated in the current paper.
1.2. An Overview of Main Results
The objective of this paper is to design a sample-efficient, offline RL algorithm for learning Nash equilibria in two-player, zero-sum Markov games, ideally breaking the curse of multiple agents. Focusing on γ-discounted, infinite-horizon MGs, we propose a model-based paradigm—called the value iteration with lower confidence bounds for zero-sum Markov game (VI-LCB-Game)—that is capable of learning an ε-approximate Nash equilibrium with sample complexity $\widetilde{O}\!\left(\frac{C^{\star}_{\mathsf{clipped}} S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}\right)$ for any target accuracy $\varepsilon \in \big(0, \frac{1}{1-\gamma}\big]$.
Finally, when finalizing the current paper, we became aware of an independent study (Cui and Du 2022a; posted to arXiv on June 1, 2022) that also manages to overcome the curse of multiple agents in a two-player, zero-sum Markov game, on which we elaborate toward the end of Section 3.
1.3. Notation
Before proceeding, let us introduce some notation that is used throughout. With slight abuse of notation, we use P to denote a probability transition kernel and the associated probability transition matrix exchangeably. We also use the notation μ exchangeably for a probability distribution and its associated probability vector (and we often do not specify whether μ is a row vector or column vector as long as it is clear from the context). For any two vectors $a = [a_i]_{1\le i\le n}$ and $b = [b_i]_{1\le i\le n}$, we use $a \circ b := [a_i b_i]_{1\le i\le n}$ to denote their Hadamard product, and we also define $a^{2} := a \circ a$ and $\sqrt{a} := [\sqrt{a_i}]_{1\le i\le n}$ in an entry-wise fashion. For a finite set $\mathcal{X}$, we let $\Delta(\mathcal{X})$ represent the probability simplex over the set $\mathcal{X}$.
2. Problem Formulation
In this section, we introduce the background of zero-sum Markov games, followed by a description of the offline data set.
2.1. Preliminaries
2.1.1. Zero-Sum, Two-Player Markov Games.
Consider a discounted, infinite-horizon, zero-sum MG (Shapley 1953, Littman 1994) as represented by the tuple $\{\mathcal{S}, \mathcal{A}, \mathcal{B}, P, r, \gamma\}$. Here, $\mathcal{S} = \{1,\dots,S\}$ is the shared state space; $\mathcal{A} = \{1,\dots,A\}$ (respectively, $\mathcal{B} = \{1,\dots,B\}$) is the action space of the max-player (min-player); $P: \mathcal{S}\times\mathcal{A}\times\mathcal{B} \to \Delta(\mathcal{S})$ is the (a priori unknown) probability transition kernel, where $P(s'\,|\,s,a,b)$ denotes the probability of transitioning from state s to state $s'$ if the max-player executes action a and the min-player chooses action b; $r: \mathcal{S}\times\mathcal{A}\times\mathcal{B} \to [0,1]$ is the reward function such that $r(s,a,b)$ indicates the immediate reward observed by both players in state s when the max-player takes action a and the min-player takes action b; and $\gamma \in (0,1)$ is the discount factor with $\frac{1}{1-\gamma}$ commonly referred to as the effective horizon. Throughout this paper, we primarily focus on the scenario in which S, A, B, and $\frac{1}{1-\gamma}$ could all be large. Additionally, for notational simplicity, we define the vector $P_{s,a,b} \in \mathbb{R}^{1\times S}$ as $P_{s,a,b}(s') := P(s'\,|\,s,a,b)$ for any $(s,a,b) \in \mathcal{S}\times\mathcal{A}\times\mathcal{B}$.
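To fix ideas, the following minimal Python sketch shows one way a tabular zero-sum Markov game of this kind could be stored in memory; the class and attribute names are illustrative conventions of ours and are not taken from the paper.

```python
import numpy as np

class ZeroSumMarkovGame:
    """Container for a tabular, discounted, two-player, zero-sum Markov game."""

    def __init__(self, P, r, gamma):
        self.P = np.asarray(P, dtype=float)   # transition kernel, shape (S, A, B, S)
        self.r = np.asarray(r, dtype=float)   # rewards in [0, 1], shape (S, A, B)
        self.gamma = float(gamma)             # discount factor in (0, 1)
        self.S, self.A, self.B = self.r.shape
        assert self.P.shape == (self.S, self.A, self.B, self.S)
        assert np.allclose(self.P.sum(axis=-1), 1.0), "each P[s, a, b, :] must sum to 1"

    @property
    def effective_horizon(self):
        # The quantity 1 / (1 - gamma) referred to as the effective horizon above.
        return 1.0 / (1.0 - self.gamma)
```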
2.1.2. Policy, Value Function, Q-Function, and Occupancy Distribution.
Let $\mu: \mathcal{S} \to \Delta(\mathcal{A})$ and $\nu: \mathcal{S} \to \Delta(\mathcal{B})$ be (possibly random) stationary policies of the max-player and the min-player, respectively. In particular, $\mu(\cdot\,|\,s) \in \Delta(\mathcal{A})$ ($\nu(\cdot\,|\,s) \in \Delta(\mathcal{B})$) specifies the action selection probability of the max-player (min-player) in state s. The value function $V^{\mu,\nu}: \mathcal{S}\to\mathbb{R}$ for a given product policy $(\mu,\nu)$ is defined as
$$V^{\mu,\nu}(s) := \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t,b_t)\ \Big|\ s_0 = s\right], \qquad \forall s\in\mathcal{S},$$
where the expectation is taken over the trajectory $\{(s_t,a_t,b_t)\}_{t\ge 0}$ generated by $a_t \sim \mu(\cdot\,|\,s_t)$, $b_t \sim \nu(\cdot\,|\,s_t)$, and $s_{t+1}\sim P(\cdot\,|\,s_t,a_t,b_t)$.
Moreover, we define the discounted occupancy measures associated with an initial state distribution $\rho \in \Delta(\mathcal{S})$ and the product policy $(\mu,\nu)$ as follows:
$$d^{\mu,\nu}_{\rho}(s) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{P}(s_t = s\,|\,s_0\sim\rho) \quad\text{and}\quad d^{\mu,\nu}_{\rho}(s,a,b) := d^{\mu,\nu}_{\rho}(s)\,\mu(a\,|\,s)\,\nu(b\,|\,s)$$
for all $(s,a,b)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}$, where the trajectory is generated in the Markov game under the policy pair $(\mu,\nu)$.
2.1.3. Nash Equilibrium.
In general, the two players have conflicting goals, with the max-player aiming to maximize the value function and the min-player to minimize it. As a result, a standard compromise in Markov games becomes finding an NE. To be precise, a policy pair $(\mu^{\star},\nu^{\star})$ is said to be a Nash equilibrium if no player can benefit from unilaterally changing the player's own policy given the opponent's policy (Nash 1951); that is, for any policies $\mu: \mathcal{S}\to\Delta(\mathcal{A})$ and $\nu: \mathcal{S}\to\Delta(\mathcal{B})$, one has
$$V^{\mu,\nu^{\star}}(s) \ \le\ V^{\mu^{\star},\nu^{\star}}(s) \ \le\ V^{\mu^{\star},\nu}(s), \qquad \forall s\in\mathcal{S}.$$
As is well known (Shapley 1953), there exists at least one Nash equilibrium in the discounted, two-player, zero-sum Markov game, and every NE results in the same value function, denoted by $V^{\star} := V^{\mu^{\star},\nu^{\star}}$.
In addition, when the max-player's policy μ is fixed, it is clearly seen that the MG reduces to a single-agent Markov decision process (MDP). In light of this, we define, for each policy pair $(\mu,\nu)$,
$$V^{\mu,\star}(s) := \min_{\nu'} V^{\mu,\nu'}(s) \quad\text{and}\quad V^{\star,\nu}(s) := \max_{\mu'} V^{\mu',\nu}(s), \qquad \forall s\in\mathcal{S},$$
which correspond to the value functions obtained when one player's policy is frozen and the other player plays the best response.
In this paper, our goal can be posed as calculating a policy pair $(\widehat{\mu},\widehat{\nu})$ whose duality gap with respect to the initial state distribution ρ is at most ε, namely,
$$V^{\star,\widehat{\nu}}(\rho) - V^{\widehat{\mu},\star}(\rho)\ \le\ \varepsilon,$$
where $V(\rho) := \mathbb{E}_{s\sim\rho}[V(s)]$ for any value function V. Any such pair $(\widehat{\mu},\widehat{\nu})$ is called an ε-approximate Nash equilibrium.
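Because the duality gap is defined through best responses, it can be evaluated for small tabular instances with a known model by solving the two induced single-agent MDPs. The sketch below does exactly that via plain value iteration; it is a diagnostic tool under the definitions above, not the learning algorithm studied in the paper, and the function names are ours.

```python
import numpy as np

def best_response_value(P, r, gamma, policy, against="min", iters=2000):
    """Value of the best response to a fixed opponent policy.

    P: (S, A, B, S) kernel; r: (S, A, B) rewards in [0, 1].
    If against == "min", `policy` is the min-player's nu (S, B) and we maximize
    over the max-player's actions; if against == "max", `policy` is the
    max-player's mu (S, A) and we minimize over the min-player's actions.
    """
    if against == "min":
        P_eff = np.einsum("sabt,sb->sat", P, policy)   # induced MDP kernel (S, A, S)
        r_eff = np.einsum("sab,sb->sa", r, policy)     # induced rewards (S, A)
        opt = np.max
    else:
        P_eff = np.einsum("sabt,sa->sbt", P, policy)   # (S, B, S)
        r_eff = np.einsum("sab,sa->sb", r, policy)     # (S, B)
        opt = np.min
    V = np.zeros(P.shape[0])
    for _ in range(iters):                             # plain value iteration
        V = opt(r_eff + gamma * (P_eff @ V), axis=1)
    return V

def duality_gap(P, r, gamma, rho, mu_hat, nu_hat):
    """max_mu V^{mu, nu_hat}(rho) - min_nu V^{mu_hat, nu}(rho)."""
    V_upper = best_response_value(P, r, gamma, nu_hat, against="min")
    V_lower = best_response_value(P, r, gamma, mu_hat, against="max")
    return float(rho @ (V_upper - V_lower))
```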
2.2. Offline Data Set (Batch Data Set)
Suppose that we have access to a historical data set $\mathcal{D} = \{(s_i, a_i, b_i, s_i')\}_{1\le i\le N}$ containing a batch of N sample transitions, which are generated independently from a distribution $d^{\mathsf{b}} \in \Delta(\mathcal{S}\times\mathcal{A}\times\mathcal{B})$ and the probability transition kernel P, namely,
$$(s_i, a_i, b_i) \overset{\mathrm{i.i.d.}}{\sim} d^{\mathsf{b}} \quad\text{and}\quad s_i' \sim P(\cdot\,|\,s_i, a_i, b_i), \qquad 1\le i\le N. \tag{5}$$
The goal is to learn an approximate Nash equilibrium on the basis of this historical data set.
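For simulation purposes, such a batch data set can be generated as in the sketch below; recording the (deterministic) reward of each visited triple is an implementation convenience of ours rather than something prescribed by the paper, and the function name is hypothetical.

```python
import numpy as np

def draw_offline_dataset(P, r, d_b, num_samples, seed=0):
    """Draw N independent transitions with (s, a, b) ~ d_b and s' ~ P(. | s, a, b).

    d_b has shape (S, A, B) and sums to one.
    """
    rng = np.random.default_rng(seed)
    S, A, B, _ = P.shape
    # Sample the state-action triples i.i.d. from the behavior distribution.
    flat = rng.choice(S * A * B, size=num_samples, p=d_b.reshape(-1))
    s, a, b = np.unravel_index(flat, (S, A, B))
    # Draw the next state from the true transition kernel for each triple.
    s_next = np.array([rng.choice(S, p=P[si, ai, bi]) for si, ai, bi in zip(s, a, b)])
    return s, a, b, r[s, a, b], s_next
```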
In general, the data distribution $d^{\mathsf{b}}$ might deviate from the one generated by a Nash equilibrium $(\mu^{\star},\nu^{\star})$. As a result, whether reliable learning is feasible depends heavily upon the quality of the historical data. To quantify the quality of the data distribution, Cui and Du (2022b) introduce the following unilateral concentrability condition.
Assumption 1 (Unilateral Concentrability, Cui and Du 2022b). The behavior distribution $d^{\mathsf{b}}$ obeys
$$C^{\star}_{\mathsf{unilateral}} := \max\left\{\,\max_{\nu,\,(s,a,b)}\frac{d^{\mu^{\star},\nu}_{\rho}(s,a,b)}{d^{\mathsf{b}}(s,a,b)},\ \max_{\mu,\,(s,a,b)}\frac{d^{\mu,\nu^{\star}}_{\rho}(s,a,b)}{d^{\mathsf{b}}(s,a,b)}\right\}\ <\ \infty, \tag{6}$$
where the maxima are taken over all policies of the deviating player and over all $(s,a,b)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}$.
In words, this quantity employs certain density ratios to measure the distribution mismatch between the target distribution and the data distribution in hand. On the one hand, Assumption 1 is substantially weaker than the type of uniform coverage requirement (which imposes a uniform bound on the density ratio over all policy pairs and all state–action tuples simultaneously), given that (6) freezes the policy of one side while maximizing over all policies of the other side. On the other hand, Assumption 1 remains more stringent than a single-policy coverage requirement (which only requires the data set to cover the part of the state–action space reachable by a given policy pair), as (6) requires the data to cover those state–action pairs reachable by any unilateral deviation from the target policy pair $(\mu^{\star},\nu^{\star})$. As posited by Cui and Du (2022b), unilateral coverage (i.e., a finite $C^{\star}_{\mathsf{unilateral}}$) is necessary for learning Nash equilibria in Markov games, which stands in sharp contrast to the single-agent case in which single-policy concentrability suffices for finding the optimal policy (Rashidinejad et al. 2021, Xie et al. 2021, Li et al. 2024b).
In this paper, we introduce a modified assumption that might give rise to slightly improved sample complexity bounds.
Assumption 2 (Unilateral Clipped Concentrability). The behavior distribution $d^{\mathsf{b}}$ obeys $C^{\star}_{\mathsf{clipped}} < \infty$, where the unilateral clipped concentrability coefficient $C^{\star}_{\mathsf{clipped}}$ (defined in (7)) is obtained from (6) by clipping each occupancy density in the numerator from above at a suitable threshold.
In a nutshell, when $d^{\mu^{\star},\nu}_{\rho}(s,a,b)$ or $d^{\mu,\nu^{\star}}_{\rho}(s,a,b)$ is reasonably large (i.e., larger than the clipping threshold), Assumption 2 no longer requires the data distribution to scale proportionally with it, thus resulting in a (slight) relaxation of Assumption 1. Comparing (7) with (6) immediately reveals that $C^{\star}_{\mathsf{clipped}} \le C^{\star}_{\mathsf{unilateral}}$.
3. Algorithm and Main Theory
In this section, we propose a pessimistic, model-based, offline algorithm—called VI-LCB-Game—to solve the two-player, zero-sum Markov game. The proposed algorithm is then shown to achieve minimax-optimal sample complexity in finding an approximate Nash equilibrium of the Markov game given offline data.
3.1. Algorithm Design
3.1.1. The Empirical Markov Game.
With the offline data set in hand, we can readily construct an empirical Markov game. To do so, we first compute, for each $(s,a,b)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}$, the sample size $N(s,a,b) := \sum_{i=1}^{N}\mathbb{1}\{(s_i,a_i,b_i)=(s,a,b)\}$, and then form the empirical transition kernel $\widehat{P}$ (see (8)) and the empirical reward function $\widehat{r}$ (see (9)) by averaging over the transitions observed for each triple.
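A minimal sketch of this empirical construction is given below. The treatment of unvisited triples follows the paper's definitions (8) and (9), which are not reproduced here; the uniform-transition convention in the code is our own placeholder, and the function name is hypothetical.

```python
import numpy as np

def empirical_model(s, a, b, rew, s_next, S, A, B):
    """Empirical transition kernel and reward estimated from offline transitions."""
    counts = np.zeros((S, A, B))
    P_hat = np.zeros((S, A, B, S))
    r_hat = np.zeros((S, A, B))
    # Accumulate visitation counts, transition frequencies, and rewards.
    np.add.at(counts, (s, a, b), 1.0)
    np.add.at(P_hat, (s, a, b, s_next), 1.0)
    np.add.at(r_hat, (s, a, b), rew)
    visited = counts > 0
    P_hat[visited] /= counts[visited][:, None]   # empirical frequencies
    r_hat[visited] /= counts[visited]            # empirical average reward
    P_hat[~visited] = 1.0 / S                    # placeholder for unseen triples
    return counts, P_hat, r_hat
```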
3.1.2. Pessimistic Bellman Operators.
Recall that the classic Bellman operator $\mathcal{T}(\cdot)$ is defined such that (Shapley 1953, Lagoudakis and Parr 2002), for any $(s,a,b)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}$,
$$\mathcal{T}(Q)(s,a,b) := r(s,a,b) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a,b)}\Big[\max_{\mu\in\Delta(\mathcal{A})}\min_{\nu\in\Delta(\mathcal{B})}\ \mathbb{E}_{a'\sim\mu,\,b'\sim\nu}\big[Q(s',a',b')\big]\Big].$$
Note, however, that we are in need of modified versions of the Bellman operator in order to accommodate the offline setting. In this paper, we introduce a pessimistic Bellman operator for the max-player and one for the min-player as follows: for every $(s,a,b)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}$,
It is well known that the classic Bellman operator satisfies the γ-contraction property, which guarantees fast global convergence of classic value iteration. As it turns out, the pessimistic Bellman operators introduced here also enjoy the γ-contraction property, in the sense that applying either operator to two Q-function estimates shrinks their entrywise ($\ell_\infty$) distance by a factor of at least γ; see (14).
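The following sketch illustrates one plausible shape of such a penalized backup: an empirical Bellman update minus a Bernstein-style penalty that decays with the visitation count. The specific penalty, constants, and clipping below are illustrative stand-ins for the paper's (12) and (13), not a faithful reproduction; the function takes the current value estimate V as an input so that it can be reused inside the iteration sketched later.

```python
import numpy as np

def pessimistic_backup(r_hat, P_hat, counts, V, gamma, c_b=1.0, delta=1e-2):
    """One plausible pessimistic (lower-confidence-bound) backup of the Q-function.

    r_hat: (S, A, B) empirical rewards; P_hat: (S, A, B, S) empirical kernel;
    counts: (S, A, B) visitation counts N(s, a, b); V: (S,) value estimate.
    The penalty below is an illustrative Bernstein-style term; the paper's
    (12)-(13) specify the exact form and the 'sufficiently large' constant.
    """
    S, A, B, _ = P_hat.shape
    N = np.maximum(counts, 1.0)
    PV = P_hat @ V                                     # E_{s'~P_hat}[V(s')], shape (S, A, B)
    var = np.maximum(P_hat @ (V ** 2) - PV ** 2, 0.0)  # empirical variance of V
    log_term = np.log(S * A * B / delta)
    penalty = c_b * (np.sqrt(var * log_term / N) + log_term / ((1.0 - gamma) * N))
    Q = r_hat + gamma * PV - penalty
    # Keep the estimate within the trivial value range [0, 1 / (1 - gamma)].
    return np.clip(Q, 0.0, 1.0 / (1.0 - gamma))
```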
3.1.3. Pessimistic Value Iteration with Bernstein-Style Penalty.
With the pessimistic Bellman operators in place, we are positioned to present the proposed paradigm. Our algorithm maintains a group of Q-function iterates, policy iterates, and value function iterates from the max-player's perspective; at the same time, it also maintains an analogous group of iterates from the min-player's perspective. The updates of the two groups of iterates are carried out in a completely decoupled manner except when determining the final output.
In what follows, let us describe the update rules from the max-player's perspective, with abbreviated notation adopted whenever it is clear from the context. In each round, we carry out the following update rules:
Updating Q-function estimates: Run a pessimistic variant of value iteration to yield
(15) The γ-contraction property (14) helps ensure that sufficient progress is made in each iteration of this update rule.
Updating policy estimates: We then adjust the policies based on the updated Q-function estimates (15). Specifically, for each state $s\in\mathcal{S}$, we compute the Nash equilibrium of the zero-sum matrix game whose payoff matrix is given by the current Q-function estimate at state s. It is worth noting that there is a host of methods for efficiently calculating the NE of a zero-sum matrix game; prominent examples include linear programming and no-regret learning (Raghavan 1994, Freund and Schapire 1999, Rakhlin and Sridharan 2013, Roughgarden 2016); see also the sketch following this list.
Policy evaluation: For each state $s\in\mathcal{S}$, update the value function estimates based on the updated policies as follows:
The updates of the corresponding iterates from the min-player's perspective are carried out in an analogous and completely independent manner; see Algorithm 1 for details.
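Two helpers referenced above are sketched here: the per-state zero-sum matrix game can be solved by the classic linear program (with the min-player's strategy obtained by solving the mirrored game), and a product policy can be evaluated exactly under a known model by solving a linear system. The function names are ours, and the exact evaluation shown last is a sanity-check tool rather than the one-step evaluation used inside the algorithm.

```python
import numpy as np
from scipy.optimize import linprog

def max_player_lp(G):
    """LP for the maximizing player of the matrix game with payoff G (A x B)."""
    A, B = G.shape
    c = np.zeros(A + 1)
    c[-1] = -1.0                                   # maximize the game value v
    A_ub = np.hstack([-G.T, np.ones((B, 1))])      # v <= (mu^T G)_b for every column b
    b_ub = np.zeros(B)
    A_eq = np.hstack([np.ones((1, A)), np.zeros((1, 1))])
    b_eq = [1.0]                                   # mu lies on the simplex
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:A], res.x[-1]

def solve_matrix_game(G):
    """Nash equilibrium (mu, nu) and value of max_mu min_nu mu^T G nu."""
    G = np.asarray(G, dtype=float)
    mu, value = max_player_lp(G)
    nu, _ = max_player_lp(-G.T)                    # the min-player solves the mirrored game
    return mu, nu, value

def evaluate_product_policy(P, r, gamma, mu, nu):
    """Exact V^{mu, nu} for a product policy via (I - gamma * P_pi) V = r_pi."""
    S = P.shape[0]
    joint = np.einsum("sa,sb->sab", mu, nu)        # joint action distribution per state
    P_pi = np.einsum("sab,sabt->st", joint, P)
    r_pi = np.einsum("sab,sab->s", joint, r)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```

For instance, on the matching-pennies payoff matrix G = [[1, -1], [-1, 1]], solve_matrix_game returns mu = nu = (0.5, 0.5) with value 0, as expected.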
3.1.4. Final Output.
By running these update rules for sufficiently many iterations, we arrive at the final Q-function estimates of the two players.
The final policy estimate of the algorithm is then chosen to be the product policy $(\widehat{\mu},\widehat{\nu})$, in which $\widehat{\mu}$ is taken from the max-player's group of iterates and $\widehat{\nu}$ from the min-player's group.
The full algorithm is summarized in Algorithm 1.
Algorithm 1 (VI-LCB-Game).
Initialization: initialize the Q-function and value function iterates of both players for all state–action tuples.
Compute the empirical transition kernel $\widehat{P}$ via (8) and the empirical reward function $\widehat{r}$ via (9).
For t = 1, 2, … do
Update the Q-function estimates of both players:
where the Bernstein-style penalty in (12) involves some sufficiently large constant, with the empirical variance defined in (13). For each state $s\in\mathcal{S}$, compute
where, for any matrix $A\in\mathbb{R}^{|\mathcal{A}|\times|\mathcal{B}|}$, the subroutine returns a Nash equilibrium of the zero-sum matrix game with payoff matrix A, that is, a solution to the minimax program $\max_{\mu\in\Delta(\mathcal{A})}\min_{\nu\in\Delta(\mathcal{B})}\mu^{\top}A\nu$. For each state $s\in\mathcal{S}$, update
Output: the policy pair $(\widehat{\mu},\widehat{\nu})$, where $\widehat{\mu}$ is obtained from the max-player's group of iterates and $\widehat{\nu}$ from the min-player's group.
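Putting the pieces together, the schematic loop below mirrors the structure of the max-player's updates in Algorithm 1, reusing empirical_model, pessimistic_backup, and solve_matrix_game from the sketches above; it is a structural illustration rather than a faithful implementation of the paper's penalty, initialization, or output rule, and the min-player's analogous loop is omitted.

```python
import numpy as np

def vi_lcb_game_max_player(r_hat, P_hat, counts, gamma, num_iters, c_b=1.0):
    """Schematic max-player loop in the spirit of Algorithm 1 (not a faithful copy)."""
    S, A, B = r_hat.shape
    V = np.zeros(S)
    Q = np.zeros((S, A, B))
    mu = np.full((S, A), 1.0 / A)
    nu = np.full((S, B), 1.0 / B)
    for _ in range(num_iters):
        # Pessimistic (LCB) backup of the Q-function given the current value estimate.
        Q = pessimistic_backup(r_hat, P_hat, counts, V, gamma, c_b=c_b)
        # Per-state zero-sum matrix game defines the updated policies.
        for s in range(S):
            mu[s], nu[s], _ = solve_matrix_game(Q[s])
        # Evaluate the updated product policy against the current Q-function.
        V = np.einsum("sa,sab,sb->s", mu, Q, nu)
    return Q, mu, nu, V
```

A natural design choice reflected here is that the policy and value updates are driven entirely by the pessimistic Q-function, so the returned policy inherits the conservatism of the lower confidence bounds.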
3.2. Theoretical Guarantees
Our main result is to uncover the intriguing sample efficiency of the proposed model-based algorithm. This is formally stated as follows with the proof postponed to Section 6.
Consider any initial state distribution ρ and suppose that Assumption 2 holds. Consider any target accuracy $\varepsilon \in \big(0, \frac{1}{1-\gamma}\big]$ and any $\delta \in (0,1)$. Then, with probability exceeding $1-\delta$, the policy pair returned by Algorithm 1 satisfies
Our result and analysis are inspired by prior works that show that model-based RL achieves, in multiple settings, sample efficiency without the need of variance reduction (Agarwal et al. 2020; Li et al. 2024a, b). The proof of this sample complexity bound entails several key analysis ingredients: (i) a leave-one-out analysis argument that proves effective in decoupling complicated statistical dependency and (ii) a careful self-bounding trick (i.e., upper bounding a certain quantity by a contraction of itself in addition to some other error terms) to derive a sharp control of the target duality gap. See Section 6 for details. Although techniques such as leave-one-out analysis are used in some prior RL literature (Agarwal et al. 2020; Li et al. 2024a, b), as far as we know, our work applies this technique for the first time to multiagent reinforcement learning. It has been observed that extending the algorithmic or analysis ideas in single-agent RL to the multiagent counterpart often leads to suboptimal sample complexity bounds that scale linearly in the total number of joint actions AB (Zhang et al. 2020a, Cui and Du 2022b). In contrast, our analysis framework leads to an optimal sample complexity bound that scales linearly in the total number of individual actions A + B.
A line of recent works focuses on instance-optimality of RL algorithms (Khamaru et al. 2021a, b; Mou et al. 2022). However, it remains challenging to establish instance-dependent bounds for multiagent RL even in two-player, zero-sum Markov games because of the difficulties arising from offline data and multiagent settings. Unlike RL with a generative model (simulator) that can generate independent samples for all state–action pairs, offline RL suffers from substantially more challenges, such as distribution shift and limited data coverage, making it more difficult to derive instance-dependent error bounds. In addition, the prior literature Khamaru et al. (2021a) that establishes instance optimality of variance-reduced Q-learning algorithms for the optimal value estimation problem requires one of the following two conditions: the optimal policy is unique or a meaningful sample complexity bound that depends on an optimality gap can be obtained. However, neither condition has a direct analog in zero-sum Markov games; this is because the Nash equilibrium in a zero-sum Markov game is not unique in general, and there is no well-defined analog of optimality gap for zero-sum Markov games. Detailed discussion on the challenges and difficulty of extending our analysis to develop instance-dependent error bounds can be found in Section 6.5.
The sample complexity needed for Algorithm 1 to compute a policy pair with ε-duality gap is at most $\widetilde{O}\!\left(\frac{C^{\star}_{\mathsf{clipped}} S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}\right)$.
As it turns out, the preceding sample complexity theory for Algorithm 1 matches the minimax lower limit modulo some logarithmic term as asserted by the following theorem. This minimax lower bound—whose proof is postponed to Online Appendix EC.2—is inspired by prior lower bound theory for single-agent MDPs (e.g., Azar et al. 2013, Li et al. 2024b) and might shed light on how to establish lower bounds for other game-theoretic settings.
Consider any , , and , and define the set
Then, there exist some universal constants such that, for any , if the sample size obeys
Here, the infimum is taken over all estimators for the Nash equilibrium based on the batch data set generated according to (5).
The target we are estimating is the NE of a zero-sum MG, which is more challenging than standard statistical estimation problems in the sense that (i) the NE is not unique in general and (ii) the error metric is a duality gap. It is challenging to use standard proof frameworks such as Fano's and Le Cam's methods to derive a meaningful lower bound for this problem. To overcome this challenge, we construct a family of hard Markov game instances indexed by a binary parameter θ, put a prior distribution over this set, and compute the posterior probability of failing to differentiate each entry of θ. Taken collectively, these steps allow us to compute the desired minimax risk.
As a direct implication of Theorem 2, if the total number of samples in the offline data set obeys
Our theory makes remarkable improvement upon prior art, which can be seen through comparisons with the most relevant prior work (Cui and Du 2022b) (even though the focus therein is finite-horizon, zero-sum MGs). On a high level, Cui and Du (2022b) propose an algorithm that combines pessimistic value iteration with variance reduction (also called reference-advantage decomposition; Zhang et al. 2020b), which provably finds an ε-Nash policy pair using
Perhaps most importantly, our result scales linearly in the total number of individual actions A + B (as opposed to the number of joint actions AB as in Cui and Du 2022b), which manages to alleviate the curse of multiple agents in two-player, zero-sum Markov games.
Our theory accommodates the full ε-range $\big(0, \frac{1}{1-\gamma}\big]$, which is much wider than the range covered by Cui and Du (2022b) (if we view the effective horizon $\frac{1}{1-\gamma}$ in the infinite-horizon case and the horizon length H in the finite-horizon counterpart as equivalent).
The algorithm design herein is substantially simpler than that of Cui and Du (2022b): it neither requires sample splitting to decouple statistical dependency nor relies on reference-advantage decomposition techniques to sharpen the horizon dependency.
When we were finalizing the present manuscript, we became aware of the independent work Cui and Du (2022a) proposing a different offline algorithm—based on incorporation of strategy-wise lower confidence bounds—that improved the prior art as well. When it comes to two-player, zero-sum Markov games with finite horizon and nonstationary transition kernels, Cui and Du (2022a, algorithm 1) provably yields an ε-Nash policy pair using
4. Related Works
4.1. Offline RL and Pessimism Principle
The principle of pessimism in the face of uncertainty, namely, being conservative in value estimation of those state–action pairs that have been under-covered, has been adopted extensively in recent development of offline RL. A highly incomplete list includes Kumar et al. (2020), Kidambi et al. (2020), Yu et al. (2020; 2021a, b), Yin et al. (2021a, c), Rashidinejad et al. (2021), Jin et al. (2021b), Xie et al. (2021), Liu et al. (2020), Zhang et al. (2021c), Chang et al. (2021), Yin and Wang (2021), Uehara and Sun (2021), Munos (2003, 2007), Zanette et al. (2021), Yan et al. (2023), Li et al. (2022a, 2024b), Shi et al. (2022), Cui and Du (2022b), Zhong et al. (2022), Lu et al. (2022), Wang et al. (2022), and Xu and Liang (2022), which unveils the efficacy of the pessimism principle in both model-based and model-free approaches. Among this body of prior works, the ones that are most related to the current paper are Cui and Du (2022a, b) and Zhong et al. (2022), all of which focus on episodic, finite-horizon, zero-sum Markov games with two players. More specifically, Cui and Du (2022b) demonstrate that a unilateral concentrability condition is necessary for learning NEs in offline settings and propose a pessimistic value iteration with reference-advantage decomposition to enable sample efficiency. Zhong et al. (2022) propose a pessimistic minimax value iteration algorithm, which achieves appealing sample complexity in the presence of linear function representation and was recently improved by Xiong et al. (2022). In concurrent work, Cui and Du (2022a) propose a different pessimistic algorithm that designs lower confidence bounds for policy pairs instead of state–action pairs; for two-player, zero-sum MGs, their algorithm is capable of achieving a sample complexity proportional to A + B. In the single-agent, offline RL setting, Rashidinejad et al. (2021), Yan et al. (2023), and Li et al. (2024b) study offline RL for infinite-horizon MDPs, and Jin et al. (2021b), Xie et al. (2021), Shi et al. (2022), and Li et al. (2024b) look at the finite-horizon episodic counterpart, all of which operate upon some single-policy concentrability assumptions. Among these works, Li et al. (2024b) and Yan et al. (2023) achieved minimax-optimal sample complexity for discounted, infinite-horizon MDPs by means of model-based and model-free algorithms, respectively; similar results have been established for finite-horizon MDPs as well (Xie et al. 2021; Yin et al. 2021b, c; Shi et al. 2022; Li et al. 2024b).
4.2. Multiagent RL and Markov Games
The concept of Markov games—also under the name of stochastic games—dates back to Shapley (1953) and has become a central framework to model competitive multiagent decision making. A large strand of prior works studies how to efficiently solve Markov games when perfect model description is available (Littman 1994, 2001; Hu and Wellman 2003; Hansen et al. 2013; Perolat et al. 2015; Daskalakis et al. 2020, 2023; Cen et al. 2021; Wei et al. 2021; Zhao et al. 2021; Chen et al. 2022; Mao and Başar 2022). Recent years have witnessed much activity in studying the sample efficiency of learning Nash equilibria in zero-sum Markov games, covering multiple different types of sampling schemes; for instance, Wei et al. (2017), Xie et al. (2020), Bai et al. (2020), Bai and Jin (2020), Liu et al. (2021), Jin et al. (2021a), Song et al. (2021), Mao and Başar (2022), Daskalakis et al. (2023), Tian et al. (2021), and Chen et al. (2022) focus on online explorative environments, whereas Zhang et al. (2020a) pays attention to the scenario that assumes sampling access to a generative model. Whereas the majority of these works exhibit a sample complexity that scales at least as in order to learn an approximate NE, the recent work Bai et al. (2020) proposes a V-learning algorithm attaining a sample complexity that scales linearly with , thus matching the minimax-optimal lower bound up to a factor of $H^2$. When a generative model is available, Li et al. (2022b) further develops an algorithm that learns ε-Nash using samples, which attains the minimax lower bound for nonstationary, finite-horizon MGs. The setting of general-sum, multiplayer Markov games is much more challenging given that learning Nash equilibria is known to be PPAD-complete (Daskalakis et al. 2009, Daskalakis 2013). Shifting attention to more tractable solution concepts, Jin et al. (2021a), Daskalakis et al. (2023), Mao and Başar (2022), and Song et al. (2021) propose algorithms that provably learn (coarse) correlated equilibria with sample complexities that scale linearly with $\max_{i} A_i$ (where $A_i$ is the number of actions of the ith player), thereby breaking the curse of multiagents. Additionally, there are also several works investigating the turn-based setting in which the two players take actions in turn; see Sidford et al. (2020), Cui and Yang (2021), Jia et al. (2019), and Jin et al. (2022). Moreover, two other works, Zhang et al. (2021b) and Abe and Kaneko (2020), study offline sampling oracles under uniform coverage requirements (which are clearly more stringent than the unilateral concentrability assumption). Interested readers are also referred to Zhang et al. (2021a) and Yang and Wang (2020) for an overview of recent development.
4.3. Model-Based RL
The method proposed in the current paper falls under the category of model-based algorithms, which decouple model estimation and policy learning (planning). The model-based approach is extensively studied in the single-agent setting, including the online exploration setting (Azar et al. 2017, Zhang et al. 2023), the case with a generative model (Azar et al. 2013, Agarwal et al. 2020, Jin and Sidford 2021, Wang et al. 2021, Li et al. 2024a), the offline RL setting (Xie et al. 2021, Li et al. 2024b), and turn-based Markov games (Cui and Yang 2021). Encouragingly, the model-based approach is capable of attaining minimax-optimal sample complexities in a variety of settings (e.g., Azar et al. 2017, Agarwal et al. 2020, Zhang et al. 2023, Li et al. 2024b), sometimes even without incurring any burn-in cost (Cui and Yang 2021; Zhang et al. 2023; Li et al. 2024a, b). The method proposed in Cui and Du (2022b) also exhibits the flavor of a model-based algorithm although an additional variance reduction scheme is incorporated in order to optimize the horizon dependency.
5. Additional Notation
Let us collect a set of additional notations that are used in the analysis. First of all, for any , any vector , and any probability transition kernel , we define
6. Proof of Theorem 1
Toward proving Theorem 1, we first state a slightly stronger result as follows.
Consider any initial state distribution and suppose that Assumption 2 holds. Assume that . Then, with probability exceeding , the policy pair returned by Algorithm 1 satisfies
As can be straightforwardly verified, Theorem 1 is a direct consequence of Theorem 3 (by taking the right-hand side of (23) to be no larger than ε).
The remainder of this section is, thus, dedicated to establishing Theorem 3. Before proceeding, let us now take a moment to provide a brief road map of the proof.
We first show in Section 6.2 that the pessimistic Bellman operators and introduced in (11) are both monotone and γ-contractive and admit unique fixed points and , respectively. These properties reveal that the pessimistic value iterations () in Algorithm 1 converge to () at a geometric rate, and therefore, it suffices to analyze the fixed points and .
Next, we show Bernstein-style concentration bounds for random quantities such as and in Section 6.3, in which and are the value functions associated with and . Because of the complicated statistical dependency between and , we use a leave-one-out argument to establish this concentration result in Lemma 2.
Finally, based on the aforementioned results, we derive error bounds for and in Section 6.4. Our analysis makes use of a self-bounding trick, which allows one to derive sharp estimation error bounds that turn out to be minimax-optimal.
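To make Step 1 of this road map concrete, the following is the standard quantitative consequence of the γ-contraction property, written with $Q^{(t)}$ standing for the max-player's iterates and $\bar{Q}$ for the fixed point of its pessimistic operator (both symbols are our own shorthand here), under the usual assumption that these Q-functions take values in $[0, \frac{1}{1-\gamma}]$:

```latex
\big\| Q^{(t)} - \bar{Q} \big\|_{\infty}
  \;\le\; \gamma \, \big\| Q^{(t-1)} - \bar{Q} \big\|_{\infty}
  \;\le\; \cdots
  \;\le\; \gamma^{t} \, \big\| Q^{(0)} - \bar{Q} \big\|_{\infty}
  \;\le\; \frac{\gamma^{t}}{1-\gamma}.
```

Hence, on the order of $\frac{1}{1-\gamma}\log\frac{1}{(1-\gamma)\varepsilon}$ iterations suffice to drive this optimization error below ε, so that the statistical error of the fixed point dominates the final bound.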
6.1. Preliminary Facts
Before continuing, we collect several preliminary facts that are useful throughout.
For any , we have
(24) where $V_1$ ($V_2$) denotes the value function associated with $Q_1$ ($Q_2$); see (10) for the precise definition. For any , any probability transition kernel , and any , we have
(25) where is defined in (21). As a consequence, we also know that, for any and any , the corresponding penalty terms (cf. (12)) obey
(26)
The proof of the preceding results can be found in Online Appendix EC.1.1.
6.2. Step 1: Key Properties of Pessimistic Bellman Operators
Recall the definition of the pessimistic Bellman operators and introduced in (11). The following lemma gathers a couple of key properties of these two operators.
The following properties hold true:
(Monotonicity) For any , we have and .
(Contraction) Both operators are γ-contractive; that is,
for any $Q_1$ and $Q_2$. (Uniqueness of fixed points) Each of the two operators has a unique fixed point, which also satisfies for any .
See Online Appendix EC.1.2. □
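One elementary fact underlying the contraction property (a standard observation, not taken from the appendix) is that the per-state max-min operation is 1-Lipschitz with respect to the entrywise maximum norm: for any payoff matrices $G_1, G_2 \in \mathbb{R}^{A\times B}$,

```latex
\Big| \max_{\mu \in \Delta(\mathcal{A})} \min_{\nu \in \Delta(\mathcal{B})} \mu^{\top} G_1 \nu
      \;-\; \max_{\mu \in \Delta(\mathcal{A})} \min_{\nu \in \Delta(\mathcal{B})} \mu^{\top} G_2 \nu \Big|
  \;\le\; \max_{a,\,b} \big| G_1(a,b) - G_2(a,b) \big|,
```

so the only source of contraction in a Bellman-type backup is the discount factor γ multiplying the next-state value.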
Next, we make note of several immediate consequences of Lemma 1. Here and throughout, and are defined to be the value functions (see (10)) associated with and , respectively.
First, the preceding lemma implies that
(27) To see this, we first note that . Next, suppose that for some iteration ; then, the monotonicity of (cf. Lemma 1) tells us that
from which (27) follows. In addition, the γ-contraction property in Lemma 1 leads to
(28) and, to justify this, observe that
which, together with and (24), gives
A similar argument also yields
(29)
6.3. Step 2: Decoupling Statistical Dependency and Establishing Pessimism
To proceed, we rely on the following theorem to quantify the difference between and P when projected onto a value function direction.
For any satisfying with probability exceeding ,
The proof is deferred to Online Appendix EC.1.3.
In words, the first result (30) delivers some Bernstein-type concentration bound, whereas the second result (31) guarantees that the empirical variance estimate (i.e., the plug-in estimate) is close to the true variance. It is worth noting that Lemma 2 does not require to be statistically independent from , which is particularly crucial when coping with the complicated statistical dependency of our iterative algorithm. The proof of Lemma 2 is established upon a leave-one-out analysis argument (see, e.g., Chen et al. 2019a, b; Agarwal et al. 2020; Ma et al. 2020; Chen et al. 2021; Li et al. 2024a, b) that helps decouple statistical dependency; see details in Online Appendix EC.1.3. Armed with Lemma 2, we can readily see that
With probability exceeding , it holds that
The proof is provided in Online Appendix EC.1.4.
This lemma makes clear a key rationale for the principle of pessimism: we want the Q-function estimates to be always conservative uniformly over all entries.
6.4. Step 3: Bounding and
Before proceeding to bound , we first develop a lower bound on given that is lower bounded by (according to Lemma 3). Toward this end, we invoke the definition of to reach
With probability exceeding , we have
The proof is deferred to Online Appendix EC.1.5.
In addition, we can invoke Lemma 3 and the fact that to reach
Taking (37) and (39) collectively yields
In order to further control (42), we resort to the following lemma for bounding , whose proof can be found in Online Appendix EC.1.6.
There exists some large enough universal constant such that
This is with probability exceeding .
To finish, taking (38), (42) and Lemma 5 together gives
This completes the proof for Claim (22a). The proof for the other claim (22b) follows from an almost identical argument and is, hence, omitted.
6.5. Discussion: Instance-Dependent Statistical Bounds?
Thus far, we have presented the proof of Theorem 1 that concerns the minimax optimality of the model-based algorithm. Note that a recent line of work has attempted to move beyond minimax-optimal statistical guarantees and pursue more refined instance-optimal (or locally minimax) performance guarantees (Khamaru et al. 2021a, b, Mou et al. 2022). Here, we take a moment to discuss the challenges that need to be overcome in order to extend our analysis in an instance-optimal fashion.
A crucial step that allows us to obtain error bounds that scale linearly with A + B instead of the ones that scale linearly with AB in Cui and Du (2022b) is the introduction of an auxiliary policy $\nu_0$ in (34). This allows us to upper bound the error with
see (42). Whereas this facilitates our analysis, the terms (defined in (36)) and (defined in (41)) both depend on the auxiliary policy $\nu_0$. Because of the complicated dependency between $\nu_0$ (which is determined by a random function ) and the model parameters, it remains quite challenging to connect these error terms with instance-dependent quantities (i.e., model parameters) without losing optimality. Note that we might be able to resolve this issue in a coarse way, for example, by taking the supremum over all possible policies ν:
(43) Let us assume for the moment that this could work (despite the potential suboptimality of this error bound) and see what this leads to. By checking the proof of Lemma 5, we can see that, in order to upper bound (43) in an instance-optimal manner, it is important to relate to model parameters for all . Ideally, we can replace with the value function associated with the Nash equilibrium . However, based on our current analysis framework, we can only show that and are close, which is insufficient to guarantee the closeness of and for all .
As we briefly mention in Remark 2, prior literature Khamaru et al. (2021a), which establishes instance optimality for the optimal value estimation problem in single-agent RL under a generative model, requires one of the following two conditions: either the optimal policy is unique, or one can obtain a sample complexity bound that depends on an optimality gap of the form
(44) where is the optimal Q-function, is the set of deterministic policies, and is the set of optimal (deterministic) policies. However, neither condition has a direct analog in two-player, zero-sum Markov games: for the first one, this is because the Nash equilibrium of a zero-sum Markov game is not unique in general; for the second one, this is because the Nash equilibrium policy pair is usually random, and it is not clear how to define a nonzero optimality gap such as (44).
In view of these challenges, our current analysis framework remains incapable of deriving instance-optimal performance guarantees. Accomplishing instance-optimal results for zero-sum Markov games might require substantially more refined analysis techniques, and we leave this important direction to future investigation.
7. Discussion
In the present paper, we propose a model-based offline algorithm, which leverages the principle of pessimism in solving two-player, zero-sum Markov games on the basis of past data. In order to find an ε-approximate Nash equilibrium of the Markov game, our algorithm requires no more than $\widetilde{O}\!\left(\frac{C^{\star}_{\mathsf{clipped}} S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}\right)$ samples, and this sample complexity bound is provably minimax optimal for the entire range of target accuracy levels $\varepsilon \in \big(0, \frac{1}{1-\gamma}\big]$. Our theory improves upon prior sample complexity bounds in Cui and Du (2022b) in terms of the dependency on the size of the action space. Another appealing feature is the simplicity of our algorithm, which does not require complicated variance reduction schemes and is, hence, easier to implement and interpret. Moving forward, there are a couple of interesting directions that are worthy of future investigation. For instance, one natural extension is to explore whether the current algorithmic idea and analysis extend to multiagent, general-sum Markov games with the goal of learning other solution concepts of equilibria such as coarse correlated equilibria (given that finding Nash equilibria in general-sum games is PPAD-complete). Another topic of interest is to design model-free algorithms for offline NE learning in zero-sum or general-sum Markov games. Furthermore, the current paper focuses attention on tabular Markov games, and it would be of great interest to design sample-efficient, offline, multiagent algorithms in the presence of function approximation.
Y. Chen thanks Shicong Cen for helpful discussions about Markov games.
References
- (2020) Off-policy exploitability-evaluation in two-player zero-sum Markov games. Preprint, submitted July 4, https://arxiv.org/abs/2007.02141.
- (2020) Model-based reinforcement learning with a generative model is minimax optimal. Jacob A, Shivani A, eds. Proc. 33rd Conf. Learn. Theory (PMLR, New York), 67–83.
- (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.
- (2017) Minimax regret bounds for reinforcement learning. Doina P, Yee Whye T, eds. Proc. 34th Internat. Conf. Machine Learn. (PMLR, New York), 263–272.
- (2020) Provable self-play algorithms for competitive reinforcement learning. Daumé H III, Aarti S, eds. Proc. 37th Internat. Conf. Machine Learn. (PMLR, New York), 551–560.
- (2020) Near-optimal reinforcement learning with self-play. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 2159–2170.
- (2019) Emergent tool use from multi-agent autocurricula. Internat. Conf. Learn. Representations (Curran Associates, Inc., Red Hook, NY).
- (2019) Dota 2 with large scale deep reinforcement learning. Preprint, submitted December 13, https://arxiv.org/abs/1912.06680.
- (2017) Dynamic Programming and Optimal Control, 4th ed. (Athena Scientific, Belmont, MA).
- (2019) Superhuman AI for multiplayer poker. Science 365(6456):885–890.
- (2021) Fast policy extragradient methods for competitive games with entropy regularization. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 27952–27964.
- (2021) Mitigating covariate shift in imitation learning via offline data without great coverage. Preprint, submitted June 6, https://arxiv.org/abs/2106.03207.
- (2022) Almost optimal algorithms for two-player zero-sum linear mixture Markov games. Sanjoy D, Nika H, eds. Proc. 33rd Internat. Conf. Algorithmic Learn. Theory (PMLR, New York), 227–261.
- (2019a) Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Math. Programming 176:5–37.
- (2021) Spectral methods for data science: A statistical perspective. Foundations Trends Machine Learn. 14(5):566–806.
- (2019b) Inference and uncertainty quantification for noisy matrix completion. Proc. Natl. Acad. Sci. USA 116(46):22931–22937.
- (2022a) Provably efficient offline multi-agent reinforcement learning via strategy-wise bonus. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Adv. Neural Inform. Processing Systems, vol. 35 (Curran Associates, Inc., Red Hook, NY), 11739–11751.
- (2022b) When are offline two-player zero-sum Markov games solvable? Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Adv. Neural Inform. Processing Systems, vol. 35 (Curran Associates, Inc., Red Hook, NY), 25779–25791.
- (2021) Minimax sample complexity for turn-based stochastic game. de Campos C, Maathuis MH, eds. Uncertainty Artificial Intelligence (PMLR, New York), 1496–1504.
- (2013) On the complexity of approximating a Nash equilibrium. ACM Trans. Algorithms 9(3):1–35.
- (2020) Independent policy gradient methods for competitive reinforcement learning. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 5527–5540.
- (2009) The complexity of computing a Nash equilibrium. SIAM J. Comput. 39(1):195–259.
- (2023) The complexity of Markov equilibrium in stochastic games. Gergely N, Lorenzo R, eds. Proc. 36th Annual Conf. Learn. Theory (PMLR, New York), 4180–4234.
- (1999) Adaptive game playing using multiplicative weights. Games Econom. Behav. 29(1–2):79–103.
- (2013) Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM 60(1):1–16.
- (2003) Nash Q-learning for general-sum stochastic games. J. Machine Learn. Res. 4:1039–1069.
- (2019) Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 364(6443):859–865.
- (2019) Feature-based Q-learning for two-player stochastic games. Preprint, submitted June 2, https://arxiv.org/abs/1906.00423.
- (2021) Toward tight bounds on the sample complexity of average-reward MDPs. Marina M, Tong Z, eds. Proc. 38th Internat. Conf. Machine Learn. (PMLR, New York), 5055–5064.
- (2022) The complexity of infinite-horizon general-sum stochastic games. Preprint, submitted April 8, https://arxiv.org/abs/2204.04186.
- (2021b) Is pessimism provably efficient for offline RL? Marina M, Tong Z, eds. Proc. 38th Internat. Conf. Machine Learn. (PMLR, New York), 5084–5096.
- (2021a) V-learning—A simple, efficient, decentralized algorithm for multiagent RL. Preprint, submitted October 27, https://arxiv.org/abs/2110.14555.
- (2021a) Instance-optimality in optimal value estimation: Adaptivity via variance-reduced Q-learning. Preprint, submitted June 28, https://arxiv.org/abs/2106.14352.
- (2021b) Is temporal difference learning optimal? An instance-dependent analysis. SIAM J. Math. Data Sci. 3(4):1013–1040.
- (2020) MOReL: Model-based offline reinforcement learning. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 21810–21823.
- (2020) Conservative Q-learning for offline reinforcement learning. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 1179–1191.
- (2002) Value function approximation in zero-sum Markov games. Proc. 18th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers Inc., Burlington, MA), 283–292.
- (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems. Preprint, submitted May 4, https://arxiv.org/abs/2005.01643.
- (2022a) Pessimism for offline linear contextual bandits using confidence sets. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Adv. Neural Inform. Processing Systems, vol. 35 (Curran Associates, Inc., Red Hook, NY), 20974–20987.
- (2022b) Minimax-optimal multi-agent RL in zero-sum Markov games with a generative model. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 15353–15367.
- (2024a) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Oper. Res. 72(1):203–221.
- (2024b) Settling the sample complexity of model-based offline reinforcement learning. Ann. Statist. 52(1):233–260.
- (2021) Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 17762–17776.
- (1994) Markov games as a framework for multi-agent reinforcement learning. Machine Learn. Proc. (Elsevier, Amsterdam), 157–163.
- (2001) Friend-or-foe Q-learning in general-sum games. Proc. 18th Internat. Conf. Machine Learn., vol. 1 (Morgan Kaufmann Publishers Inc., Burlington, MA), 322–328.
- (2020) Provably good batch reinforcement learning without great exploration. Preprint, submitted July 16, https://arxiv.org/abs/2007.08202.
- (2021) A sharp analysis of model-based reinforcement learning with self-play. Marina M, Tong Z, eds. Proc. 38th Internat. Conf. Machine Learn. (PMLR, New York), 7001–7010.
- (2022) Pessimism in the face of confounders: Provably efficient offline reinforcement learning in partially observable Markov decision processes. Preprint, submitted May 26, https://arxiv.org/abs/2205.13589.
- (2020) Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Foundations Comput. Math. 20(3):451–632.
- (2022) Provably efficient reinforcement learning in decentralized general-sum Markov games. Dynamic Games Appl. 13:165–186.
- (2022) Optimal variance-reduced stochastic approximation in Banach spaces. Preprint, submitted January 21, https://arxiv.org/abs/2201.08518.
- (2003) Error bounds for approximate policy iteration. Proc. 20th Internat. Conf. Machine Learn., vol. 3 (AAAI Press, Washington, DC), 560–567.
- (2007) Performance bounds in $L_p$-norm for approximate value iteration. SIAM J. Control Optim. 46(2):541–561.
- (1951) Non-cooperative games. Ann. Math. 54(2):286–295.
- (2015) Approximate dynamic programming for two-player zero-sum Markov games. Francis B, David B, eds. Proc. 32nd Internat. Conf. Machine Learn. (PMLR, New York), 1321–1329.
- (1994) Zero-sum two-person games. Aumann R, Hart S, eds. Handbook of Game Theory with Economic Applications, vol. 2 (Elsevier, Amsterdam), 735–768.
- (2013) Optimization, learning, and games with predictable sequences. Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems, vol. 26 (Curran Associates, Inc., Red Hook, NY).
- (2021) Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Neural Inform. Processing Systems (NeurIPS).
- (2016) Twenty Lectures on Algorithmic Game Theory (Cambridge University Press, Cambridge, MA).
- (2016) Safe, multi-agent, reinforcement learning for autonomous driving. Preprint, submitted October 11, https://arxiv.org/abs/1610.03295.
- (1953) Stochastic games. Proc. Natl. Acad. Sci. USA 39(10):1095–1100.
- (2022) Pessimistic Q-learning for offline reinforcement learning: Toward optimal sample complexity. Chaudhuri C, Jegelka S, Song L, Szepesvari C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn. (PMLR, New York), 19967–20025.
- (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. Silvia C, Roberto C, eds. Proc. 23rd Internat. Conf. Artificial Intelligence Statistics (PMLR, New York), 2992–3002.
- (2021) When can we learn general-sum Markov games with a large number of players sample-efficiently? Preprint, submitted October 8, https://arxiv.org/abs/2110.04184.
- (2021) Online learning in unknown Markov games. Marina M, Tong Z, eds. Internat. Conf. Machine Learn. (PMLR, New York), 10279–10288.
- (2021) Pessimistic model-based offline reinforcement learning under partial coverage. Preprint, submitted July 13, https://arxiv.org/abs/2107.06226.
- (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782):350–354.
- (2022) On gap-dependent bounds for offline reinforcement learning. Preprint, submitted June 1, https://arxiv.org/abs/2206.00177.
- (2021) Sample-efficient reinforcement learning for linearly-parameterized MDPs with a generative model. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 23009–23022.
- (2017) Online reinforcement learning in stochastic games. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Curran Associates, Inc., Red Hook, NY).
- (2021) Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. Mikhail B, Samory K, eds. Proc. 34th Conf. Learn. Theory (PMLR, New York), 4259–4299.
- (2020) Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. Jacob A, Shivani A, eds. Proc. 33rd Conf. Learn. Theory (PMLR, New York), 3674–3682.
- (2021) Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Preprint, submitted June 9, https://arxiv.org/abs/2106.04895.
- (2022) Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent MDP and Markov game. Preprint, submitted May 31, https://arxiv.org/abs/2205.15512.
- (2022) Provably efficient offline reinforcement learning with trajectory-wise reward. Preprint, submitted June 13, https://arxiv.org/abs/2206.06426.
- (2023) The efficacy of pessimism in asynchronous Q-learning. IEEE Trans. Inform. Theory 69(11):7185–7219.
- (2020) An overview of multi-agent reinforcement learning from game theoretical perspective. Preprint, submitted November 1, https://arxiv.org/abs/2011.00583.
- (2021) Toward instance-optimal offline reinforcement learning with pessimism. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY).
- (2021a) Near-optimal offline reinforcement learning via double variance reduction. Preprint, submitted February 2, https://arxiv.org/abs/2102.01748.
- (2021b) Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. Arindam B, Kenji F, eds. Proc. 24th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1567–1575.
- (2021c) Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. Internat. Conf. Learn. Representations.
- (2021a) Conservative data sharing for multi-task offline reinforcement learning. Preprint, submitted September 16, https://arxiv.org/abs/2109.08128.
- (2021b) COMBO: Conservative offline model-based policy optimization. Preprint, submitted February 16, https://arxiv.org/abs/2102.08363.
- (2020) MOPO: Model-based offline policy optimization. Preprint, submitted May 27, https://arxiv.org/abs/2005.13239.
- (2021) Provable benefits of actor-critic methods for offline reinforcement learning. Preprint, submitted August 19, https://arxiv.org/abs/2108.08812.
- (2021a) Multi-agent reinforcement learning: A selective overview of theories and algorithms. Vamvoudakis KG, Wan Y, Lewis FL, Cansever D, eds. Handbook of Reinforcement Learning and Control (Springer, Cham), 321–384.
- (2020b) Almost optimal model-free reinforcement learning via reference-advantage decomposition. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 15198–15207.
- (2023) Settling the sample complexity of online reinforcement learning. Preprint, submitted July 25, https://arxiv.org/abs/2307.13586.
- (2021c) Corruption-robust offline reinforcement learning. Preprint, submitted June 11, https://arxiv.org/abs/2106.06630.
- (2020a) Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 1166–1178.
- (2021b) Finite-sample analysis for decentralized batch multiagent reinforcement learning with networked agents. IEEE Trans. Automatic Control 66(12):5925–5940.
- (2021) Provably efficient policy gradient methods for two-player zero-sum Markov games. Preprint, submitted February 17, https://arxiv.org/abs/2102.08903.
- (2022) Pessimistic minimax value iteration: Provably efficient equilibrium learning from offline datasets. Preprint, submitted February 15, https://arxiv.org/abs/2202.07511.
Yuling Yan is a Norbert Wiener Postdoctoral Associate at the Massachusetts Institute of Technology. His research interests include statistics, optimization, and their applications in economics and social sciences. He has received the Institute of Mathematical Statistics Lawrence D. Brown Award, the Charlotte Elizabeth Procter Fellowship from Princeton University, the Norbert Wiener Postdoctoral Fellowship from MIT, and the International Consortium of Chinese Mathematicians Best Thesis Award.
Gen Li is an assistant professor of statistics at the Chinese University of Hong Kong. His research interests include reinforcement learning, high-dimensional statistics, machine learning, signal processing, and mathematical optimization. He has received the excellent graduate award and the excellent thesis award from Tsinghua University.
Yuxin Chen is an associate professor of statistics and data science at the University of Pennsylvania. His research interests include statistics, optimization, and machine learning. He has received the Alfred P. Sloan Research Fellowship, the Society for Industrial and Applied Mathematics Activity Group on Imaging Science Best Paper Prize, the International Consortium of Chinese Mathematicians Best Paper Award, and the Princeton Graduate Mentoring Award.
Jianqing Fan is the Frederick L. Moore Professor of Finance and Professor of Statistics at Princeton University. He was the past president of the Institute of Mathematical Statistics, and has received many awards and honors such as 2000 Committee of Presidents of Statistical Societies Presidents' Award, Guggenheim Fellow, Academician of Academia Sinica, and Royal Flemish Academy of Belgium. His research interests include statistics, data science, machine learning, and financial econometrics.