Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium
References
- [1] (2011) Improved algorithms for linear stochastic bandits. Adv. Neural Inform. Processing Systems 24:2312–2320.Google Scholar
- [2] (2019a) Exploration-enhanced POLITEX. Preprint, submitted August 27, https://doi.org/10.48550/arXiv.1908.10479.Google Scholar
- [3] Abbasi-Yadkori Y, Bartlett P, Bhatia K, Lazic N, Szepesvari C, Weisz G (2019b) POLITEX: Regret bounds for policy iteration using expert prediction. Chaudhuri K, Salakhutdinov R, eds. Proc. Internat. Conf. Machine Learn. (PMLR), 3692–3702.Google Scholar
- [4] (2017) Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. Adv. Neural Inform. Processing Systems 30:1184–1194.Google Scholar
- [5] (2020) Model-based reinforcement learning with a generative model is minimax optimal. Abernethy J, Agarwal S, eds. Proc. Conf. Learn. Theory (PMLR), 67–83.Google Scholar
- [6] (1987) Correlated equilibrium as an expression of Bayesian rationality. Econometrica 55:1–18.Crossref, Google Scholar
- [7] (2017) Minimax regret bounds for reinforcement learning. Precup D, Whye Teh Y, eds., Proc. Internat. Conf. Machine Learn., vol. 70 (PMLR), 263–272.Google Scholar
- [8] (2020) Provable self-play algorithms for competitive reinforcement learning. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 551–560.Google Scholar
- [9] (2020) Near-optimal reinforcement learning with self-play. Adv. Neural Inform. Processing Systems. 33:2159–2170.Google Scholar
- [10] (2019) Emergent tool use from multi-agent autocurricula. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
- [11] (2018) Emergent complexity via multi-agent competition. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
- [12] (2008) Regret minimization and the price of total anarchy. Dwork C, ed., Proc. 40th Annual ACM Symp. Theory Comput. (Association for Computing Machinery, New York), 373–382.Google Scholar
- [13] (1996) Linear least-squares algorithms for temporal difference learning. Machine Learn. 22(1–3):33–57.Crossref, Google Scholar
- [14] (2018) Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374):418–424.Crossref, Google Scholar
- [15] (2019) Superhuman AI for multiplayer poker. Science 365(6456):885–890.Crossref, Google Scholar
- [16] (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Systems Man Cybernetics C 38(2):156–172.Crossref, Google Scholar
- [17] (2020) Provably efficient exploration in policy optimization. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 1283–1294.Google Scholar
- [18] Chen X, Deng X, Teng SH (2009) Settling the complexity of computing two-player Nash equilibria. J. ACM 56(3):1–57.Crossref, Google Scholar
- [19] (2021) Almost optimal algorithms for two-player markov games with linear function approximation. Preprint, submitted February 15, https://doi.org/10.48550/arXiv.2102.07404.Google Scholar
- [20] (2017) Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Adv. Neural Inform. Processing Systems 30:5717–5727.Google Scholar
- [21] (2018) On oracle-efficient PAC rl with rich observations. Adv. Neural Inform. Processing Systems 31:1422–1432.Google Scholar
- [22] (2020) Independent policy gradient methods for competitive reinforcement learning. Adv. Neural Inform. Processing Systems 33:5527–5540.Google Scholar
- [23] (2009) The complexity of computing a Nash equilibrium. SIAM J. Comput. 39(1):195–259.Crossref, Google Scholar
- [24] (2014) Using response functions to measure strategy strength. Kambhampati S, ed., Proc. 28th AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA).Google Scholar
- [25] (2020) Leave-one-out approach for matrix completion: Primal and dual analysis. IEEE Trans. Inform. Theory 66(11):7274–7301.Crossref, Google Scholar
- [26] (2020) n-regret for learning in Markov decision processes with function approximation and low Bellman rank. Abernethy J, Agarwal S, eds., Proc. Conf. Learn. Theory (PMLR), 1554–1557.Google Scholar
- [27] (2019) Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
- [28] (2020) Is a good representation sufficient for sample efficient reinforcement learning? Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
- [29] (2019b) Provably efficient Q-learning with function approximation via distribution shift error checking oracle. Adv. Neural Inform. Processing Systems 32:8058–8068.Google Scholar
- [30] (2019a) Provably efficient RL with rich observations via latent state decoding. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 1665–1674.Google Scholar
- [31] (2020) A theoretical analysis of deep q-learning. Bayen AM, Jadbabaie A, Pappas G, Parrilo PA, Recht B, Tomlin C, Zeilinger M, eds., Proc. Learn. Dynamics Control (PMLR), 486–489.Google Scholar
- [32] (2016) Learning to communicate with deep multi-agent reinforcement learning. Adv. Neural Inform. Processing Systems 29:2137–2145.Google Scholar
- [33] (2016) Deep Learning (MIT Press, Cambridge, MA).Google Scholar
- [34] (2018) Balancing two-player stochastic games with soft Q-learning. Lang J, ed., Proc. 27th Internat. Joint Conf. Artificial Intelligence, (International Joint Conferences on Artificial Intelligence, California), 268–274.Google Scholar
- [35] (2003) Correlated Q-learning. Fawcett T, Mishra N, eds., Proc. 20th Internat. Conf. Machine Learn., vol. 20 (AAAI Press, Menlo Park, CA), 242–249.Google Scholar
- [36] (2013) Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM 60(1):1–16.Crossref, Google Scholar
- [37] (2003) Nash Q-learning for general-sum stochastic games. J. Machine Learn. Res. 4(Nov):1039–1069.Google Scholar
- [38] (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(Apr):1563–1600.Google Scholar
- [39] (2019) Social influence as intrinsic motivation for multi-agent deep reinforcement learning. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 3040–3049.Google Scholar
- [40] Jia Z, Yang LF, Wang M (2019) Feature-based Q-learning for two-player stochastic games. Preprint, submitted June 2, https://doi.org/10.48550/arXiv.1906.00423.Google Scholar
- [41] (2017) Contextual decision processes with low Bellman rank are PAC-learnable. Precup D, Whye Teh Y, eds., Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR), 1704–1713.Google Scholar
- [42] (2021) The power of exploiter: Provable multi-agent rl in large state spaces. Preprint, https://doi.org/10.48550/arXiv.2106.03352.Google Scholar
- [43] (2018) Is Q-learning provably efficient? Adv. Neural Inform. Processing Systems 31:4863–4873.Google Scholar
- [44] (2020a) Provably efficient reinforcement learning with linear function approximation. Abernethy J, Agarwal S, eds., Proc. Conf. Learn. Theory (PMLR), 2137–2143.Google Scholar
- [45] (2020b) Learning adversarial Markov decision processes with bandit feedback and unknown transition. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 4860–4869.Google Scholar
- [46] Kakade SM (2003) On the sample complexity of reinforcement learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London.Google Scholar
- [47] (1999) Actor-critic algorithms. Adv. Neural Inform. Processing Systems 12:1008–1014.Google Scholar
- [48] (2002) Value function approximation in zero-sum Markov games. Darwiche A, Friedman N, eds., Proc. 18th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers, San Francisco), 283–292.Google Scholar
- [49] Lattimore T, Szepesvári C(2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
- [50] (2015) Deep learning. Nature 521(7553):436–444.Crossref, Google Scholar
- [51] (2013) Reinforcement learning in robust Markov decision processes. Adv. Neural Inform. Processing Systems 26:701–709.Google Scholar
- [52] (2001a) Friend-or-foe Q-learning in general sum games. Brodley CE, Danyluk AP, eds., Proc. 18th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers, San Francisco), 322–328.Google Scholar
- [53] (2001b) Value-function reinforcement learning in Markov games. Cognitive Systems Res. 2(1):55–66.Crossref, Google Scholar
- [54] (1996) A generalized reinforcement-learning model: Convergence and applications. Saitta L, ed., Proc. 13th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers, San Francisco), 310–318.Google Scholar
- [55] (2021) A sharp analysis of model-based reinforcement learning with self-play. Meila M, Zhang T, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 7001–7010.Google Scholar
- [56] (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. Adv. Neural Inform. Processing Systems 30:6379–6390.Google Scholar
- [57] (1970) On stochastic games. J. Optim. Theory Appl. 5(4):289–300.Crossref, Google Scholar
- [58] (1971) On stochastic games, ii. J. Optim. Theory Appl. 8(2):154–160.Crossref, Google Scholar
- [59] (2017) Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356(6337):508–513.Crossref, Google Scholar
- [60] (1978) Strategically zero-sum games: The class of games whose completely mixed equilibria cannot be improved upon. Internat. J. Game Theory 7(3–4):201–221.Crossref, Google Scholar
- [61] (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(May):815–857.Google Scholar
- [62] Open AI (2018) OpenAI Five. Accessed June 21, 2022, https://openai.com/blog/openai-five/.Google Scholar
- [63] (2016) Generalization and exploration via randomized value functions. Balcan MF, Weinberger KQ, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 2377–2386.Google Scholar
- [64] Pananjady A, Wainwright MJ (2019) Value function estimation in Markov reward processes: Instance-dependent ℓ∞-bounds for policy evaluation. Preprint, submitted September 19, https://arxiv.org/abs/1909.08749.Google Scholar
- [65] (2008) Computing correlated equilibria in multi-player games. J. ACM 55(3):1–29.Crossref, Google Scholar
- [66] (2018) Actor-critic fictitious play in simultaneous move multistage games. Storkey A, Perez-Cruz F, eds., Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR), 919–928.Google Scholar
- [67] (2016b) On the use of non-stationary strategies for solving two-player zero-sum Markov games. Gretton A, Robert CC, eds., Proc. Artificial Intelligence Statist. (PMLR), 893–901.Google Scholar
- [68] (2015) Approximate dynamic programming for two-player zero-sum Markov games. Bach F, Blei D, eds., Proc. 32nd Internat. Conf. Machine Learn., vol. 37 (PMLR, Lille, France), 1321–1329.Google Scholar
- [69] (2017) Learning Nash equilibrium for general-sum Markov games from batch data. Singh A, Zhu J, eds., Proc. Artificial Intelligence Statist. (PMLR), 232–241.Google Scholar
- [70] (2016a) Softened approximate policy iteration for Markov games. Balcan MF, Weinberger KQ, eds., Proc. 33rd Internat. Conf. Machine Learn., vol. 48 (PMLR, New York), 1860–1868.Google Scholar
- [71] Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
- [72] (2019) Online stochastic shortest path with bandit feedback and unknown transition function. Adv. Neural Inform. Processing Systems 32:2209–2218.Google Scholar
- [73] (2019) Worst-case regret bounds for exploration via randomized value functions. Adv. Neural Inform. Processing Systems 32:14410–14420.Google Scholar
- [74] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. Preprint, submitted July 20, https://doi.org/10.48550/arXiv.1707.06347.Google Scholar
- [75] (2016) Safe, multi-agent, reinforcement learning for autonomous driving. Preprint, submitted October 11, https://doi.org/10.48550/arXiv.1610.03295.Google Scholar
- [76] (1953) Stochastic games. Proc. Natl. Acad. Sci. USA 39(10):1095–1100.Crossref, Google Scholar
- [77] (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. Chiappa S, Calandra R, eds., Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR), 2992–3002.Google Scholar
- [78] (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.Crossref, Google Scholar
- [79] (2017) Mastering the game of Go without human knowledge. Nature 550(7676):354–359.Crossref, Google Scholar
- [80] (2019) Non-asymptotic gap-dependent regret bounds for tabular MDPs. Adv. Neural Inform. Processing Systems 32:1151–1160.Google Scholar
- [81] (2018) Actor-critic policy optimization in partially observable multiagent environments. Adv. Neural Inform. Processing Systems 31:3422–3435.Google Scholar
- [82] (2006) PAC model-free reinforcement learning. Cohen W, Moore A, eds., Proc. 23rd Internat. Conf. Machine Learn. (Association for Computing Machinery, New York), 881–888.Google Scholar
- [83] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
- [84] (2021) Online learning in unknown Markov games. Meila M, Zhang T, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 10279–10288.Google Scholar
- [85] Vershynin R (2012) Introduction to the non-asymptotic analysis of random matrices. Eldar YC, Kutyniok G, eds. Compressed Sensing (Cambridge University Press, Cambridge, UK), 210–268.Crossref, Google Scholar
- [86] (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782):350–354.Crossref, Google Scholar
- [87] (2021) Optimism in reinforcement learning with generalized linear function approximation. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
- [88] (1992) Q-learning. Machine Learn. 30:279–292.Crossref, Google Scholar
- [89] (2017) Online reinforcement learning in stochastic games. Proc. Adv. Neural Inform. Processing Systems (Neural Information Processing Systems, Long Beach, CA), 4987–4997.Google Scholar
- [90] (2021) Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. Preprint, submitted February 8, https://doi.org/10.48550/arXiv.2102.04540.Google Scholar
- [91] (2017) Efficient reinforcement learning in deterministic systems with value function generalization. Math. Oper. Res. 42(3):762–782.Link, Google Scholar
- [92] (2020) Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. Abernethy J, Agarwal S, eds., Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR), 3674–3682.Google Scholar
- [93] (2012) Distributionally robust Markov decision processes. Math. Oper. Res. 37(2):288–300.Link, Google Scholar
- [94] (2019) Sample-optimal parametric Q-learning using linearly additive features. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 6995–7004.Google Scholar
- [95] (2020) Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 10746–10756.Google Scholar
- [96] (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 7304–7312.Google Scholar
- [97] (2020) Frequentist regret bounds for randomized least-squares value iteration. Chiappa S, Calandra R, eds., Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR), 1954–1964.Google Scholar
- [98] Zhang K, Yang Z, Başar T (2019) Multi-agent reinforcement learning: A selective overview of theories and algorithms. Preprint, submitted November 24, https://doi.org/10.48550/arXiv.1911.10635.Google Scholar
- [99] Zhang K, Kakade S, Basar T, Yang L (2020) Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. Adv. Neural Inform. Processing Systems 33:1166–1178.Google Scholar

