Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

Published Online:https://doi.org/10.1287/moor.2022.1268

References

  • [1] Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Adv. Neural Inform. Processing Systems 24:2312–2320.Google Scholar
  • [2] Abbasi-Yadkori Y, Lazic N, Szepesvari C, Weisz G (2019a) Exploration-enhanced POLITEX. Preprint, submitted August 27, https://doi.org/10.48550/arXiv.1908.10479.Google Scholar
  • [3] Abbasi-Yadkori Y, Bartlett P, Bhatia K, Lazic N, Szepesvari C, Weisz G (2019b) POLITEX: Regret bounds for policy iteration using expert prediction. Chaudhuri K, Salakhutdinov R, eds. Proc. Internat. Conf. Machine Learn. (PMLR), 3692–3702.Google Scholar
  • [4] Agrawal S, Jia R (2017) Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. Adv. Neural Inform. Processing Systems 30:1184–1194.Google Scholar
  • [5] Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Abernethy J, Agarwal S, eds. Proc. Conf. Learn. Theory (PMLR), 67–83.Google Scholar
  • [6] Aumann RJ (1987) Correlated equilibrium as an expression of Bayesian rationality. Econometrica 55:1–18.CrossrefGoogle Scholar
  • [7] Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Precup D, Whye Teh Y, eds., Proc. Internat. Conf. Machine Learn., vol. 70 (PMLR), 263–272.Google Scholar
  • [8] Bai Y, Jin C (2020) Provable self-play algorithms for competitive reinforcement learning. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 551–560.Google Scholar
  • [9] Bai Y, Jin C, Yu T (2020) Near-optimal reinforcement learning with self-play. Adv. Neural Inform. Processing Systems. 33:2159–2170.Google Scholar
  • [10] Baker B, Kanitscheider I, Markov T, Wu Y, Powell G, McGrew B, Mordatch I (2019) Emergent tool use from multi-agent autocurricula. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
  • [11] Bansal T, Pachocki J, Sidor S, Sutskever I, Mordatch I (2018) Emergent complexity via multi-agent competition. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
  • [12] Blum A, Hajiaghayi M, Ligett K, Roth A (2008) Regret minimization and the price of total anarchy. Dwork C, ed., Proc. 40th Annual ACM Symp. Theory Comput. (Association for Computing Machinery, New York), 373–382.Google Scholar
  • [13] Bradtke SJ, Barto AG (1996) Linear least-squares algorithms for temporal difference learning. Machine Learn. 22(1–3):33–57.CrossrefGoogle Scholar
  • [14] Brown N, Sandholm T (2018) Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374):418–424.CrossrefGoogle Scholar
  • [15] Brown N, Sandholm T (2019) Superhuman AI for multiplayer poker. Science 365(6456):885–890.CrossrefGoogle Scholar
  • [16] Busoniu L, Babuska R, Schutter BD (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Systems Man Cybernetics C 38(2):156–172.CrossrefGoogle Scholar
  • [17] Cai Q, Yang Z, Jin C, Wang Z (2020) Provably efficient exploration in policy optimization. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 1283–1294.Google Scholar
  • [18] Chen X, Deng X, Teng SH (2009) Settling the complexity of computing two-player Nash equilibria. J. ACM 56(3):1–57.CrossrefGoogle Scholar
  • [19] Chen Z, Zhou D, Gu Q (2021) Almost optimal algorithms for two-player markov games with linear function approximation. Preprint, submitted February 15, https://doi.org/10.48550/arXiv.2102.07404.Google Scholar
  • [20] Dann C, Lattimore T, Brunskill E (2017) Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Adv. Neural Inform. Processing Systems 30:5717–5727.Google Scholar
  • [21] Dann C, Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2018) On oracle-efficient PAC rl with rich observations. Adv. Neural Inform. Processing Systems 31:1422–1432.Google Scholar
  • [22] Daskalakis C, Foster DJ, Golowich N (2020) Independent policy gradient methods for competitive reinforcement learning. Adv. Neural Inform. Processing Systems 33:5527–5540.Google Scholar
  • [23] Daskalakis C, Goldberg PW, Papadimitriou CH (2009) The complexity of computing a Nash equilibrium. SIAM J. Comput. 39(1):195–259.CrossrefGoogle Scholar
  • [24] Davis T, Burch N, Bowling M (2014) Using response functions to measure strategy strength. Kambhampati S, ed., Proc. 28th AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA).Google Scholar
  • [25] Ding L, Chen Y (2020) Leave-one-out approach for matrix completion: Primal and dual analysis. IEEE Trans. Inform. Theory 66(11):7274–7301.CrossrefGoogle Scholar
  • [26] Dong K, Peng J, Wang Y, Zhou Y (2020) n-regret for learning in Markov decision processes with function approximation and low Bellman rank. Abernethy J, Agarwal S, eds., Proc. Conf. Learn. Theory (PMLR), 1554–1557.Google Scholar
  • [27] Dong K, Wang Y, Chen X, Wang L (2019) Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
  • [28] Du SS, Kakade SM, Wang R, Yang LF (2020) Is a good representation sufficient for sample efficient reinforcement learning? Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
  • [29] Du SS, Luo Y, Wang R, Zhang H (2019b) Provably efficient Q-learning with function approximation via distribution shift error checking oracle. Adv. Neural Inform. Processing Systems 32:8058–8068.Google Scholar
  • [30] Du SS, Krishnamurthy A, Jiang N, Agarwal A, Dudik M, Langford J (2019a) Provably efficient RL with rich observations via latent state decoding. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 1665–1674.Google Scholar
  • [31] Fan J, Wang Z, Xie Y, Yang Z (2020) A theoretical analysis of deep q-learning. Bayen AM, Jadbabaie A, Pappas G, Parrilo PA, Recht B, Tomlin C, Zeilinger M, eds., Proc. Learn. Dynamics Control (PMLR), 486–489.Google Scholar
  • [32] Foerster J, Assael IA, De Freitas N, Whiteson S (2016) Learning to communicate with deep multi-agent reinforcement learning. Adv. Neural Inform. Processing Systems 29:2137–2145.Google Scholar
  • [33] Goodfellow I, Bengio Y, Courville A (2016) Deep Learning (MIT Press, Cambridge, MA).Google Scholar
  • [34] Grau-Moya J, Leibfried F, Bou-Ammar H (2018) Balancing two-player stochastic games with soft Q-learning. Lang J, ed., Proc. 27th Internat. Joint Conf. Artificial Intelligence, (International Joint Conferences on Artificial Intelligence, California), 268–274.Google Scholar
  • [35] Greenwald A, Hall K, Serrano R (2003) Correlated Q-learning. Fawcett T, Mishra N, eds., Proc. 20th Internat. Conf. Machine Learn., vol. 20 (AAAI Press, Menlo Park, CA), 242–249.Google Scholar
  • [36] Hansen TD, Miltersen PB, Zwick U (2013) Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM 60(1):1–16.CrossrefGoogle Scholar
  • [37] Hu J, Wellman MP (2003) Nash Q-learning for general-sum stochastic games. J. Machine Learn. Res. 4(Nov):1039–1069.Google Scholar
  • [38] Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(Apr):1563–1600.Google Scholar
  • [39] Jaques N, Lazaridou A, Hughes E, Gulcehre C, Ortega P, Strouse D, Leibo JZ, De Freitas N (2019) Social influence as intrinsic motivation for multi-agent deep reinforcement learning. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 3040–3049.Google Scholar
  • [40] Jia Z, Yang LF, Wang M (2019) Feature-based Q-learning for two-player stochastic games. Preprint, submitted June 2, https://doi.org/10.48550/arXiv.1906.00423.Google Scholar
  • [41] Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2017) Contextual decision processes with low Bellman rank are PAC-learnable. Precup D, Whye Teh Y, eds., Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR), 1704–1713.Google Scholar
  • [42] Jin C, Liu Q, Yu T (2021) The power of exploiter: Provable multi-agent rl in large state spaces. Preprint, https://doi.org/10.48550/arXiv.2106.03352.Google Scholar
  • [43] Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Adv. Neural Inform. Processing Systems 31:4863–4873.Google Scholar
  • [44] Jin C, Yang Z, Wang Z, Jordan MI (2020a) Provably efficient reinforcement learning with linear function approximation. Abernethy J, Agarwal S, eds., Proc. Conf. Learn. Theory (PMLR), 2137–2143.Google Scholar
  • [45] Jin C, Jin T, Luo H, Sra S, Yu T (2020b) Learning adversarial Markov decision processes with bandit feedback and unknown transition. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 4860–4869.Google Scholar
  • [46] Kakade SM (2003) On the sample complexity of reinforcement learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London.Google Scholar
  • [47] Konda VR, Tsitsiklis JN (1999) Actor-critic algorithms. Adv. Neural Inform. Processing Systems 12:1008–1014.Google Scholar
  • [48] Lagoudakis MG, Parr R (2002) Value function approximation in zero-sum Markov games. Darwiche A, Friedman N, eds., Proc. 18th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers, San Francisco), 283–292.Google Scholar
  • [49] Lattimore T, Szepesvári C(2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
  • [50] LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444.CrossrefGoogle Scholar
  • [51] Lim SH, Xu H, Mannor S (2013) Reinforcement learning in robust Markov decision processes. Adv. Neural Inform. Processing Systems 26:701–709.Google Scholar
  • [52] Littman ML (2001a) Friend-or-foe Q-learning in general sum games. Brodley CE, Danyluk AP, eds., Proc. 18th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers, San Francisco), 322–328.Google Scholar
  • [53] Littman ML (2001b) Value-function reinforcement learning in Markov games. Cognitive Systems Res. 2(1):55–66.CrossrefGoogle Scholar
  • [54] Littman ML, Szepesvari C (1996) A generalized reinforcement-learning model: Convergence and applications. Saitta L, ed., Proc. 13th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers, San Francisco), 310–318.Google Scholar
  • [55] Liu Q, Yu T, Bai Y, Jin C (2021) A sharp analysis of model-based reinforcement learning with self-play. Meila M, Zhang T, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 7001–7010.Google Scholar
  • [56] Lowe R, Wu Y, Tamar A, Harb J, Abbeel OP, Mordatch I (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. Adv. Neural Inform. Processing Systems 30:6379–6390.Google Scholar
  • [57] Maitra A, Parthasarathy T (1970) On stochastic games. J. Optim. Theory Appl. 5(4):289–300.CrossrefGoogle Scholar
  • [58] Maitra A, Parthasarathy T (1971) On stochastic games, ii. J. Optim. Theory Appl. 8(2):154–160.CrossrefGoogle Scholar
  • [59] Moravčík M, Schmid M, Burch N, Lisý V, Morrill D, Bard N, Davis T, Waugh K, Johanson M, Bowling M (2017) Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356(6337):508–513.CrossrefGoogle Scholar
  • [60] Moulin H, Vial JP (1978) Strategically zero-sum games: The class of games whose completely mixed equilibria cannot be improved upon. Internat. J. Game Theory 7(3–4):201–221.CrossrefGoogle Scholar
  • [61] Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(May):815–857.Google Scholar
  • [62] Open AI (2018) OpenAI Five. Accessed June 21, 2022, https://openai.com/blog/openai-five/.Google Scholar
  • [63] Osband I, Van Roy B, Wen Z (2016) Generalization and exploration via randomized value functions. Balcan MF, Weinberger KQ, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 2377–2386.Google Scholar
  • [64] Pananjady A, Wainwright MJ (2019) Value function estimation in Markov reward processes: Instance-dependent ℓ∞-bounds for policy evaluation. Preprint, submitted September 19, https://arxiv.org/abs/1909.08749.Google Scholar
  • [65] Papadimitriou CH, Roughgarden T (2008) Computing correlated equilibria in multi-player games. J. ACM 55(3):1–29.CrossrefGoogle Scholar
  • [66] Pérolat J, Piot B, Pietquin O (2018) Actor-critic fictitious play in simultaneous move multistage games. Storkey A, Perez-Cruz F, eds., Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR), 919–928.Google Scholar
  • [67] Pérolat J, Piot B, Scherrer B, Pietquin O (2016b) On the use of non-stationary strategies for solving two-player zero-sum Markov games. Gretton A, Robert CC, eds., Proc. Artificial Intelligence Statist. (PMLR), 893–901.Google Scholar
  • [68] Pérolat J, Scherrer B, Piot B, Pietquin O (2015) Approximate dynamic programming for two-player zero-sum Markov games. Bach F, Blei D, eds., Proc. 32nd Internat. Conf. Machine Learn., vol. 37 (PMLR, Lille, France), 1321–1329.Google Scholar
  • [69] Pérolat J, Strub F, Piot B, Pietquin O (2017) Learning Nash equilibrium for general-sum Markov games from batch data. Singh A, Zhu J, eds., Proc. Artificial Intelligence Statist. (PMLR), 232–241.Google Scholar
  • [70] Pérolat J, Piot B, Geist M, Scherrer B, Pietquin O (2016a) Softened approximate policy iteration for Markov games. Balcan MF, Weinberger KQ, eds., Proc. 33rd Internat. Conf. Machine Learn., vol. 48 (PMLR, New York), 1860–1868.Google Scholar
  • [71] Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • [72] Rosenberg A, Mansour Y (2019) Online stochastic shortest path with bandit feedback and unknown transition function. Adv. Neural Inform. Processing Systems 32:2209–2218.Google Scholar
  • [73] Russo D (2019) Worst-case regret bounds for exploration via randomized value functions. Adv. Neural Inform. Processing Systems 32:14410–14420.Google Scholar
  • [74] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. Preprint, submitted July 20, https://doi.org/10.48550/arXiv.1707.06347.Google Scholar
  • [75] Shalev-Shwartz S, Shammah S, Shashua A (2016) Safe, multi-agent, reinforcement learning for autonomous driving. Preprint, submitted October 11, https://doi.org/10.48550/arXiv.1610.03295.Google Scholar
  • [76] Shapley LS (1953) Stochastic games. Proc. Natl. Acad. Sci. USA 39(10):1095–1100.CrossrefGoogle Scholar
  • [77] Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. Chiappa S, Calandra R, eds., Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR), 2992–3002.Google Scholar
  • [78] Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.CrossrefGoogle Scholar
  • [79] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A (2017) Mastering the game of Go without human knowledge. Nature 550(7676):354–359.CrossrefGoogle Scholar
  • [80] Simchowitz M, Jamieson KG (2019) Non-asymptotic gap-dependent regret bounds for tabular MDPs. Adv. Neural Inform. Processing Systems 32:1151–1160.Google Scholar
  • [81] Srinivasan S, Lanctot M, Zambaldi V, Pérolat J, Tuyls K, Munos R, Bowling M (2018) Actor-critic policy optimization in partially observable multiagent environments. Adv. Neural Inform. Processing Systems 31:3422–3435.Google Scholar
  • [82] Strehl AL, Li L, Wiewiora E, Langford J, Littman ML (2006) PAC model-free reinforcement learning. Cohen W, Moore A, eds., Proc. 23rd Internat. Conf. Machine Learn. (Association for Computing Machinery, New York), 881–888.Google Scholar
  • [83] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • [84] Tian Y, Wang Y, Yu T, Sra S (2021) Online learning in unknown Markov games. Meila M, Zhang T, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 10279–10288.Google Scholar
  • [85] Vershynin R (2012) Introduction to the non-asymptotic analysis of random matrices. Eldar YC, Kutyniok G, eds. Compressed Sensing (Cambridge University Press, Cambridge, UK), 210–268.CrossrefGoogle Scholar
  • [86] Vinyals O, Babuschkin I, Czarnecki WM, Mathieu M, Dudzik A, Chung J, Choi DH, et al. (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782):350–354.CrossrefGoogle Scholar
  • [87] Wang Y, Wang R, Du SS, Krishnamurthy A (2021) Optimism in reinforcement learning with generalized linear function approximation. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
  • [88] Watkins CJ, Dayan P (1992) Q-learning. Machine Learn. 30:279–292.CrossrefGoogle Scholar
  • [89] Wei CY, Hong YT, Lu CJ (2017) Online reinforcement learning in stochastic games. Proc. Adv. Neural Inform. Processing Systems (Neural Information Processing Systems, Long Beach, CA), 4987–4997.Google Scholar
  • [90] Wei CY, Lee CW, Zhang M, Luo H (2021) Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. Preprint, submitted February 8, https://doi.org/10.48550/arXiv.2102.04540.Google Scholar
  • [91] Wen Z, Van Roy B (2017) Efficient reinforcement learning in deterministic systems with value function generalization. Math. Oper. Res. 42(3):762–782.LinkGoogle Scholar
  • [92] Xie Q, Chen Y, Wang Z, Yang Z (2020) Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. Abernethy J, Agarwal S, eds., Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR), 3674–3682.Google Scholar
  • [93] Xu H, Mannor S (2012) Distributionally robust Markov decision processes. Math. Oper. Res. 37(2):288–300.LinkGoogle Scholar
  • [94] Yang L, Wang M (2019) Sample-optimal parametric Q-learning using linearly additive features. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 6995–7004.Google Scholar
  • [95] Yang L, Wang M (2020) Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 10746–10756.Google Scholar
  • [96] Zanette A, Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 7304–7312.Google Scholar
  • [97] Zanette A, Brandfonbrener D, Brunskill E, Pirotta M, Lazaric A (2020) Frequentist regret bounds for randomized least-squares value iteration. Chiappa S, Calandra R, eds., Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR), 1954–1964.Google Scholar
  • [98] Zhang K, Yang Z, Başar T (2019) Multi-agent reinforcement learning: A selective overview of theories and algorithms. Preprint, submitted November 24, https://doi.org/10.48550/arXiv.1911.10635.Google Scholar
  • [99] Zhang K, Kakade S, Basar T, Yang L (2020) Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. Adv. Neural Inform. Processing Systems 33:1166–1178.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.