Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

Qiaomin Xie
Corresponding Author
Qiaomin Xie
[email protected]
https://orcid.org/0000-0003-2834-6866
Department of Industrial Systems and Engineering, University of Wisconsin-Madison, Madison, Wisconsin 53706;
Search for more papers by this author
,
Yudong Chen
Yudong Chen
[email protected]
https://orcid.org/0000-0002-6416-5635
Department of Computer Sciences, University of Wisconsin-Madison, Madison, Wisconsin 53706;
Search for more papers by this author
,
Zhaoran Wang
Zhaoran Wang
[email protected]
https://orcid.org/0000-0002-1824-2580
Department of Industrial Engineering and Management Sciences, Northwestern University Evanston, Illinois 60208;
Search for more papers by this author
,
Zhuoran Yang
Zhuoran Yang
[email protected]
https://orcid.org/0000-0001-5269-9958
Department of Statistics and Data Science, Yale University, New Haven, Connecticut 06511
Search for more papers by this author

Qiaomin Xie

Corresponding Author

Qiaomin Xie

[email protected]

https://orcid.org/0000-0003-2834-6866

Department of Industrial Systems and Engineering, University of Wisconsin-Madison, Madison, Wisconsin 53706;

Search for more papers by this author

Yudong Chen

[email protected]

https://orcid.org/0000-0002-6416-5635

Department of Computer Sciences, University of Wisconsin-Madison, Madison, Wisconsin 53706;

Search for more papers by this author

Zhaoran Wang

[email protected]

https://orcid.org/0000-0002-1824-2580

Department of Industrial Engineering and Management Sciences, Northwestern University Evanston, Illinois 60208;

Search for more papers by this author

Zhuoran Yang

[email protected]

https://orcid.org/0000-0001-5269-9958

Department of Statistics and Data Science, Yale University, New Haven, Connecticut 06511

Search for more papers by this author

Published Online:29 Jun 2022https://doi.org/10.1287/moor.2022.1268

References

[1] Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Adv. Neural Inform. Processing Systems 24:2312–2320.Google Scholar
[2] Abbasi-Yadkori Y, Lazic N, Szepesvari C, Weisz G (2019a) Exploration-enhanced POLITEX. Preprint, submitted August 27, https://doi.org/10.48550/arXiv.1908.10479.Google Scholar
[3] Abbasi-Yadkori Y, Bartlett P, Bhatia K, Lazic N, Szepesvari C, Weisz G (2019b) POLITEX: Regret bounds for policy iteration using expert prediction. Chaudhuri K, Salakhutdinov R, eds. Proc. Internat. Conf. Machine Learn. (PMLR), 3692–3702.Google Scholar
[4] Agrawal S, Jia R (2017) Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. Adv. Neural Inform. Processing Systems 30:1184–1194.Google Scholar
[5] Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Abernethy J, Agarwal S, eds. Proc. Conf. Learn. Theory (PMLR), 67–83.Google Scholar
[6] Aumann RJ (1987) Correlated equilibrium as an expression of Bayesian rationality. Econometrica 55:1–18.Crossref, Google Scholar
[7] Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Precup D, Whye Teh Y, eds., Proc. Internat. Conf. Machine Learn., vol. 70 (PMLR), 263–272.Google Scholar
[8] Bai Y, Jin C (2020) Provable self-play algorithms for competitive reinforcement learning. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 551–560.Google Scholar
[9] Bai Y, Jin C, Yu T (2020) Near-optimal reinforcement learning with self-play. Adv. Neural Inform. Processing Systems. 33:2159–2170.Google Scholar
[10] Baker B, Kanitscheider I, Markov T, Wu Y, Powell G, McGrew B, Mordatch I (2019) Emergent tool use from multi-agent autocurricula. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
[11] Bansal T, Pachocki J, Sidor S, Sutskever I, Mordatch I (2018) Emergent complexity via multi-agent competition. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
[12] Blum A, Hajiaghayi M, Ligett K, Roth A (2008) Regret minimization and the price of total anarchy. Dwork C, ed., Proc. 40th Annual ACM Symp. Theory Comput. (Association for Computing Machinery, New York), 373–382.Google Scholar
[13] Bradtke SJ, Barto AG (1996) Linear least-squares algorithms for temporal difference learning. Machine Learn. 22(1–3):33–57.Crossref, Google Scholar
[14] Brown N, Sandholm T (2018) Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374):418–424.Crossref, Google Scholar
[15] Brown N, Sandholm T (2019) Superhuman AI for multiplayer poker. Science 365(6456):885–890.Crossref, Google Scholar
[16] Busoniu L, Babuska R, Schutter BD (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Systems Man Cybernetics C 38(2):156–172.Crossref, Google Scholar
[17] Cai Q, Yang Z, Jin C, Wang Z (2020) Provably efficient exploration in policy optimization. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 1283–1294.Google Scholar
[18] Chen X, Deng X, Teng SH (2009) Settling the complexity of computing two-player Nash equilibria. J. ACM 56(3):1–57.Crossref, Google Scholar
[19] Chen Z, Zhou D, Gu Q (2021) Almost optimal algorithms for two-player markov games with linear function approximation. Preprint, submitted February 15, https://doi.org/10.48550/arXiv.2102.07404.Google Scholar
[20] Dann C, Lattimore T, Brunskill E (2017) Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Adv. Neural Inform. Processing Systems 30:5717–5727.Google Scholar
[21] Dann C, Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2018) On oracle-efficient PAC rl with rich observations. Adv. Neural Inform. Processing Systems 31:1422–1432.Google Scholar
[22] Daskalakis C, Foster DJ, Golowich N (2020) Independent policy gradient methods for competitive reinforcement learning. Adv. Neural Inform. Processing Systems 33:5527–5540.Google Scholar
[23] Daskalakis C, Goldberg PW, Papadimitriou CH (2009) The complexity of computing a Nash equilibrium. SIAM J. Comput. 39(1):195–259.Crossref, Google Scholar
[24] Davis T, Burch N, Bowling M (2014) Using response functions to measure strategy strength. Kambhampati S, ed., Proc. 28th AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA).Google Scholar
[25] Ding L, Chen Y (2020) Leave-one-out approach for matrix completion: Primal and dual analysis. IEEE Trans. Inform. Theory 66(11):7274–7301.Crossref, Google Scholar
[26] Dong K, Peng J, Wang Y, Zhou Y (2020) n-regret for learning in Markov decision processes with function approximation and low Bellman rank. Abernethy J, Agarwal S, eds., Proc. Conf. Learn. Theory (PMLR), 1554–1557.Google Scholar
[27] Dong K, Wang Y, Chen X, Wang L (2019) Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
[28] Du SS, Kakade SM, Wang R, Yang LF (2020) Is a good representation sufficient for sample efficient reinforcement learning? Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
[29] Du SS, Luo Y, Wang R, Zhang H (2019b) Provably efficient Q-learning with function approximation via distribution shift error checking oracle. Adv. Neural Inform. Processing Systems 32:8058–8068.Google Scholar
[30] Du SS, Krishnamurthy A, Jiang N, Agarwal A, Dudik M, Langford J (2019a) Provably efficient RL with rich observations via latent state decoding. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 1665–1674.Google Scholar
[31] Fan J, Wang Z, Xie Y, Yang Z (2020) A theoretical analysis of deep q-learning. Bayen AM, Jadbabaie A, Pappas G, Parrilo PA, Recht B, Tomlin C, Zeilinger M, eds., Proc. Learn. Dynamics Control (PMLR), 486–489.Google Scholar
[32] Foerster J, Assael IA, De Freitas N, Whiteson S (2016) Learning to communicate with deep multi-agent reinforcement learning. Adv. Neural Inform. Processing Systems 29:2137–2145.Google Scholar
[33] Goodfellow I, Bengio Y, Courville A (2016) Deep Learning (MIT Press, Cambridge, MA).Google Scholar
[34] Grau-Moya J, Leibfried F, Bou-Ammar H (2018) Balancing two-player stochastic games with soft Q-learning. Lang J, ed., Proc. 27th Internat. Joint Conf. Artificial Intelligence, (International Joint Conferences on Artificial Intelligence, California), 268–274.Google Scholar
[35] Greenwald A, Hall K, Serrano R (2003) Correlated Q-learning. Fawcett T, Mishra N, eds., Proc. 20th Internat. Conf. Machine Learn., vol. 20 (AAAI Press, Menlo Park, CA), 242–249.Google Scholar
[36] Hansen TD, Miltersen PB, Zwick U (2013) Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM 60(1):1–16.Crossref, Google Scholar
[37] Hu J, Wellman MP (2003) Nash Q-learning for general-sum stochastic games. J. Machine Learn. Res. 4(Nov):1039–1069.Google Scholar
[38] Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(Apr):1563–1600.Google Scholar
[39] Jaques N, Lazaridou A, Hughes E, Gulcehre C, Ortega P, Strouse D, Leibo JZ, De Freitas N (2019) Social influence as intrinsic motivation for multi-agent deep reinforcement learning. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 3040–3049.Google Scholar
[40] Jia Z, Yang LF, Wang M (2019) Feature-based Q-learning for two-player stochastic games. Preprint, submitted June 2, https://doi.org/10.48550/arXiv.1906.00423.Google Scholar
[41] Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2017) Contextual decision processes with low Bellman rank are PAC-learnable. Precup D, Whye Teh Y, eds., Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR), 1704–1713.Google Scholar
[42] Jin C, Liu Q, Yu T (2021) The power of exploiter: Provable multi-agent rl in large state spaces. Preprint, https://doi.org/10.48550/arXiv.2106.03352.Google Scholar
[43] Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Adv. Neural Inform. Processing Systems 31:4863–4873.Google Scholar
[44] Jin C, Yang Z, Wang Z, Jordan MI (2020a) Provably efficient reinforcement learning with linear function approximation. Abernethy J, Agarwal S, eds., Proc. Conf. Learn. Theory (PMLR), 2137–2143.Google Scholar
[45] Jin C, Jin T, Luo H, Sra S, Yu T (2020b) Learning adversarial Markov decision processes with bandit feedback and unknown transition. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 4860–4869.Google Scholar
[46] Kakade SM (2003) On the sample complexity of reinforcement learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London.Google Scholar
[47] Konda VR, Tsitsiklis JN (1999) Actor-critic algorithms. Adv. Neural Inform. Processing Systems 12:1008–1014.Google Scholar
[48] Lagoudakis MG, Parr R (2002) Value function approximation in zero-sum Markov games. Darwiche A, Friedman N, eds., Proc. 18th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers, San Francisco), 283–292.Google Scholar
[49] Lattimore T, Szepesvári C(2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
[50] LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444.Crossref, Google Scholar
[51] Lim SH, Xu H, Mannor S (2013) Reinforcement learning in robust Markov decision processes. Adv. Neural Inform. Processing Systems 26:701–709.Google Scholar
[52] Littman ML (2001a) Friend-or-foe Q-learning in general sum games. Brodley CE, Danyluk AP, eds., Proc. 18th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers, San Francisco), 322–328.Google Scholar
[53] Littman ML (2001b) Value-function reinforcement learning in Markov games. Cognitive Systems Res. 2(1):55–66.Crossref, Google Scholar
[54] Littman ML, Szepesvari C (1996) A generalized reinforcement-learning model: Convergence and applications. Saitta L, ed., Proc. 13th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers, San Francisco), 310–318.Google Scholar
[55] Liu Q, Yu T, Bai Y, Jin C (2021) A sharp analysis of model-based reinforcement learning with self-play. Meila M, Zhang T, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 7001–7010.Google Scholar
[56] Lowe R, Wu Y, Tamar A, Harb J, Abbeel OP, Mordatch I (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. Adv. Neural Inform. Processing Systems 30:6379–6390.Google Scholar
[57] Maitra A, Parthasarathy T (1970) On stochastic games. J. Optim. Theory Appl. 5(4):289–300.Crossref, Google Scholar
[58] Maitra A, Parthasarathy T (1971) On stochastic games, ii. J. Optim. Theory Appl. 8(2):154–160.Crossref, Google Scholar
[59] Moravčík M, Schmid M, Burch N, Lisý V, Morrill D, Bard N, Davis T, Waugh K, Johanson M, Bowling M (2017) Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356(6337):508–513.Crossref, Google Scholar
[60] Moulin H, Vial JP (1978) Strategically zero-sum games: The class of games whose completely mixed equilibria cannot be improved upon. Internat. J. Game Theory 7(3–4):201–221.Crossref, Google Scholar
[61] Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(May):815–857.Google Scholar
[62] Open AI (2018) OpenAI Five. Accessed June 21, 2022, https://openai.com/blog/openai-five/.Google Scholar
[63] Osband I, Van Roy B, Wen Z (2016) Generalization and exploration via randomized value functions. Balcan MF, Weinberger KQ, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 2377–2386.Google Scholar
[64] Pananjady A, Wainwright MJ (2019) Value function estimation in Markov reward processes: Instance-dependent ℓ∞-bounds for policy evaluation. Preprint, submitted September 19, https://arxiv.org/abs/1909.08749.Google Scholar
[65] Papadimitriou CH, Roughgarden T (2008) Computing correlated equilibria in multi-player games. J. ACM 55(3):1–29.Crossref, Google Scholar
[66] Pérolat J, Piot B, Pietquin O (2018) Actor-critic fictitious play in simultaneous move multistage games. Storkey A, Perez-Cruz F, eds., Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR), 919–928.Google Scholar
[67] Pérolat J, Piot B, Scherrer B, Pietquin O (2016b) On the use of non-stationary strategies for solving two-player zero-sum Markov games. Gretton A, Robert CC, eds., Proc. Artificial Intelligence Statist. (PMLR), 893–901.Google Scholar
[68] Pérolat J, Scherrer B, Piot B, Pietquin O (2015) Approximate dynamic programming for two-player zero-sum Markov games. Bach F, Blei D, eds., Proc. 32nd Internat. Conf. Machine Learn., vol. 37 (PMLR, Lille, France), 1321–1329.Google Scholar
[69] Pérolat J, Strub F, Piot B, Pietquin O (2017) Learning Nash equilibrium for general-sum Markov games from batch data. Singh A, Zhu J, eds., Proc. Artificial Intelligence Statist. (PMLR), 232–241.Google Scholar
[70] Pérolat J, Piot B, Geist M, Scherrer B, Pietquin O (2016a) Softened approximate policy iteration for Markov games. Balcan MF, Weinberger KQ, eds., Proc. 33rd Internat. Conf. Machine Learn., vol. 48 (PMLR, New York), 1860–1868.Google Scholar
[71] Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
[72] Rosenberg A, Mansour Y (2019) Online stochastic shortest path with bandit feedback and unknown transition function. Adv. Neural Inform. Processing Systems 32:2209–2218.Google Scholar
[73] Russo D (2019) Worst-case regret bounds for exploration via randomized value functions. Adv. Neural Inform. Processing Systems 32:14410–14420.Google Scholar
[74] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. Preprint, submitted July 20, https://doi.org/10.48550/arXiv.1707.06347.Google Scholar
[75] Shalev-Shwartz S, Shammah S, Shashua A (2016) Safe, multi-agent, reinforcement learning for autonomous driving. Preprint, submitted October 11, https://doi.org/10.48550/arXiv.1610.03295.Google Scholar
[76] Shapley LS (1953) Stochastic games. Proc. Natl. Acad. Sci. USA 39(10):1095–1100.Crossref, Google Scholar
[77] Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. Chiappa S, Calandra R, eds., Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR), 2992–3002.Google Scholar
[78] Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.Crossref, Google Scholar
[79] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A (2017) Mastering the game of Go without human knowledge. Nature 550(7676):354–359.Crossref, Google Scholar
[80] Simchowitz M, Jamieson KG (2019) Non-asymptotic gap-dependent regret bounds for tabular MDPs. Adv. Neural Inform. Processing Systems 32:1151–1160.Google Scholar
[81] Srinivasan S, Lanctot M, Zambaldi V, Pérolat J, Tuyls K, Munos R, Bowling M (2018) Actor-critic policy optimization in partially observable multiagent environments. Adv. Neural Inform. Processing Systems 31:3422–3435.Google Scholar
[82] Strehl AL, Li L, Wiewiora E, Langford J, Littman ML (2006) PAC model-free reinforcement learning. Cohen W, Moore A, eds., Proc. 23rd Internat. Conf. Machine Learn. (Association for Computing Machinery, New York), 881–888.Google Scholar
[83] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
[84] Tian Y, Wang Y, Yu T, Sra S (2021) Online learning in unknown Markov games. Meila M, Zhang T, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 10279–10288.Google Scholar
[85] Vershynin R (2012) Introduction to the non-asymptotic analysis of random matrices. Eldar YC, Kutyniok G, eds. Compressed Sensing (Cambridge University Press, Cambridge, UK), 210–268.Crossref, Google Scholar
[86] Vinyals O, Babuschkin I, Czarnecki WM, Mathieu M, Dudzik A, Chung J, Choi DH, et al. (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782):350–354.Crossref, Google Scholar
[87] Wang Y, Wang R, Du SS, Krishnamurthy A (2021) Optimism in reinforcement learning with generalized linear function approximation. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
[88] Watkins CJ, Dayan P (1992) Q-learning. Machine Learn. 30:279–292.Crossref, Google Scholar
[89] Wei CY, Hong YT, Lu CJ (2017) Online reinforcement learning in stochastic games. Proc. Adv. Neural Inform. Processing Systems (Neural Information Processing Systems, Long Beach, CA), 4987–4997.Google Scholar
[90] Wei CY, Lee CW, Zhang M, Luo H (2021) Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. Preprint, submitted February 8, https://doi.org/10.48550/arXiv.2102.04540.Google Scholar
[91] Wen Z, Van Roy B (2017) Efficient reinforcement learning in deterministic systems with value function generalization. Math. Oper. Res. 42(3):762–782.Link, Google Scholar
[92] Xie Q, Chen Y, Wang Z, Yang Z (2020) Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. Abernethy J, Agarwal S, eds., Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR), 3674–3682.Google Scholar
[93] Xu H, Mannor S (2012) Distributionally robust Markov decision processes. Math. Oper. Res. 37(2):288–300.Link, Google Scholar
[94] Yang L, Wang M (2019) Sample-optimal parametric Q-learning using linearly additive features. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 6995–7004.Google Scholar
[95] Yang L, Wang M (2020) Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. Daume III H, Singh A, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 10746–10756.Google Scholar
[96] Zanette A, Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Chaudhuri K, Salakhutdinov R, eds., Proc. Internat. Conf. Machine Learn. (PMLR), 7304–7312.Google Scholar
[97] Zanette A, Brandfonbrener D, Brunskill E, Pirotta M, Lazaric A (2020) Frequentist regret bounds for randomized least-squares value iteration. Chiappa S, Calandra R, eds., Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR), 1954–1964.Google Scholar
[98] Zhang K, Yang Z, Başar T (2019) Multi-agent reinforcement learning: A selective overview of theories and algorithms. Preprint, submitted November 24, https://doi.org/10.48550/arXiv.1911.10635.Google Scholar
[99] Zhang K, Kakade S, Basar T, Yang L (2020) Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. Adv. Neural Inform. Processing Systems 33:1166–1178.Google Scholar

cover image Mathematics of Operations Research

Volume 48, Issue 1

February 2023

Pages 1-602, C2

Article Information

Metrics

Information

Received:May 29, 2020
Accepted:February 12, 2022
Published Online:June 29, 2022

Cite as

Qiaomin Xie Yudong Chen, Zhaoran Wang, Zhuoran Yang (2022) Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium. Mathematics of Operations Research 48(1):433-462.

https://doi.org/10.1287/moor.2022.1268

Keywords

Acknowledgments

The authors thank the anonymous reviewers at the Conference on Learning Theory 2020 for their helpful comments, and Siddhartha Banerjee for inspiring discussion.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

References

Volume 48, Issue 1

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News