V-Learning—A Simple, Efficient, Decentralized Algorithm for Multiagent Reinforcement Learning

Chi Jin
Chi Jin
[email protected]
Department of Electrical and Computer Engineering, Princeton University, Princeton, New Jersey 08544;
Search for more papers by this author
,
Qinghua Liu
Corresponding Author
Qinghua Liu
[email protected]
https://orcid.org/0000-0001-9418-7467
Department of Electrical and Computer Engineering, Princeton University, Princeton, New Jersey 08544;
Search for more papers by this author
,
Yuanhao Wang
Yuanhao Wang
[email protected]
https://orcid.org/0000-0001-8979-0675
Department of Computer Science, Princeton University, Princeton, New Jersey 08544;
Search for more papers by this author
,
Tiancheng Yu
Tiancheng Yu
[email protected]
Department of Electrical and Computer Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Search for more papers by this author

Chi Jin

[email protected]

Department of Electrical and Computer Engineering, Princeton University, Princeton, New Jersey 08544;

Search for more papers by this author

Qinghua Liu

Corresponding Author

Qinghua Liu

[email protected]

https://orcid.org/0000-0001-9418-7467

Department of Electrical and Computer Engineering, Princeton University, Princeton, New Jersey 08544;

Search for more papers by this author

Yuanhao Wang

[email protected]

https://orcid.org/0000-0001-8979-0675

Department of Computer Science, Princeton University, Princeton, New Jersey 08544;

Search for more papers by this author

Tiancheng Yu

[email protected]

Department of Electrical and Computer Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Search for more papers by this author

Published Online:17 Nov 2023https://doi.org/10.1287/moor.2021.0317

References

[1] Agarwal A, Krishnamurthy A, Langford J, Luo H (2017) Open problem: First-order regret bounds for contextual bandits. Kale S, Shamir O, eds. Conf. Learn. Theory, vol. 65 (PMLR, Cambridge, MA), 4–7.Google Scholar
[2] Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR, Cambridge, MA), 263–272.Google Scholar
[3] Bai Y, Jin C (2020) Provable self-play algorithms for competitive reinforcement learning. Bai Y, Jin C, eds. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, Cambridge, MA), 551–560.Google Scholar
[4] Bai Y, Jin C, Yu T (2020) Near-optimal reinforcement learning with self-play. Larochelle H, Ranzato MA, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Neural Information Processing Systems Foundation, La Jolla, CA), 2159–2170.Google Scholar
[5] Baker B, Kanitscheider I, Markov T, Wu Y, Powell G, McGrew B, Mordatch I (2020) Emergent tool use from multi-agent autocurricula. 8th Internat. Conf. Learn. Representations.Google Scholar
[6] Blum A, Mansour Y (2007) From external to internal regret. J. Machine Learn. Res. 8(6):1307–1324.Google Scholar
[7] Brambilla M, Ferrante E, Birattari M, Dorigo M (2013) Swarm robotics: A review from the swarm engineering perspective. Swarm Intelligence 7(1):1–41.Crossref, Google Scholar
[8] Brown N, Sandholm T (2018) Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374):418–424.Crossref, Google Scholar
[9] Brown N, Sandholm T (2019) Superhuman AI for multiplayer poker. Science 365(6456):885–890.Crossref, Google Scholar
[10] Celli A, Marchesi A, Farina G, Gatti N (2020) No-regret learning dynamics for extensive-form correlated equilibrium. Larochelle H, Ranzato MA, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Neural Information Processing Systems Foundation, La Jolla, CA), 7722–7732.Google Scholar
[11] Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[12] Dann C, Lattimore T, Brunskill E (2017) Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Guyon I, Luxburg Uv, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Neural Information Processing Systems Foundation, La Jolla, CA), 5713–5723.Google Scholar
[13] Daskalakis C (2013) On the complexity of approximating a Nash equilibrium. ACM Trans. Algorithms 9(3):23.Crossref, Google Scholar
[14] Daskalakis C, Foster DJ, Golowich N (2021) Independent policy gradient methods for competitive reinforcement learning. Larochelle H, Ranzato MA, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Neural Information Processing Systems Foundation, La Jolla, CA), 5527–5540.Google Scholar
[15] Daskalakis C, Golowich N, Zhang K (2022) The complexity of Markov equilibrium in stochastic games. Neu G, Rosasco L, eds. 36th Annual Conf. Theory, vol. 195 (PMLR, Cambridge, MA), 4180–4234.Google Scholar
[16] Erez L, Lancewicki T, Sherman U, Koren T, Mansour Y (2022) Regret minimization and convergence to equilibria in general-sum Markov games. Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, Scarlett J, eds. Internat. Conf. Machine Learn., vol. 202 (PMLR, Cambridge, MA), 9343–9373.Google Scholar
[17] Gilpin A, Sandholm T (2006) Finding equilibria in large sequential games of imperfect information. Feigenbaum J, Chuang JCI, Pennock DM, eds. Proc. 7th ACM Conf. Electronic Commerce (ACM), 160–169.Google Scholar
[18] Hansen TD, Miltersen PB, Zwick U (2013) Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM 60(1):1–16.Crossref, Google Scholar
[19] Hart S, Mas-Colell A (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5):1127–1150.Crossref, Google Scholar
[20] Hu J, Wellman MP (2003) Nash Q-learning for general-sum stochastic games. J. Machine Learn. Res. 4(November):1039–1069.Google Scholar
[21] Iqbal S, Sha F (2019) Actor-attention-critic for multi-agent reinforcement learning. Chaudhuri K, Salakhutdinov R, eds. Internat. Conf. Machine Learn., vol. 97 (PMLR, Cambridge, MA), 2961–2970.Google Scholar
[22] Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(April):1563–1600.Google Scholar
[23] Jia Z, Yang LF, Wang M (2019) Feature-based Q-learning for two-player stochastic games. Preprint, submitted June 2, https://arxiv.org/abs/1906.00423.Google Scholar
[24] Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2017) Contextual decision processes with low bellman rank are PAC-learnable. Precup D, Teh YW, eds. Internat. Conf. Machine Learn., vol. 70 (PMLR, Cambridge, MA), 1704–1713.Google Scholar
[25] Jin C, Liu Q, Miryoosefi S (2021) Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Ranzato MA, Beygelzimer B, Dauphin YN, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Neural Information Processing Systems Foundation, La Jolla, CA), 13406–13418.Google Scholar
[26] Jin C, Liu Q, Yu T (2021) The power of exploiter: Provable multi-agent RL in large state spaces. Chaudhuri K, Jegelka S, Song L, Szepesvári C, Niu G, Sabato S. eds. Internat. Conf. Machine Learn., vol. 162 (PMLR, Cambridge, MA), 10251–10279.Google Scholar
[27] Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 31 (Neural Information Processing Systems Foundation, La Jolla, CA), 4868–4878.Google Scholar
[28] Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Abernethy JD, Agarwal S, eds. Conf. Learn. Theory, vol. 125 (PMLR, Cambridge, MA), 2137–2143.Google Scholar
[29] Kao H, Wei CY, Subramanian V (2022) Decentralized cooperative reinforcement learning with hierarchical information structure. Dasgupta S, Haghtalab N, eds. Internat. Conf. Algorithmic Learn. Theory, vol. 167 (PMLR, Cambridge, MA), 573–605.Google Scholar
[30] Koller D, Megiddo N (1992) The complexity of two-person zero-sum games in extensive form. Games Econom. Behav. 4(4):528–552.Crossref, Google Scholar
[31] Lattimore T, Szepesvári C (2018) Bandit algorithms. Working paper, Cambridge University Press, Cambridge.Google Scholar
[32] Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. Cohen WW, Hirsh H, eds. Machine Learning Proceedings 1994 (Morgan Kaufmann, Elsevier, Amsterdam), 157–163.Crossref, Google Scholar
[33] Littman ML (2001) Friend-or-foe Q-learning in general-sum games. Brodley CE, Danyluk AP, eds. Proc. 18th Internat. Conf. Machine Learn. (Morgan Kaufmann, Elsevier, Amsterdam), 322–328.Google Scholar
[34] Liu Q, Wang Y, Jin C (2022) Learning Markov games with adversarial opponents: Efficient algorithms and fundamental limits. Chaudhuri K, Jegelka S, Song L, Szepesvári C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn., vol. 162 (PMLR, Cambridge, MA), 14036–14053.Google Scholar
[35] Liu Q, Yu T, Bai Y, Jin C (2021) A sharp analysis of model-based reinforcement learning with self-play. Meila M, Zhang T, eds. Internat. Conf. Machine Learn., vol. 139 (PMLR, Cambridge, MA), 7001–7010.Google Scholar
[36] Lowe R, Wu Y, Tamar A, Harb J, Abbeel P, Mordatch I (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. Guyon I, Luxburg Uv, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems (Neural Information Processing Systems Foundation, La Jolla, CA), 6379–6390.Google Scholar
[37] Mao W, Başar T (2023) Provably efficient reinforcement learning in decentralized general-sum Markov games. Dyn. Games Appl. 13(1):165–186.Google Scholar
[38] Mao W, Yang L, Zhang K, Basar T (2022) On improving model-free algorithms for decentralized multi-agent reinforcement learning. Chaudhuri K, Jegelka S, Song L, Szepesvári C, Niu G, Sabato S, eds, Internat. Conf. Machine Learn., vol. 162 (PMLR, Cambridge, MA), 15007–15049.Google Scholar
[39] Neu G (2015) Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 28 (Neural Information Processing Systems Foundation, La Jolla, CA), 3168–3176.Google Scholar
[40] Open AI (2018) Open AI five. Accessed October 21, 2023, https://blog.openai.com/openai-five/.Google Scholar
[41] Osband I, Van Roy B (2016) On lower bounds for regret in reinforcement learning. Preprint, submitted August 9, https://arxiv.org/abs/1608.02732.Google Scholar
[42] Osband I, Van Roy B, Wen Z (2014) Generalization and exploration via randomized value functions. Balcan MF, Weinberger KQ, eds. Internat. Conf. Machine Learn., vol. 48 (JMLR), 2377–2386.Google Scholar
[43] Osborne MJ, Rubinstein A (1994) A Course in Game Theory (MIT Press, Cambridge, MA).Google Scholar
[44] Radanovic G, Devidze R, Parkes D, Singla A (2019) Learning to collaborate in Markov decision processes. Chaudhuri K, Salakhutdinov R, eds. Internat. Conf. Machine Learn., vol. 97 (PMLR, Cambridge, MA), 5261–5270.Google Scholar
[45] Rashid T, Samvelyan M, Schroeder C, Farquhar G, Foerster J, Whiteson S (2018) Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. Dy JG, Krause A, eds. Internat. Conf. Machine Learn., vol. 80 (PMLR, Cambridge, MA), 4295–4304.Google Scholar
[46] Sayin MO, Zhang K, Leslie DS, Basar T, Ozdaglar A (2021) Decentralized Q-learning in zero-sum Markov games. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Neural Inform. Processing Systems (Neural Information Processing Systems Foundation, La Jolla, CA), 18320–18334.Google Scholar
[47] Shalev-Shwartz S, Shammah S, Shashua A (2016) Safe, multi-agent, reinforcement learning for autonomous driving. Preprint, submitted October 11, https://arxiv.org/abs/1610.03295.Google Scholar
[48] Shapley LS (1953) Stochastic games. Proc. Natl. Acad. Sci. USA 39(10):1095–1100.Crossref, Google Scholar
[49] Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. Chiappa S, Calandra R, eds. Internat. Conf. Artificial Intelligence Statist., vol. 108 (PMLR, Cambridge, MA), 2992–3002.Google Scholar
[50] Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, et al. (2016) Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489.Crossref, Google Scholar
[51] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, et al. (2017) Mastering the game of go without human knowledge. Nature 550(7676):354–359.Crossref, Google Scholar
[52] Son K, Kim D, Kang WJ, Hostallero DE, Yi Y (2019) Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. Chaudhuri K, Salakhutdinov R, eds. Internat. Conf. Machine Learn., vol. 97 (PMLR, Cambridge, MA), 5887–5896.Google Scholar
[53] Song Z, Mei S, Bai Y (2021) When can we learn general-sum Markov games with a large number of players sample-efficiently? 10th Internat. Conf. Learn. Representations.Google Scholar
[54] Strehl AL, Li L, Wiewiora E, Langford J, Littman ML (2006) PAC model-free reinforcement learning. Cohen WW, Moore AW, eds. Internat. Conf. Machine Learn., vol. 148 (ACM), 881–888.Google Scholar
[55] Sunehag P, Lever G, Gruslys A, Czarnecki WM, Zambaldi V, Jaderberg M, Lanctot M, et al. (2018) Value-decomposition networks for cooperative multi-agent learning. André E, Koenig S, Dastani M, Sukthankar G, eds. Internat. Conf. Autonomous Agents and MultiAgent Systems (ACM), 2085–2087.Google Scholar
[56] Tian Y, Wang Y, Yu T, Sra S (2021) Online learning in unknown Markov games. Meila M, Zhang T, eds. Internat. Conf. Machine Learn., vol. 139 (PMLR, Cambridge, MA), 10279–10288.Google Scholar
[57] Vinyals O, Babuschkin I, Czarnecki WM, Mathieu M, Dudzik A, Chung J, Choi DH, et al. (2019) Grandmaster level in Starcraft II using multi-agent reinforcement learning. Nature 575(7782):350–354.Crossref, Google Scholar
[58] Wei CY (2021) Analysis of UCSG in the finite-horizon setting.Google Scholar
[59] Wei CY, Hong YT, Lu CJ (2017) Online reinforcement learning in stochastic games. Guyon I, Luxburg Uv, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Neural Information Processing Systems Foundation, La Jolla, CA), 4987–4997.Google Scholar
[60] Wei CY, Lee CW, Zhang M, Luo H (2021) Linear last-iterate convergence in constrained saddle-point optimization. Internat. Conf. Learn. Representations.Google Scholar
[61] Wei CY, Lee CW, Zhang M, Luo H (2021) Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. Belkin M, Kpotufe S, eds. Conf. Learn. Theory, vol. 134 (PMLR, Cambridge, MA), 4259–4299.Google Scholar
[62] Xie Q, Chen Y, Wang Z, Yang Z (2023) Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. Math. Oper. Res. 48(1):433–462.Google Scholar
[63] Zanette A, Lazaric A, Kochenderfer M, Brunskill E (2020) Learning near optimal policies with low inherent bellman error. Internat. Conf. Machine Learn., vol. 119 (PMLR, Cambridge, MA), 10978–10989.Google Scholar
[64] Zhan W, Lee JD, Yang Z (2023) Decentralized optimistic hyperpolicy mirror descent: Provably no-regret learning in Markov games. Internat. Conf. Learn. Representations.Google Scholar
[65] Zhang K, Kakade SM, Başar T, Yang LF (2023) Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. J. Mach. Learn. Res. 24:175.Google Scholar
[66] Zhang K, Yang Z, Liu H, Zhang T, Basar T (2018) Fully decentralized multi-agent reinforcement learning with networked agents. Dy JG, Krause A, eds. Internat. Conf. Machine Learn., vol. 80 (PMLR, Cambridge, MA), 5867–5876.Google Scholar
[67] Zinkevich M, Johanson M, Bowling M, Piccione C (2007) Regret minimization in games with incomplete information. Adv. Neural Inform. Processing Systems, vol. 20 (Curran Associates, Inc.), 1729–1736.Google Scholar

cover image Mathematics of Operations Research

Volume 49, Issue 4

November 2024

Pages 2049-2802, C2

Article Information

Metrics

Information

Received:November 17, 2021
Accepted:June 04, 2023
Published Online:November 17, 2023

Cite as

Chi Jin, Qinghua Liu, Yuanhao Wang, Tiancheng Yu (2023) V-Learning—A Simple, Efficient, Decentralized Algorithm for Multiagent Reinforcement Learning. Mathematics of Operations Research 49(4):2295-2322.

https://doi.org/10.1287/moor.2021.0317

Keywords

Acknowledgments

A subset of the results in this paper was presented in the conference version (Bai et al. [4]) at Conference on Neural Information Processing Systems 2020.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

V-Learning—A Simple, Efficient, Decentralized Algorithm for Multiagent Reinforcement Learning

References

Volume 49, Issue 4

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News