V-Learning—A Simple, Efficient, Decentralized Algorithm for Multiagent Reinforcement Learning

Published Online:https://doi.org/10.1287/moor.2021.0317

References

  • [1] Agarwal A, Krishnamurthy A, Langford J, Luo H (2017) Open problem: First-order regret bounds for contextual bandits. Kale S, Shamir O, eds. Conf. Learn. Theory, vol. 65 (PMLR, Cambridge, MA), 4–7.Google Scholar
  • [2] Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR, Cambridge, MA), 263–272.Google Scholar
  • [3] Bai Y, Jin C (2020) Provable self-play algorithms for competitive reinforcement learning. Bai Y, Jin C, eds. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, Cambridge, MA), 551–560.Google Scholar
  • [4] Bai Y, Jin C, Yu T (2020) Near-optimal reinforcement learning with self-play. Larochelle H, Ranzato MA, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Neural Information Processing Systems Foundation, La Jolla, CA), 2159–2170.Google Scholar
  • [5] Baker B, Kanitscheider I, Markov T, Wu Y, Powell G, McGrew B, Mordatch I (2020) Emergent tool use from multi-agent autocurricula. 8th Internat. Conf. Learn. Representations.Google Scholar
  • [6] Blum A, Mansour Y (2007) From external to internal regret. J. Machine Learn. Res. 8(6):1307–1324.Google Scholar
  • [7] Brambilla M, Ferrante E, Birattari M, Dorigo M (2013) Swarm robotics: A review from the swarm engineering perspective. Swarm Intelligence 7(1):1–41.CrossrefGoogle Scholar
  • [8] Brown N, Sandholm T (2018) Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374):418–424.CrossrefGoogle Scholar
  • [9] Brown N, Sandholm T (2019) Superhuman AI for multiplayer poker. Science 365(6456):885–890.CrossrefGoogle Scholar
  • [10] Celli A, Marchesi A, Farina G, Gatti N (2020) No-regret learning dynamics for extensive-form correlated equilibrium. Larochelle H, Ranzato MA, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Neural Information Processing Systems Foundation, La Jolla, CA), 7722–7732.Google Scholar
  • [11] Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [12] Dann C, Lattimore T, Brunskill E (2017) Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Guyon I, Luxburg Uv, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Neural Information Processing Systems Foundation, La Jolla, CA), 5713–5723.Google Scholar
  • [13] Daskalakis C (2013) On the complexity of approximating a Nash equilibrium. ACM Trans. Algorithms 9(3):23.CrossrefGoogle Scholar
  • [14] Daskalakis C, Foster DJ, Golowich N (2021) Independent policy gradient methods for competitive reinforcement learning. Larochelle H, Ranzato MA, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Neural Information Processing Systems Foundation, La Jolla, CA), 5527–5540.Google Scholar
  • [15] Daskalakis C, Golowich N, Zhang K (2022) The complexity of Markov equilibrium in stochastic games. Neu G, Rosasco L, eds. 36th Annual Conf. Theory, vol. 195 (PMLR, Cambridge, MA), 4180–4234.Google Scholar
  • [16] Erez L, Lancewicki T, Sherman U, Koren T, Mansour Y (2022) Regret minimization and convergence to equilibria in general-sum Markov games. Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, Scarlett J, eds. Internat. Conf. Machine Learn., vol. 202 (PMLR, Cambridge, MA), 9343–9373.Google Scholar
  • [17] Gilpin A, Sandholm T (2006) Finding equilibria in large sequential games of imperfect information. Feigenbaum J, Chuang JCI, Pennock DM, eds. Proc. 7th ACM Conf. Electronic Commerce (ACM), 160–169.Google Scholar
  • [18] Hansen TD, Miltersen PB, Zwick U (2013) Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM 60(1):1–16.CrossrefGoogle Scholar
  • [19] Hart S, Mas-Colell A (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5):1127–1150.CrossrefGoogle Scholar
  • [20] Hu J, Wellman MP (2003) Nash Q-learning for general-sum stochastic games. J. Machine Learn. Res. 4(November):1039–1069.Google Scholar
  • [21] Iqbal S, Sha F (2019) Actor-attention-critic for multi-agent reinforcement learning. Chaudhuri K, Salakhutdinov R, eds. Internat. Conf. Machine Learn., vol. 97 (PMLR, Cambridge, MA), 2961–2970.Google Scholar
  • [22] Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(April):1563–1600.Google Scholar
  • [23] Jia Z, Yang LF, Wang M (2019) Feature-based Q-learning for two-player stochastic games. Preprint, submitted June 2, https://arxiv.org/abs/1906.00423.Google Scholar
  • [24] Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2017) Contextual decision processes with low bellman rank are PAC-learnable. Precup D, Teh YW, eds. Internat. Conf. Machine Learn., vol. 70 (PMLR, Cambridge, MA), 1704–1713.Google Scholar
  • [25] Jin C, Liu Q, Miryoosefi S (2021) Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Ranzato MA, Beygelzimer B, Dauphin YN, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Neural Information Processing Systems Foundation, La Jolla, CA), 13406–13418.Google Scholar
  • [26] Jin C, Liu Q, Yu T (2021) The power of exploiter: Provable multi-agent RL in large state spaces. Chaudhuri K, Jegelka S, Song L, Szepesvári C, Niu G, Sabato S. eds. Internat. Conf. Machine Learn., vol. 162 (PMLR, Cambridge, MA), 10251–10279.Google Scholar
  • [27] Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 31 (Neural Information Processing Systems Foundation, La Jolla, CA), 4868–4878.Google Scholar
  • [28] Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Abernethy JD, Agarwal S, eds. Conf. Learn. Theory, vol. 125 (PMLR, Cambridge, MA), 2137–2143.Google Scholar
  • [29] Kao H, Wei CY, Subramanian V (2022) Decentralized cooperative reinforcement learning with hierarchical information structure. Dasgupta S, Haghtalab N, eds. Internat. Conf. Algorithmic Learn. Theory, vol. 167 (PMLR, Cambridge, MA), 573–605.Google Scholar
  • [30] Koller D, Megiddo N (1992) The complexity of two-person zero-sum games in extensive form. Games Econom. Behav. 4(4):528–552.CrossrefGoogle Scholar
  • [31] Lattimore T, Szepesvári C (2018) Bandit algorithms. Working paper, Cambridge University Press, Cambridge.Google Scholar
  • [32] Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. Cohen WW, Hirsh H, eds. Machine Learning Proceedings 1994 (Morgan Kaufmann, Elsevier, Amsterdam), 157–163.CrossrefGoogle Scholar
  • [33] Littman ML (2001) Friend-or-foe Q-learning in general-sum games. Brodley CE, Danyluk AP, eds. Proc. 18th Internat. Conf. Machine Learn. (Morgan Kaufmann, Elsevier, Amsterdam), 322–328.Google Scholar
  • [34] Liu Q, Wang Y, Jin C (2022) Learning Markov games with adversarial opponents: Efficient algorithms and fundamental limits. Chaudhuri K, Jegelka S, Song L, Szepesvári C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn., vol. 162 (PMLR, Cambridge, MA), 14036–14053.Google Scholar
  • [35] Liu Q, Yu T, Bai Y, Jin C (2021) A sharp analysis of model-based reinforcement learning with self-play. Meila M, Zhang T, eds. Internat. Conf. Machine Learn., vol. 139 (PMLR, Cambridge, MA), 7001–7010.Google Scholar
  • [36] Lowe R, Wu Y, Tamar A, Harb J, Abbeel P, Mordatch I (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. Guyon I, Luxburg Uv, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems (Neural Information Processing Systems Foundation, La Jolla, CA), 6379–6390.Google Scholar
  • [37] Mao W, Başar T (2023) Provably efficient reinforcement learning in decentralized general-sum Markov games. Dyn. Games Appl. 13(1):165–186.Google Scholar
  • [38] Mao W, Yang L, Zhang K, Basar T (2022) On improving model-free algorithms for decentralized multi-agent reinforcement learning. Chaudhuri K, Jegelka S, Song L, Szepesvári C, Niu G, Sabato S, eds, Internat. Conf. Machine Learn., vol. 162 (PMLR, Cambridge, MA), 15007–15049.Google Scholar
  • [39] Neu G (2015) Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 28 (Neural Information Processing Systems Foundation, La Jolla, CA), 3168–3176.Google Scholar
  • [40] Open AI (2018) Open AI five. Accessed October 21, 2023, https://blog.openai.com/openai-five/.Google Scholar
  • [41] Osband I, Van Roy B (2016) On lower bounds for regret in reinforcement learning. Preprint, submitted August 9, https://arxiv.org/abs/1608.02732.Google Scholar
  • [42] Osband I, Van Roy B, Wen Z (2014) Generalization and exploration via randomized value functions. Balcan MF, Weinberger KQ, eds. Internat. Conf. Machine Learn., vol. 48 (JMLR), 2377–2386.Google Scholar
  • [43] Osborne MJ, Rubinstein A (1994) A Course in Game Theory (MIT Press, Cambridge, MA).Google Scholar
  • [44] Radanovic G, Devidze R, Parkes D, Singla A (2019) Learning to collaborate in Markov decision processes. Chaudhuri K, Salakhutdinov R, eds. Internat. Conf. Machine Learn., vol. 97 (PMLR, Cambridge, MA), 5261–5270.Google Scholar
  • [45] Rashid T, Samvelyan M, Schroeder C, Farquhar G, Foerster J, Whiteson S (2018) Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. Dy JG, Krause A, eds. Internat. Conf. Machine Learn., vol. 80 (PMLR, Cambridge, MA), 4295–4304.Google Scholar
  • [46] Sayin MO, Zhang K, Leslie DS, Basar T, Ozdaglar A (2021) Decentralized Q-learning in zero-sum Markov games. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Neural Inform. Processing Systems (Neural Information Processing Systems Foundation, La Jolla, CA), 18320–18334.Google Scholar
  • [47] Shalev-Shwartz S, Shammah S, Shashua A (2016) Safe, multi-agent, reinforcement learning for autonomous driving. Preprint, submitted October 11, https://arxiv.org/abs/1610.03295.Google Scholar
  • [48] Shapley LS (1953) Stochastic games. Proc. Natl. Acad. Sci. USA 39(10):1095–1100.CrossrefGoogle Scholar
  • [49] Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. Chiappa S, Calandra R, eds. Internat. Conf. Artificial Intelligence Statist., vol. 108 (PMLR, Cambridge, MA), 2992–3002.Google Scholar
  • [50] Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, et al. (2016) Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489.CrossrefGoogle Scholar
  • [51] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, et al. (2017) Mastering the game of go without human knowledge. Nature 550(7676):354–359.CrossrefGoogle Scholar
  • [52] Son K, Kim D, Kang WJ, Hostallero DE, Yi Y (2019) Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. Chaudhuri K, Salakhutdinov R, eds. Internat. Conf. Machine Learn., vol. 97 (PMLR, Cambridge, MA), 5887–5896.Google Scholar
  • [53] Song Z, Mei S, Bai Y (2021) When can we learn general-sum Markov games with a large number of players sample-efficiently? 10th Internat. Conf. Learn. Representations.Google Scholar
  • [54] Strehl AL, Li L, Wiewiora E, Langford J, Littman ML (2006) PAC model-free reinforcement learning. Cohen WW, Moore AW, eds. Internat. Conf. Machine Learn., vol. 148 (ACM), 881–888.Google Scholar
  • [55] Sunehag P, Lever G, Gruslys A, Czarnecki WM, Zambaldi V, Jaderberg M, Lanctot M, et al. (2018) Value-decomposition networks for cooperative multi-agent learning. André E, Koenig S, Dastani M, Sukthankar G, eds. Internat. Conf. Autonomous Agents and MultiAgent Systems (ACM), 2085–2087.Google Scholar
  • [56] Tian Y, Wang Y, Yu T, Sra S (2021) Online learning in unknown Markov games. Meila M, Zhang T, eds. Internat. Conf. Machine Learn., vol. 139 (PMLR, Cambridge, MA), 10279–10288.Google Scholar
  • [57] Vinyals O, Babuschkin I, Czarnecki WM, Mathieu M, Dudzik A, Chung J, Choi DH, et al. (2019) Grandmaster level in Starcraft II using multi-agent reinforcement learning. Nature 575(7782):350–354.CrossrefGoogle Scholar
  • [58] Wei CY (2021) Analysis of UCSG in the finite-horizon setting.Google Scholar
  • [59] Wei CY, Hong YT, Lu CJ (2017) Online reinforcement learning in stochastic games. Guyon I, Luxburg Uv, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Neural Information Processing Systems Foundation, La Jolla, CA), 4987–4997.Google Scholar
  • [60] Wei CY, Lee CW, Zhang M, Luo H (2021) Linear last-iterate convergence in constrained saddle-point optimization. Internat. Conf. Learn. Representations.Google Scholar
  • [61] Wei CY, Lee CW, Zhang M, Luo H (2021) Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. Belkin M, Kpotufe S, eds. Conf. Learn. Theory, vol. 134 (PMLR, Cambridge, MA), 4259–4299.Google Scholar
  • [62] Xie Q, Chen Y, Wang Z, Yang Z (2023) Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. Math. Oper. Res. 48(1):433–462.Google Scholar
  • [63] Zanette A, Lazaric A, Kochenderfer M, Brunskill E (2020) Learning near optimal policies with low inherent bellman error. Internat. Conf. Machine Learn., vol. 119 (PMLR, Cambridge, MA), 10978–10989.Google Scholar
  • [64] Zhan W, Lee JD, Yang Z (2023) Decentralized optimistic hyperpolicy mirror descent: Provably no-regret learning in Markov games. Internat. Conf. Learn. Representations.Google Scholar
  • [65] Zhang K, Kakade SM, Başar T, Yang LF (2023) Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. J. Mach. Learn. Res. 24:175.Google Scholar
  • [66] Zhang K, Yang Z, Liu H, Zhang T, Basar T (2018) Fully decentralized multi-agent reinforcement learning with networked agents. Dy JG, Krause A, eds. Internat. Conf. Machine Learn., vol. 80 (PMLR, Cambridge, MA), 5867–5876.Google Scholar
  • [67] Zinkevich M, Johanson M, Bowling M, Piccione C (2007) Regret minimization in games with incomplete information. Adv. Neural Inform. Processing Systems, vol. 20 (Curran Associates, Inc.), 1729–1736.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.