V-Learning—A Simple, Efficient, Decentralized Algorithm for Multiagent Reinforcement Learning
References
- [1] (2017) Open problem: First-order regret bounds for contextual bandits. Kale S, Shamir O, eds. Conf. Learn. Theory, vol. 65 (PMLR, Cambridge, MA), 4–7.Google Scholar
- [2] (2017) Minimax regret bounds for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR, Cambridge, MA), 263–272.Google Scholar
- [3] (2020) Provable self-play algorithms for competitive reinforcement learning. Bai Y, Jin C, eds. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, Cambridge, MA), 551–560.Google Scholar
- [4] (2020) Near-optimal reinforcement learning with self-play. Larochelle H, Ranzato MA, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Neural Information Processing Systems Foundation, La Jolla, CA), 2159–2170.Google Scholar
- [5] (2020) Emergent tool use from multi-agent autocurricula. 8th Internat. Conf. Learn. Representations.Google Scholar
- [6] (2007) From external to internal regret. J. Machine Learn. Res. 8(6):1307–1324.Google Scholar
- [7] (2013) Swarm robotics: A review from the swarm engineering perspective. Swarm Intelligence 7(1):1–41.Crossref, Google Scholar
- [8] (2018) Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374):418–424.Crossref, Google Scholar
- [9] (2019) Superhuman AI for multiplayer poker. Science 365(6456):885–890.Crossref, Google Scholar
- [10] (2020) No-regret learning dynamics for extensive-form correlated equilibrium. Larochelle H, Ranzato MA, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Neural Information Processing Systems Foundation, La Jolla, CA), 7722–7732.Google Scholar
- [11] (2006) Prediction, Learning, and Games (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- [12] (2017) Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Guyon I, Luxburg Uv, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Neural Information Processing Systems Foundation, La Jolla, CA), 5713–5723.Google Scholar
- [13] (2013) On the complexity of approximating a Nash equilibrium. ACM Trans. Algorithms 9(3):23.Crossref, Google Scholar
- [14] (2021) Independent policy gradient methods for competitive reinforcement learning. Larochelle H, Ranzato MA, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Neural Information Processing Systems Foundation, La Jolla, CA), 5527–5540.Google Scholar
- [15] (2022) The complexity of Markov equilibrium in stochastic games. Neu G, Rosasco L, eds. 36th Annual Conf. Theory, vol. 195 (PMLR, Cambridge, MA), 4180–4234.Google Scholar
- [16] (2022) Regret minimization and convergence to equilibria in general-sum Markov games. Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, Scarlett J, eds. Internat. Conf. Machine Learn., vol. 202 (PMLR, Cambridge, MA), 9343–9373.Google Scholar
- [17] (2006) Finding equilibria in large sequential games of imperfect information. Feigenbaum J, Chuang JCI, Pennock DM, eds. Proc. 7th ACM Conf. Electronic Commerce (ACM), 160–169.Google Scholar
- [18] (2013) Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM 60(1):1–16.Crossref, Google Scholar
- [19] (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5):1127–1150.Crossref, Google Scholar
- [20] (2003) Nash Q-learning for general-sum stochastic games. J. Machine Learn. Res. 4(November):1039–1069.Google Scholar
- [21] (2019) Actor-attention-critic for multi-agent reinforcement learning. Chaudhuri K, Salakhutdinov R, eds. Internat. Conf. Machine Learn., vol. 97 (PMLR, Cambridge, MA), 2961–2970.Google Scholar
- [22] (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(April):1563–1600.Google Scholar
- [23] (2019) Feature-based Q-learning for two-player stochastic games. Preprint, submitted June 2, https://arxiv.org/abs/1906.00423.Google Scholar
- [24] (2017) Contextual decision processes with low bellman rank are PAC-learnable. Precup D, Teh YW, eds. Internat. Conf. Machine Learn., vol. 70 (PMLR, Cambridge, MA), 1704–1713.Google Scholar
- [25] (2021) Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Ranzato MA, Beygelzimer B, Dauphin YN, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Neural Information Processing Systems Foundation, La Jolla, CA), 13406–13418.Google Scholar
- [26] (2021) The power of exploiter: Provable multi-agent RL in large state spaces. Chaudhuri K, Jegelka S, Song L, Szepesvári C, Niu G, Sabato S. eds. Internat. Conf. Machine Learn., vol. 162 (PMLR, Cambridge, MA), 10251–10279.Google Scholar
- [27] (2018) Is Q-learning provably efficient? Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 31 (Neural Information Processing Systems Foundation, La Jolla, CA), 4868–4878.Google Scholar
- [28] (2020) Provably efficient reinforcement learning with linear function approximation. Abernethy JD, Agarwal S, eds. Conf. Learn. Theory, vol. 125 (PMLR, Cambridge, MA), 2137–2143.Google Scholar
- [29] (2022) Decentralized cooperative reinforcement learning with hierarchical information structure. Dasgupta S, Haghtalab N, eds. Internat. Conf. Algorithmic Learn. Theory, vol. 167 (PMLR, Cambridge, MA), 573–605.Google Scholar
- [30] (1992) The complexity of two-person zero-sum games in extensive form. Games Econom. Behav. 4(4):528–552.Crossref, Google Scholar
- [31] (2018) Bandit algorithms. Working paper, Cambridge University Press, Cambridge.Google Scholar
- [32] (1994) Markov games as a framework for multi-agent reinforcement learning. Cohen WW, Hirsh H, eds. Machine Learning Proceedings 1994 (Morgan Kaufmann, Elsevier, Amsterdam), 157–163.Crossref, Google Scholar
- [33] (2001) Friend-or-foe Q-learning in general-sum games. Brodley CE, Danyluk AP, eds. Proc. 18th Internat. Conf. Machine Learn. (Morgan Kaufmann, Elsevier, Amsterdam), 322–328.Google Scholar
- [34] (2022) Learning Markov games with adversarial opponents: Efficient algorithms and fundamental limits. Chaudhuri K, Jegelka S, Song L, Szepesvári C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn., vol. 162 (PMLR, Cambridge, MA), 14036–14053.Google Scholar
- [35] (2021) A sharp analysis of model-based reinforcement learning with self-play. Meila M, Zhang T, eds. Internat. Conf. Machine Learn., vol. 139 (PMLR, Cambridge, MA), 7001–7010.Google Scholar
- [36] (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. Guyon I, Luxburg Uv, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems (Neural Information Processing Systems Foundation, La Jolla, CA), 6379–6390.Google Scholar
- [37] (2023) Provably efficient reinforcement learning in decentralized general-sum Markov games. Dyn. Games Appl. 13(1):165–186.Google Scholar
- [38] (2022) On improving model-free algorithms for decentralized multi-agent reinforcement learning. Chaudhuri K, Jegelka S, Song L, Szepesvári C, Niu G, Sabato S, eds, Internat. Conf. Machine Learn., vol. 162 (PMLR, Cambridge, MA), 15007–15049.Google Scholar
- [39] (2015) Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 28 (Neural Information Processing Systems Foundation, La Jolla, CA), 3168–3176.Google Scholar
- [40] (2018) Open AI five. Accessed October 21, 2023, https://blog.openai.com/openai-five/.Google Scholar
- [41] (2016) On lower bounds for regret in reinforcement learning. Preprint, submitted August 9, https://arxiv.org/abs/1608.02732.Google Scholar
- [42] (2014) Generalization and exploration via randomized value functions. Balcan MF, Weinberger KQ, eds. Internat. Conf. Machine Learn., vol. 48 (JMLR), 2377–2386.Google Scholar
- [43] (1994) A Course in Game Theory (MIT Press, Cambridge, MA).Google Scholar
- [44] (2019) Learning to collaborate in Markov decision processes. Chaudhuri K, Salakhutdinov R, eds. Internat. Conf. Machine Learn., vol. 97 (PMLR, Cambridge, MA), 5261–5270.Google Scholar
- [45] (2018) Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. Dy JG, Krause A, eds. Internat. Conf. Machine Learn., vol. 80 (PMLR, Cambridge, MA), 4295–4304.Google Scholar
- [46] (2021) Decentralized Q-learning in zero-sum Markov games. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Neural Inform. Processing Systems (Neural Information Processing Systems Foundation, La Jolla, CA), 18320–18334.Google Scholar
- [47] (2016) Safe, multi-agent, reinforcement learning for autonomous driving. Preprint, submitted October 11, https://arxiv.org/abs/1610.03295.Google Scholar
- [48] (1953) Stochastic games. Proc. Natl. Acad. Sci. USA 39(10):1095–1100.Crossref, Google Scholar
- [49] (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. Chiappa S, Calandra R, eds. Internat. Conf. Artificial Intelligence Statist., vol. 108 (PMLR, Cambridge, MA), 2992–3002.Google Scholar
- [50] (2016) Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489.Crossref, Google Scholar
- [51] (2017) Mastering the game of go without human knowledge. Nature 550(7676):354–359.Crossref, Google Scholar
- [52] (2019) Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. Chaudhuri K, Salakhutdinov R, eds. Internat. Conf. Machine Learn., vol. 97 (PMLR, Cambridge, MA), 5887–5896.Google Scholar
- [53] (2021) When can we learn general-sum Markov games with a large number of players sample-efficiently? 10th Internat. Conf. Learn. Representations.Google Scholar
- [54] (2006) PAC model-free reinforcement learning. Cohen WW, Moore AW, eds. Internat. Conf. Machine Learn., vol. 148 (ACM), 881–888.Google Scholar
- [55] (2018) Value-decomposition networks for cooperative multi-agent learning. André E, Koenig S, Dastani M, Sukthankar G, eds. Internat. Conf. Autonomous Agents and MultiAgent Systems (ACM), 2085–2087.Google Scholar
- [56] (2021) Online learning in unknown Markov games. Meila M, Zhang T, eds. Internat. Conf. Machine Learn., vol. 139 (PMLR, Cambridge, MA), 10279–10288.Google Scholar
- [57] (2019) Grandmaster level in Starcraft II using multi-agent reinforcement learning. Nature 575(7782):350–354.Crossref, Google Scholar
- [58] (2021) Analysis of UCSG in the finite-horizon setting.Google Scholar
- [59] (2017) Online reinforcement learning in stochastic games. Guyon I, Luxburg Uv, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Neural Information Processing Systems Foundation, La Jolla, CA), 4987–4997.Google Scholar
- [60] (2021) Linear last-iterate convergence in constrained saddle-point optimization. Internat. Conf. Learn. Representations.Google Scholar
- [61] (2021) Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. Belkin M, Kpotufe S, eds. Conf. Learn. Theory, vol. 134 (PMLR, Cambridge, MA), 4259–4299.Google Scholar
- [62] (2023) Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. Math. Oper. Res. 48(1):433–462.Google Scholar
- [63] (2020) Learning near optimal policies with low inherent bellman error. Internat. Conf. Machine Learn., vol. 119 (PMLR, Cambridge, MA), 10978–10989.Google Scholar
- [64] (2023) Decentralized optimistic hyperpolicy mirror descent: Provably no-regret learning in Markov games. Internat. Conf. Learn. Representations.Google Scholar
- [65] (2023) Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. J. Mach. Learn. Res. 24:175.Google Scholar
- [66] (2018) Fully decentralized multi-agent reinforcement learning with networked agents. Dy JG, Krause A, eds. Internat. Conf. Machine Learn., vol. 80 (PMLR, Cambridge, MA), 5867–5876.Google Scholar
- [67] (2007) Regret minimization in games with incomplete information. Adv. Neural Inform. Processing Systems, vol. 20 (Curran Associates, Inc.), 1729–1736.Google Scholar

