Mean-Field Multiagent Reinforcement Learning: A Decentralized Network Approach

Haotian Gu
https://orcid.org/0000-0002-0268-7147
Department of Mathematics, University of California, Berkeley, Berkeley, California 94720

Xin Guo (Corresponding Author)
https://orcid.org/0000-0002-3350-4606
Department of Industrial Engineering & Operations Research, University of California, Berkeley, Berkeley, California 94720

Xiaoli Wei
https://orcid.org/0000-0002-4787-2856
Tsinghua Shenzhen International Graduate School, Shenzhen 518071, China

Renyuan Xu
https://orcid.org/0000-0003-4293-3450
Industrial & Systems Engineering, University of Southern California, Los Angeles, California 90089

Published Online: 13 Mar 2024
https://doi.org/10.1287/moor.2022.0055

Mathematics of Operations Research

Volume 50, Issue 1

February 2025

Pages 1-781 C2

Article Information

Received: February 15, 2022
Accepted: January 01, 2024
Published Online: March 13, 2024

Cite as

Haotian Gu, Xin Guo, Xiaoli Wei, Renyuan Xu (2024) Mean-Field Multiagent Reinforcement Learning: A Decentralized Network Approach. Mathematics of Operations Research 50(1):506-536.

https://doi.org/10.1287/moor.2022.0055

Acknowledgment

The authors express their gratitude to the area editor, the associate editor, and three anonymous reviewers for their insightful comments, which significantly contributed to the improvement of our paper.