Mean-Field Multiagent Reinforcement Learning: A Decentralized Network Approach

Published Online: https://doi.org/10.1287/moor.2022.0055
