Dynamic Programming Principles for Mean-Field Controls with Learning

Haotian Gu
https://orcid.org/0000-0002-0268-7147
Department of Mathematics, University of California, Berkeley, California 94720

Xin Guo (Corresponding Author)
https://orcid.org/0000-0002-3350-4606
Department of Industrial Engineering and Operations Research, University of California, Berkeley, California 94720

Xiaoli Wei
https://orcid.org/0000-0002-4787-2856
Tsinghua-Berkeley Shenzhen Institute, Shenzhen 518055, China

Renyuan Xu
https://orcid.org/0000-0003-4293-3450
Industrial and Systems Engineering, University of Southern California, Los Angeles, California 90001

Published Online: 12 Jan 2023
https://doi.org/10.1287/opre.2022.2395

Volume 71, Issue 4

July-August 2023

Pages iii-vi, 1021-1439, C2-C3

Article Information

Received: November 14, 2020
Accepted: August 24, 2022
Published Online: January 12, 2023

Cite as

Haotian Gu, Xin Guo, Xiaoli Wei, Renyuan Xu (2023) Dynamic Programming Principles for Mean-Field Controls with Learning. Operations Research 71(4):1040-1054.

https://doi.org/10.1287/opre.2022.2395

Acknowledgments

The authors thank the area editor, associate editor, and two anonymous referees whose comments helped them significantly strengthen both the theoretical and computational results.
