Dynamic Programming Principles for Mean-Field Controls with Learning

Published Online: https://doi.org/10.1287/opre.2022.2395

References

  • Aïd R, Basei M, Pham H (2020) A McKean–Vlasov approach to distributed electricity generation development. Math. Methods Oper. Res. 91(2):269–310.
  • Andersson D, Djehiche B (2011) A maximum principle for SDEs of mean-field type. Appl. Math. Optim. 63(3):341–356.
  • Bellman R (1957) A Markovian decision process. J. Math. Mechanics 6(5):679–684.
  • Bensoussan A, Frehse J, Yam P (2013) Mean Field Games and Mean Field Type Control Theory, SpringerBriefs in Mathematics, vol. 101 (Springer, New York).
  • Bertsekas DP, Shreve SE (1978) Stochastic Optimal Control: The Discrete-Time Case, Mathematics in Science and Engineering, vol. 139 (Academic Press, New York).
  • Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).
  • Buckdahn R, Djehiche B, Li J (2011) A general stochastic maximum principle for SDEs of mean-field type. Appl. Math. Optim. 64(2):197–216.
  • Carmona R, Delarue F (2015) Forward–backward stochastic differential equations and controlled McKean–Vlasov dynamics. Ann. Probab. 43(5):2647–2700.
  • Carmona R, Delarue F (2018a) Probabilistic Theory of Mean Field Games with Applications I, Probability Theory and Stochastic Modelling, vol. 83 (Springer, Cham, Switzerland).
  • Carmona R, Delarue F (2018b) Probabilistic Theory of Mean Field Games with Applications II, Probability Theory and Stochastic Modelling, vol. 84 (Springer, Cham, Switzerland).
  • Carmona R, Laurière M, Tan Z (2019a) Linear-quadratic mean-field reinforcement learning: Convergence of policy gradient methods. Preprint, submitted October 9, https://arxiv.org/abs/1910.04295.
  • Carmona R, Laurière M, Tan Z (2019b) Model-free mean-field reinforcement learning: Mean-field MDP and mean-field Q-learning. Preprint, submitted October 28, https://arxiv.org/abs/1910.12802.
  • Dearden R, Friedman N, Russell S (1998) Bayesian Q-learning. Proc. 15th Natl./10th Conf. Artificial Intelligence/Innovative Appl. Artificial Intelligence (AAAI Press, Palo Alto, CA), 761–768.
  • Djete MF, Possamaï D, Tan X (2019) McKean-Vlasov optimal control: The dynamic programming principle. Preprint, submitted July 20, https://arxiv.org/abs/1907.08860.
  • Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput. 12(1):219–245.
  • Doya K, Samejima K, Katagiri K-i, Kawato M (2002) Multiple model-based reinforcement learning. Neural Comput. 14(6):1347–1369.
  • El-Tantawy S, Abdulhai B, Abdelgawad H (2013) Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): Methodology and large-scale application on downtown Toronto. IEEE Trans. Intelligent Transportation Systems 14(3):1140–1150.
  • Even-Dar E, Mansour Y, Bartlett P (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5(1):1–25.
  • Fleming WH, Soner HM (2006) Controlled Markov Processes and Viscosity Solutions, Stochastic Modelling and Applied Probability, vol. 25 (Springer Science & Business Media, New York).
  • Garnier J, Papanicolaou G, Yang TW (2013) Large deviations for a mean field model of systemic risk. SIAM J. Financial Math. 4(1):151–184.
  • Gibbs AL, Su FE (2002) On choosing and bounding probability metrics. Internat. Statist. Rev. 70(3):419–435.
  • Gu H, Guo X, Wei X, Xu R (2021) Mean-field controls with Q-learning for cooperative MARL: Convergence and complexity analysis. SIAM J. Math. Data Sci. 3(4):1168–1196.
  • Guo X, Hu A, Xu R, Zhang J (2019) Learning mean-field games. Adv. Neural Inform. Processing Systems 32:4966–4976.
  • Huang M, Malhamé RP, Caines PE (2006) Large population stochastic dynamic games: Closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Commun. Inform. Systems 6(3):221–252.
  • Jin J, Song C, Li H, Gai K, Wang J, Zhang W (2018) Real-time bidding with multi-agent reinforcement learning in display advertising. Proc. 27th ACM Internat. Conf. Inform. Knowledge Management (Association for Computing Machinery, New York), 2193–2201.
  • Konda VR, Tsitsiklis JN (2000) Actor-critic algorithms. Adv. Neural Inform. Processing Systems 12:1008–1014.
  • Lacker D (2015) Mean field games via controlled martingale problems: Existence of Markovian equilibria. Stochastic Processes Appl. 125(7):2856–2894.
  • Lacker D (2017) Limit theory for controlled McKean–Vlasov dynamics. SIAM J. Control Optim. 55(3):1641–1672.
  • Lasry JM, Lions PL (2007) Mean field games. Jpn. J. Math. 2(1):229–260.
  • Laurière M, Pironneau O (2014) Dynamic programming for mean-field type control. Comptes Rendus Math. Acad. Sci. Paris 352(9):707–713.
  • Li M, Qin Z, Jiao Y, Yang Y, Wang J, Wang C, Wu G, Ye J (2019) Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. WWW’19 World Wide Web Conference (Association for Computing Machinery, New York), 983–994.
  • Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. Preprint, submitted September 9, https://arxiv.org/abs/1509.02971.
  • Mannor S, Tsitsiklis JN (2013) Algorithmic aspects of mean–variance optimization in Markov decision processes. Eur. J. Oper. Res. 231(3):645–653.
  • McKean H (1969) Propagation of chaos for a class of non-linear parabolic equations. Stochastic Differential Equations, Lecture Series in Differential Equations, vol. 7 (Catholic University, Washington, DC), 41–57.
  • Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, et al. (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
  • Motte M, Pham H (2019) Mean-field Markov decision processes with common noise and open-loop controls. Preprint, submitted December 17, https://arxiv.org/abs/1912.07883.
  • Nuño G (2017) Optimal social policies in mean field games. Appl. Math. Optim. 76(1):29–57.
  • Pham H, Wei X (2016) Discrete time McKean–Vlasov control problem: A dynamic programming approach. Appl. Math. Optim. 74(3):487–506.
  • Shalev-Shwartz S, Shammah S, Shashua A (2016) Safe, multi-agent, reinforcement learning for autonomous driving. Preprint, submitted October 11, https://arxiv.org/abs/1610.03295.
  • Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).
  • Villani C (2009) Optimal Transport: Old and New, Grundlehren der Mathematischen Wissenschaften, vol. 338 (Springer, Berlin).
  • Vinyals O, Babuschkin I, Chung J, Mathieu M, Jaderberg M, Czarnecki WM, Dudzik A, et al. (2019) AlphaStar: Mastering the real-time strategy game StarCraft II. Accessed June 15, 2019, https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii.
  • Wang L, Yang Z, Wang Z (2020) Breaking the curse of many agents: Provable mean embedding Q-iteration for mean-field reinforcement learning. Daumé H, Singh A, eds. ICML’20 Proc. 37th Internat. Conf. Machine Learn. (JMLR.org), 10092–10103.
  • Watkins CJ (1989) Learning from delayed rewards. Unpublished PhD thesis, King’s College, Cambridge, UK.
  • Watkins CJ, Dayan P (1992) Q-learning. Machine Learn. 8(3-4):279–292.
  • Wunder M, Littman ML, Babes M (2010) Classes of multiagent Q-learning dynamics with epsilon-greedy exploration. Fürnkranz J, Joachims T, eds. Proc. 27th Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 1167–1174.