Analyzing Approximate Value Iteration Algorithms

Published Online: https://doi.org/10.1287/moor.2021.1202

References

  • [1] Abounadi J, Bertsekas DP, Borkar V (2002) Stochastic approximation for nonexpansive maps: Application to Q-learning algorithms. SIAM J. Control Optim. 41(1):1–22.
  • [2] Aubin J, Cellina A (1984) Differential Inclusions: Set-Valued Maps and Viability Theory (Springer, Berlin).
  • [3] Benaïm M (1996) A dynamical system approach to stochastic approximations. SIAM J. Control Optim. 34(2):437–472.
  • [4] Benaïm M, Hirsch MW (1996) Asymptotic pseudotrajectories and chain recurrent flows, with applications. J. Dynam. Differential Equations 8:141–176.
  • [5] Benaïm M, Hofbauer J, Sorin S (2005) Stochastic approximations and differential inclusions. SIAM J. Control Optim. 44(1):328–348.
  • [6] Bertsekas DP (2013) Abstract Dynamic Programming (Athena Scientific, Belmont, MA).
  • [7] Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming, 1st ed. (Athena Scientific, Belmont, MA).
  • [8] Billingsley P (2013) Convergence of Probability Measures (John Wiley & Sons, Hoboken, NJ).
  • [9] Borkar VS (1997) Stochastic approximation with two time scales. Syst. Control Lett. 29(5):291–294.
  • [10] Borkar VS (2008) Stochastic Approximation: A Dynamical Systems Viewpoint (Cambridge University Press, Cambridge, UK).
  • [11] Borkar VS, Meyn SP (1999) The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2):447–469.
  • [12] De Farias DP, Van Roy B (2000) On the existence of fixed points for approximate value iteration and temporal-difference learning. J. Optim. Theory Appl. 105(3):589–608.
  • [13] Fan J, Wang Z, Xie Y, Yang Z (2020) A theoretical analysis of deep Q-learning. Proc. Second Conf. Learning Dynam. Control. Proc. Machine Learn. Res., vol. 120 (PMLR), 486–489.
  • [14] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, et al. (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
  • [15] Munos R (2005) Error bounds for approximate value iteration. Cohn A, ed. Proc. 20th Natl. Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 1006–1011.
  • [16] Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learning Res. 9(27):815–857.
  • [17] Nadler S (1969) Multi-valued contraction mappings. Pacific J. Math. 30(2):475–488.
  • [18] Ramaswamy A, Bhatnagar S (2017) A generalization of the Borkar-Meyn theorem for stochastic recursive inclusions. Math. Oper. Res. 42(3):648–661.
  • [19] Robbins H, Monro S (1951) A stochastic approximation method. Ann. Math. Statist. 22(3):400–407.
  • [20] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, et al. (2017) Mastering the game of Go without human knowledge. Nature 550(7676):354–359.