Global Optimality Guarantees for Policy Gradient Methods

Published Online:https://doi.org/10.1287/opre.2021.0014

References

  • Agarwal A, Kakade SM, Lee JD, Mahajan G (2020) Optimality and approximation with policy gradient methods in Markov decision processes. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR, New York), 64–66.Google Scholar
  • Agarwal N, Allen-Zhu Z, Bullins B, Hazan E, Ma T (2017) Finding approximate local minima faster than gradient descent. Hatami PM, King V, eds. Proc. 49th Annual ACM SIGACT Sympos. on Theory of Comput. (ACM, New York), 1195–1199.Google Scholar
  • Antos A, Szepesvári C, Munos R (2008) Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learn. 71(1):89–129.CrossrefGoogle Scholar
  • Asmussen S, Glynn PW (2007) Stochastic Simulation: Algorithms and Analysis, vol. 57 (Springer Science & Business Media, Berlin).CrossrefGoogle Scholar
  • Bagnell A, Kakade SM, Schneider JG, Ng AY (2004) Policy search by dynamic programming, vol. 16. Thrun S, Saul L, Schölkopf B, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 831–838.Google Scholar
  • Baxter J, Bartlett PL (2001) Infinite-horizon policy-gradient estimation. J. Artificial Intelligence Res. 15(1):319–350.CrossrefGoogle Scholar
  • Beck A (2002) Convergence rate analysis of gradient based algorithms. PhD thesis, Tel-Aviv University, Tel Aviv, Israel.Google Scholar
  • Beck A (2017) First-Order Methods in Optimization, vol. 25 (SIAM, Philadelphia).CrossrefGoogle Scholar
  • Beck A, Teboulle M (2003) Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3):167–175.CrossrefGoogle Scholar
  • Bertsekas DP (1995) Dynamic Programming and Optimal Control (Athena Scientific, Belmont, MA).Google Scholar
  • Bertsekas DP (1997) Nonlinear programming. J. Oper. Res. Soc. 48(3):334–334.CrossrefGoogle Scholar
  • Bertsekas DP (2011) Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 9(3):310–335.CrossrefGoogle Scholar
  • Bertsekas DP (2019) Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA J. Automated Sinica 6(1):1–31.CrossrefGoogle Scholar
  • Bertsekas DP, Shreve S (1978) Stochastic Optimal Control: The Discrete-Time Case (Athena Scientific, Belmont, MA).Google Scholar
  • Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming, vol. 5 (Athena Scientific, Belmont, MA).Google Scholar
  • Bhandari J, Russo D (2021) Global optimiality guarantees for policy gradient methods: Technical report with supplementary materials. Accessed January 12, 2022, https://djrusso.github.io/docs/policy_grad_optimality_EC.pdf.Google Scholar
  • Bhojanapalli S, Neyshabur B, Srebro N (2016) Global optimality of local search for low rank matrix recovery. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 3873–3881.Google Scholar
  • Blackwell D (1965) Discounted dynamic programming. Ann. Math. Statist. 36(1):226–235.CrossrefGoogle Scholar
  • Borkar VS (2009) Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48 (Springer, Berlin).Google Scholar
  • Boyd S, Vandenberghe L (2004) Convex Optimization (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Caramanis M, Liberopoulos G (1992) Perturbation analysis for the design of flexible manufacturing system flow controllers. Oper. Res. 40(6):1107–1125.LinkGoogle Scholar
  • Carmon Y, Duchi JC, Hinder O, Sidford A (2018) Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2):1751–1772.CrossrefGoogle Scholar
  • Chen J, Jiang N (2019) Information-theoretic considerations in batch reinforcement learning. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 1042–1051.Google Scholar
  • Davis D, Drusvyatskiy D, Kakade S, Lee JD (2020) Stochastic subgradient method converges on tame functions. Foundation Comput. Math. 20(1):119–154.CrossrefGoogle Scholar
  • Defazio A, Bach F, Lacoste-Julien S (2014) SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 1646–1654.Google Scholar
  • Du SS, Lee JD (2018) On the power of over-parametrization in neural networks with quadratic activation. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR, New York), 1329–1338.Google Scholar
  • Du SS, Kakade SM, Wang R, Yang LF (2020) Is a good representation sufficient for sample efficient reinforcement learning? Bengio Y, LeCun Y, eds. Proc. 8th Internat. Conf. Learn. Representations (Curran Associates, Red Hook, NY), 11427–11446.Google Scholar
  • Evans LC (2005) An Introduction to Mathematical Optimal Control Theory. Lecture Notes (University of California, Department of Mathematics, Berkeley, CA).Google Scholar
  • Farahmand Am, Szepesvári C, Munos R (2010) Error propagation for approximate policy and value iteration. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 568–576.Google Scholar
  • Fazel M, Ge R, Kakade S, Mesbahi M (2018) Global convergence of policy gradient methods for the linear quadratic regulator. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR, New York), 1467–1476.Google Scholar
  • Ferns N, Panangaden P, Precup D (2004) Metrics for finite Markov decision processes. Meek C, Chickering M, Halpern J, eds. Proc. 20th Conf. Uncertainty in Artificial Intelligence (AUAI Press, Charlottesville, VA), 162–169.Google Scholar
  • Fu MC (2006) Gradient estimation. Handbook Oper. Res. Management Sci. 13:575–616.Google Scholar
  • Ge R, Lee JD, Ma T (2016) Matrix completion has no spurious local minimum. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 2973–2981.Google Scholar
  • Ge R, Huang F, Jin C, Yuan Y (2015) Escaping from saddle points-online stochastic gradient for tensor decomposition. Grünwald P, Hazan E, Kale S, eds. Proc. 28th Conf. Learn. Theory, vol. 40 (PMLR, New York), 797–842.Google Scholar
  • Geist M, Piot B, Pietquin O (2017) Is the Bellman residual a bad proxy? Guyon I, Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 3208–3217.Google Scholar
  • Geist M, Scherrer B, Pietquin O (2019) A theory of regularized Markov decision processes. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 2160–2169.Google Scholar
  • Ghadimi S, Lan G (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4):2341–2368.CrossrefGoogle Scholar
  • Ghadimi S, Lan G (2016) Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Programming 156(1):59–99.CrossrefGoogle Scholar
  • Glasserman P, Ho YC (1991) Gradient Estimation via Perturbation Analysis, vol. 116 (Springer Science & Business Media, Berlin).Google Scholar
  • Glasserman P, Tayur S (1995) Sensitivity analysis for base-stock levels in multiechelon production-inventory systems. Management Sci. 41(2):263–281.LinkGoogle Scholar
  • Hernández-Lerma O, Lasserre JB (2012) Discrete-Time Markov Control Processes: Basic Optimality Criteria, vol. 30 (Springer Science & Business Media, Berlin).Google Scholar
  • Hewer G (1971) An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Trans. Automated Control 16(4):382–384.CrossrefGoogle Scholar
  • Huh WT, Rusmevichientong P (2013) Online sequential optimization with biased gradients: Theory and applications to censored demand. INFORMS J. Comput. 26(1):150–159.LinkGoogle Scholar
  • Jacot A, Gabriel F, Hongler C (2018) Neural tangent kernel: Convergence and generalization in neural networks. Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 8571–8580.Google Scholar
  • Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR, New York), 2137–2143.Google Scholar
  • Jin C, Ge R, Netrapalli P, Kakade SM, Jordan MI (2017) How to escape saddle points efficiently. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR, New York), 1724–1732.Google Scholar
  • Kakade SM (2002) A natural policy gradient. Dietterich TG, Becker S, Ghahramani Z, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 1531–1538.Google Scholar
  • Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. Sammut C, Hoffmann AG, eds. Proc. 19th Internat. Conf. Machine Learn., vol. 2 (Morgan Kaufmann Publishers, Burlington, MA), 267–274.Google Scholar
  • Kallus N, Uehara M (2020a) Doubly robust off-policy value and gradient estimation for deterministic policies. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 10420–10430.Google Scholar
  • Kallus N, Uehara M (2020b) Statistically efficient off-policy policy gradients. III HD, Singh A, eds., Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, New York), 5089–5100.Google Scholar
  • Karimi H, Nutini J, Schmidt M (2016) Linear convergence of gradient and proximal-gradient methods under the Polyak-łojasiewicz condition. Proc. Joint Eur. Conf. Machine Learn. and Knowledge Discovery in Databases (Springer, Berlin), 795–811.Google Scholar
  • Kawaguchi K (2016) Deep learning without poor local minima. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 586–594.Google Scholar
  • Kleinman D (1968) On an iterative technique for Riccati equation computations. IEEE Trans. Automated Control 13(1):114–115.CrossrefGoogle Scholar
  • Kunnumkal S, Topaloglu H (2008) Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Oper. Res. 56(3):646–664.LinkGoogle Scholar
  • L’Ecuyer P, Glynn PW (1994) Stochastic optimization by simulation: Convergence proofs for the gi/g/1 queue in steady-state. Management Sci. 40(11):1562–1578.LinkGoogle Scholar
  • Lee JD, Simchowitz M, Jordan MI, Recht B (2016) Gradient descent only converges to minimizers. Feldman V, Rakhlin A, Shamir O, eds. Proc. 29th Annual Conf. Learn. Theory, vol. 49 (PMLR, New York), 1246–1257.Google Scholar
  • Livni R, Shalev-Shwartz S, Shamir O (2014) On the computational efficiency of training neural networks. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 855–863.Google Scholar
  • Marbach P, Tsitsiklis JN (2001) Simulation-based optimization of Markov reward processes. IEEE Trans. Automated Control 46(2):191–209.CrossrefGoogle Scholar
  • Maxwell MS, Henderson SG, Topaloglu H (2013) Tuning approximate dynamic programming policies for ambulance redeployment via direct search. Stochastic Systems 3(2):322–361.LinkGoogle Scholar
  • Mohamed S, Rosca M, Figurnov M, Mnih A (2020) Monte Carlo gradient estimation in machine learning. J. Machine Learn. Res. 21(132):1–62.Google Scholar
  • Munos R (2003) Error bounds for approximate policy iteration. Fawcett T, Mishra N, eds. Proc. 20th Internat. Conf. Machine Learn. (AAAI Press, Palo Alto, CA), 560–567.Google Scholar
  • Munos R (2005) Error bounds for approximate value iteration. Proc. National Conf. Artificial Intelligence, vol. 20 (AAAI Press, Palo Alto, CA), 1006.Google Scholar
  • Munos R (2007) Performance bounds in L_p-norm for approximate value iteration. SIAM J. Control Optim. 46(2):541–561.CrossrefGoogle Scholar
  • Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(27):815–857.Google Scholar
  • Nesterov Y, Polyak BT (2006) Cubic regularization of Newton method and its global performance. Math. Programming 108(1):177–205.CrossrefGoogle Scholar
  • Osband I, Van Roy B, Russo DJ, Wen Z (2019) Deep exploration via randomized value functions. J. Maching Learn. Res. 20(124):1–62.Google Scholar
  • Peters J, Schaal S (2006) Policy gradient methods for robotics. Proc. IEEE/RSJ Internat. Conf. Intelligent Robots and Systems (IEEE, Piscataway, NJ), 2219–2225.Google Scholar
  • Pflug GC (1988) Derivatives of probability measures-concepts and applications to the optimization of stochastic systems. Varaiya P, Kurzhanski A, eds. Discrete Event Systems: Models and Applications. Lecture Notes in Control and Information Sciences, vol. 103 (Springer, Berlin), 252–274.Google Scholar
  • Pflug GC (1990) On-line optimization of simulated Markovian processes. Math. Oper. Res. 15(3):381–395.LinkGoogle Scholar
  • Polyak BT (1963) Gradient methods for the minimisation of functionals. USSR Comput. Math. Math. Physics 3(4):864–878.CrossrefGoogle Scholar
  • Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • Rajeswaran A, Lowrey K, Todorov EV, Kakade SM (2017) Toward generalization and simplicity in continuous control. Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 6550–6561.Google Scholar
  • Rautert T, Sachs EW (1997) Computational design of optimal output feedback controllers. SIAM J. Optim. 7(3):837–852.CrossrefGoogle Scholar
  • Reddi SJ, Sra S, Póczos B, Smola A (2016a) Stochastic Frank-Wolfe methods for nonconvex optimization. Proc. 54th Annual Allerton Conf. Comm., Control, and Comput. (Allerton) (IEEE, Piscataway, NJ), 1244–1251.Google Scholar
  • Reddi SJ, Sra S, Poczos B, Smola AJ (2016b) Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 1145–1153.Google Scholar
  • Reddi SJ, Hefny A, Sra S, Póczos B, Smola A (2016c) Stochastic variance reduction for nonconvex optimization. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., vol. 48 (PMLR, New York), 314–323.Google Scholar
  • Rhee CH, Glynn P (2017) Lyapunov conditions for differentiability of Markov chain expectations: The absolutely continuous case. Preprint, submitted July 12, https://arxiv.org/abs/ 1707.03870.Google Scholar
  • Riedmiller M, Peters J, Schaal S (2007) Evaluation of policy gradient methods and variants on the cart-pole benchmark. Proc. IEEE Internat. Sympos. Approximate Dynamic Programming and Reinforcement Learn. (IEEE, Piscataway, NJ), 254–261.Google Scholar
  • Salimans T, Ho J, Chen X, Sidor S, Sutskever I (2017) Evolution strategies as a scalable alternative to reinforcement learning. Preprint, submitted September 7, https://arxiv.org/abs/ 1703.03864.Google Scholar
  • Scherrer B, Geist M (2014) Local policy search in a convex space and conservative policy iteration as boosted policy search. Proc. Joint Eur. Conf. Machine Learn. and Knowledge Discovery in Databases (Springer, Berlin), 35–50.Google Scholar
  • Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. Bach F, Blei D, eds. Proc. 32nd Internat. Conf. Machine Learn., vol. 37 (PMLR, New York), 1889–1897.Google Scholar
  • Schulman J, Moritz P, Levine S, Jordan M, Abbeel P (2016) High-dimensional continuous control using generalized advantage estimation. Bengio Y, LeCun Y, eds. Proc. 4th Internat. Conf. Learn. Representations.Google Scholar
  • Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. Preprint, submitted August 28, https://arxiv.org/abs/ 1707.06347.Google Scholar
  • Shani L, Efroni Y, Mannor S (2020) Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPS. Rossi F, Conitzer V, Sha F, eds. Proc. 34th AAAI Conf. Artificial Intelligence, vol. 34 (AAAI Press, Palo Alto, CA).Google Scholar
  • Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M (2014) Deterministic policy gradient algorithms. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn., vol. 32 (PMLR, New York), 387–395.Google Scholar
  • Singh SP, Jaakkola T, Jordan MI (1994) Reinforcement learning with soft state aggregation. Tesauro G, Touretzky D, Leen T, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 361–368.Google Scholar
  • Sun J, Qu Q, Wright J (2017) Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Trans. Inform. Theory 63(2):853–884.CrossrefGoogle Scholar
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • Sutton RS, McAllester DA, Singh SP, Mansour Y (1999) Policy gradient methods for reinforcement learning with function approximation. Solla S, Leen T, Müller K, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 1057–1063.Google Scholar
  • Toivonen HT (1985) A globally convergent algorithm for the optimal constant output feedback problem. Internat. J. Control 41(6):1589–1599.CrossrefGoogle Scholar
  • Tsitsiklis JN, Van Roy B (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1–3):59–94.CrossrefGoogle Scholar
  • Uehara M, Imaizumi M, Jiang N, Kallus N, Sun W, Xie T (2021) Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. Preprint, submitted February 5, https://arxiv.org/abs/ 2102.02981.Google Scholar
  • Van Roy B (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.LinkGoogle Scholar
  • Wang R, Foster D, Kakade SM (2021) What are the statistical limits of offline RL with linear function approximation? Bengio Y, LeCun Y, eds. Proc. 9th Internat. Conf. Learn. Representations.Google Scholar
  • Wang L, Cai Q, Yang Z, Wang Z (2019) Neural policy gradient methods: Global optimality and rates of convergence. Bengio Y, LeCun Y, eds. Proc. 7th Internat. Conf. Learn. Representations (Curran Associates, Red Hook, NY).Google Scholar
  • Weisz G, Amortila P, Szepesvári C (2021) Exponential lower bounds for planning in MDPs with linearly realizable optimal action-value functions. Proc. Machine Learn. Res. (PMLR, New York), 1237–1264.Google Scholar
  • Xiao L, Zhang T (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4):2057–2075.CrossrefGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.