Open Access

Global Optimality Guarantees for Policy Gradient Methods

Jalaj Bhandari
Jalaj Bhandari
[email protected]
https://orcid.org/0000-0002-7115-8986
Operations Research, Columbia University, New York, New York 10027;
Search for more papers by this author
,
Daniel Russo
Corresponding Author
Daniel Russo
[email protected]
https://orcid.org/0000-0001-5926-8624
Graduate School of Business, Columbia University, New York, New York 10027
Search for more papers by this author

Jalaj Bhandari

[email protected]

https://orcid.org/0000-0002-7115-8986

Operations Research, Columbia University, New York, New York 10027;

Search for more papers by this author

Daniel Russo

Corresponding Author

Daniel Russo

[email protected]

https://orcid.org/0000-0001-5926-8624

Graduate School of Business, Columbia University, New York, New York 10027

Search for more papers by this author

Published Online:5 Jan 2024https://doi.org/10.1287/opre.2021.0014

References

Agarwal A, Kakade SM, Lee JD, Mahajan G (2020) Optimality and approximation with policy gradient methods in Markov decision processes. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR, New York), 64–66.Google Scholar
Agarwal N, Allen-Zhu Z, Bullins B, Hazan E, Ma T (2017) Finding approximate local minima faster than gradient descent. Hatami PM, King V, eds. Proc. 49th Annual ACM SIGACT Sympos. on Theory of Comput. (ACM, New York), 1195–1199.Google Scholar
Antos A, Szepesvári C, Munos R (2008) Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learn. 71(1):89–129.Crossref, Google Scholar
Asmussen S, Glynn PW (2007) Stochastic Simulation: Algorithms and Analysis, vol. 57 (Springer Science & Business Media, Berlin).Crossref, Google Scholar
Bagnell A, Kakade SM, Schneider JG, Ng AY (2004) Policy search by dynamic programming, vol. 16. Thrun S, Saul L, Schölkopf B, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 831–838.Google Scholar
Baxter J, Bartlett PL (2001) Infinite-horizon policy-gradient estimation. J. Artificial Intelligence Res. 15(1):319–350.Crossref, Google Scholar
Beck A (2002) Convergence rate analysis of gradient based algorithms. PhD thesis, Tel-Aviv University, Tel Aviv, Israel.Google Scholar
Beck A (2017) First-Order Methods in Optimization, vol. 25 (SIAM, Philadelphia).Crossref, Google Scholar
Beck A, Teboulle M (2003) Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3):167–175.Crossref, Google Scholar
Bertsekas DP (1995) Dynamic Programming and Optimal Control (Athena Scientific, Belmont, MA).Google Scholar
Bertsekas DP (1997) Nonlinear programming. J. Oper. Res. Soc. 48(3):334–334.Crossref, Google Scholar
Bertsekas DP (2011) Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 9(3):310–335.Crossref, Google Scholar
Bertsekas DP (2019) Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA J. Automated Sinica 6(1):1–31.Crossref, Google Scholar
Bertsekas DP, Shreve S (1978) Stochastic Optimal Control: The Discrete-Time Case (Athena Scientific, Belmont, MA).Google Scholar
Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming, vol. 5 (Athena Scientific, Belmont, MA).Google Scholar
Bhandari J, Russo D (2021) Global optimiality guarantees for policy gradient methods: Technical report with supplementary materials. Accessed January 12, 2022, https://djrusso.github.io/docs/policy_grad_optimality_EC.pdf.Google Scholar
Bhojanapalli S, Neyshabur B, Srebro N (2016) Global optimality of local search for low rank matrix recovery. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 3873–3881.Google Scholar
Blackwell D (1965) Discounted dynamic programming. Ann. Math. Statist. 36(1):226–235.Crossref, Google Scholar
Borkar VS (2009) Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48 (Springer, Berlin).Google Scholar
Boyd S, Vandenberghe L (2004) Convex Optimization (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Caramanis M, Liberopoulos G (1992) Perturbation analysis for the design of flexible manufacturing system flow controllers. Oper. Res. 40(6):1107–1125.Link, Google Scholar
Carmon Y, Duchi JC, Hinder O, Sidford A (2018) Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2):1751–1772.Crossref, Google Scholar
Chen J, Jiang N (2019) Information-theoretic considerations in batch reinforcement learning. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 1042–1051.Google Scholar
Davis D, Drusvyatskiy D, Kakade S, Lee JD (2020) Stochastic subgradient method converges on tame functions. Foundation Comput. Math. 20(1):119–154.Crossref, Google Scholar
Defazio A, Bach F, Lacoste-Julien S (2014) SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 1646–1654.Google Scholar
Du SS, Lee JD (2018) On the power of over-parametrization in neural networks with quadratic activation. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR, New York), 1329–1338.Google Scholar
Du SS, Kakade SM, Wang R, Yang LF (2020) Is a good representation sufficient for sample efficient reinforcement learning? Bengio Y, LeCun Y, eds. Proc. 8th Internat. Conf. Learn. Representations (Curran Associates, Red Hook, NY), 11427–11446.Google Scholar
Evans LC (2005) An Introduction to Mathematical Optimal Control Theory. Lecture Notes (University of California, Department of Mathematics, Berkeley, CA).Google Scholar
Farahmand Am, Szepesvári C, Munos R (2010) Error propagation for approximate policy and value iteration. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 568–576.Google Scholar
Fazel M, Ge R, Kakade S, Mesbahi M (2018) Global convergence of policy gradient methods for the linear quadratic regulator. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR, New York), 1467–1476.Google Scholar
Ferns N, Panangaden P, Precup D (2004) Metrics for finite Markov decision processes. Meek C, Chickering M, Halpern J, eds. Proc. 20th Conf. Uncertainty in Artificial Intelligence (AUAI Press, Charlottesville, VA), 162–169.Google Scholar
Fu MC (2006) Gradient estimation. Handbook Oper. Res. Management Sci. 13:575–616.Google Scholar
Ge R, Lee JD, Ma T (2016) Matrix completion has no spurious local minimum. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 2973–2981.Google Scholar
Ge R, Huang F, Jin C, Yuan Y (2015) Escaping from saddle points-online stochastic gradient for tensor decomposition. Grünwald P, Hazan E, Kale S, eds. Proc. 28th Conf. Learn. Theory, vol. 40 (PMLR, New York), 797–842.Google Scholar
Geist M, Piot B, Pietquin O (2017) Is the Bellman residual a bad proxy? Guyon I, Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 3208–3217.Google Scholar
Geist M, Scherrer B, Pietquin O (2019) A theory of regularized Markov decision processes. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 2160–2169.Google Scholar
Ghadimi S, Lan G (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4):2341–2368.Crossref, Google Scholar
Ghadimi S, Lan G (2016) Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Programming 156(1):59–99.Crossref, Google Scholar
Glasserman P, Ho YC (1991) Gradient Estimation via Perturbation Analysis, vol. 116 (Springer Science & Business Media, Berlin).Google Scholar
Glasserman P, Tayur S (1995) Sensitivity analysis for base-stock levels in multiechelon production-inventory systems. Management Sci. 41(2):263–281.Link, Google Scholar
Hernández-Lerma O, Lasserre JB (2012) Discrete-Time Markov Control Processes: Basic Optimality Criteria, vol. 30 (Springer Science & Business Media, Berlin).Google Scholar
Hewer G (1971) An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Trans. Automated Control 16(4):382–384.Crossref, Google Scholar
Huh WT, Rusmevichientong P (2013) Online sequential optimization with biased gradients: Theory and applications to censored demand. INFORMS J. Comput. 26(1):150–159.Link, Google Scholar
Jacot A, Gabriel F, Hongler C (2018) Neural tangent kernel: Convergence and generalization in neural networks. Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 8571–8580.Google Scholar
Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR, New York), 2137–2143.Google Scholar
Jin C, Ge R, Netrapalli P, Kakade SM, Jordan MI (2017) How to escape saddle points efficiently. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR, New York), 1724–1732.Google Scholar
Kakade SM (2002) A natural policy gradient. Dietterich TG, Becker S, Ghahramani Z, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 1531–1538.Google Scholar
Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. Sammut C, Hoffmann AG, eds. Proc. 19th Internat. Conf. Machine Learn., vol. 2 (Morgan Kaufmann Publishers, Burlington, MA), 267–274.Google Scholar
Kallus N, Uehara M (2020a) Doubly robust off-policy value and gradient estimation for deterministic policies. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 10420–10430.Google Scholar
Kallus N, Uehara M (2020b) Statistically efficient off-policy policy gradients. III HD, Singh A, eds., Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, New York), 5089–5100.Google Scholar
Karimi H, Nutini J, Schmidt M (2016) Linear convergence of gradient and proximal-gradient methods under the Polyak-łojasiewicz condition. Proc. Joint Eur. Conf. Machine Learn. and Knowledge Discovery in Databases (Springer, Berlin), 795–811.Google Scholar
Kawaguchi K (2016) Deep learning without poor local minima. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 586–594.Google Scholar
Kleinman D (1968) On an iterative technique for Riccati equation computations. IEEE Trans. Automated Control 13(1):114–115.Crossref, Google Scholar
Kunnumkal S, Topaloglu H (2008) Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Oper. Res. 56(3):646–664.Link, Google Scholar
L’Ecuyer P, Glynn PW (1994) Stochastic optimization by simulation: Convergence proofs for the gi/g/1 queue in steady-state. Management Sci. 40(11):1562–1578.Link, Google Scholar
Lee JD, Simchowitz M, Jordan MI, Recht B (2016) Gradient descent only converges to minimizers. Feldman V, Rakhlin A, Shamir O, eds. Proc. 29th Annual Conf. Learn. Theory, vol. 49 (PMLR, New York), 1246–1257.Google Scholar
Livni R, Shalev-Shwartz S, Shamir O (2014) On the computational efficiency of training neural networks. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 855–863.Google Scholar
Marbach P, Tsitsiklis JN (2001) Simulation-based optimization of Markov reward processes. IEEE Trans. Automated Control 46(2):191–209.Crossref, Google Scholar
Maxwell MS, Henderson SG, Topaloglu H (2013) Tuning approximate dynamic programming policies for ambulance redeployment via direct search. Stochastic Systems 3(2):322–361.Link, Google Scholar
Mohamed S, Rosca M, Figurnov M, Mnih A (2020) Monte Carlo gradient estimation in machine learning. J. Machine Learn. Res. 21(132):1–62.Google Scholar
Munos R (2003) Error bounds for approximate policy iteration. Fawcett T, Mishra N, eds. Proc. 20th Internat. Conf. Machine Learn. (AAAI Press, Palo Alto, CA), 560–567.Google Scholar
Munos R (2005) Error bounds for approximate value iteration. Proc. National Conf. Artificial Intelligence, vol. 20 (AAAI Press, Palo Alto, CA), 1006.Google Scholar
Munos R (2007) Performance bounds in L_p-norm for approximate value iteration. SIAM J. Control Optim. 46(2):541–561.Crossref, Google Scholar
Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(27):815–857.Google Scholar
Nesterov Y, Polyak BT (2006) Cubic regularization of Newton method and its global performance. Math. Programming 108(1):177–205.Crossref, Google Scholar
Osband I, Van Roy B, Russo DJ, Wen Z (2019) Deep exploration via randomized value functions. J. Maching Learn. Res. 20(124):1–62.Google Scholar
Peters J, Schaal S (2006) Policy gradient methods for robotics. Proc. IEEE/RSJ Internat. Conf. Intelligent Robots and Systems (IEEE, Piscataway, NJ), 2219–2225.Google Scholar
Pflug GC (1988) Derivatives of probability measures-concepts and applications to the optimization of stochastic systems. Varaiya P, Kurzhanski A, eds. Discrete Event Systems: Models and Applications. Lecture Notes in Control and Information Sciences, vol. 103 (Springer, Berlin), 252–274.Google Scholar
Pflug GC (1990) On-line optimization of simulated Markovian processes. Math. Oper. Res. 15(3):381–395.Link, Google Scholar
Polyak BT (1963) Gradient methods for the minimisation of functionals. USSR Comput. Math. Math. Physics 3(4):864–878.Crossref, Google Scholar
Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
Rajeswaran A, Lowrey K, Todorov EV, Kakade SM (2017) Toward generalization and simplicity in continuous control. Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 6550–6561.Google Scholar
Rautert T, Sachs EW (1997) Computational design of optimal output feedback controllers. SIAM J. Optim. 7(3):837–852.Crossref, Google Scholar
Reddi SJ, Sra S, Póczos B, Smola A (2016a) Stochastic Frank-Wolfe methods for nonconvex optimization. Proc. 54th Annual Allerton Conf. Comm., Control, and Comput. (Allerton) (IEEE, Piscataway, NJ), 1244–1251.Google Scholar
Reddi SJ, Sra S, Poczos B, Smola AJ (2016b) Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 1145–1153.Google Scholar
Reddi SJ, Hefny A, Sra S, Póczos B, Smola A (2016c) Stochastic variance reduction for nonconvex optimization. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., vol. 48 (PMLR, New York), 314–323.Google Scholar
Rhee CH, Glynn P (2017) Lyapunov conditions for differentiability of Markov chain expectations: The absolutely continuous case. Preprint, submitted July 12, https://arxiv.org/abs/ 1707.03870.Google Scholar
Riedmiller M, Peters J, Schaal S (2007) Evaluation of policy gradient methods and variants on the cart-pole benchmark. Proc. IEEE Internat. Sympos. Approximate Dynamic Programming and Reinforcement Learn. (IEEE, Piscataway, NJ), 254–261.Google Scholar
Salimans T, Ho J, Chen X, Sidor S, Sutskever I (2017) Evolution strategies as a scalable alternative to reinforcement learning. Preprint, submitted September 7, https://arxiv.org/abs/ 1703.03864.Google Scholar
Scherrer B, Geist M (2014) Local policy search in a convex space and conservative policy iteration as boosted policy search. Proc. Joint Eur. Conf. Machine Learn. and Knowledge Discovery in Databases (Springer, Berlin), 35–50.Google Scholar
Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. Bach F, Blei D, eds. Proc. 32nd Internat. Conf. Machine Learn., vol. 37 (PMLR, New York), 1889–1897.Google Scholar
Schulman J, Moritz P, Levine S, Jordan M, Abbeel P (2016) High-dimensional continuous control using generalized advantage estimation. Bengio Y, LeCun Y, eds. Proc. 4th Internat. Conf. Learn. Representations.Google Scholar
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. Preprint, submitted August 28, https://arxiv.org/abs/ 1707.06347.Google Scholar
Shani L, Efroni Y, Mannor S (2020) Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPS. Rossi F, Conitzer V, Sha F, eds. Proc. 34th AAAI Conf. Artificial Intelligence, vol. 34 (AAAI Press, Palo Alto, CA).Google Scholar
Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M (2014) Deterministic policy gradient algorithms. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn., vol. 32 (PMLR, New York), 387–395.Google Scholar
Singh SP, Jaakkola T, Jordan MI (1994) Reinforcement learning with soft state aggregation. Tesauro G, Touretzky D, Leen T, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 361–368.Google Scholar
Sun J, Qu Q, Wright J (2017) Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Trans. Inform. Theory 63(2):853–884.Crossref, Google Scholar
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
Sutton RS, McAllester DA, Singh SP, Mansour Y (1999) Policy gradient methods for reinforcement learning with function approximation. Solla S, Leen T, Müller K, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 1057–1063.Google Scholar
Toivonen HT (1985) A globally convergent algorithm for the optimal constant output feedback problem. Internat. J. Control 41(6):1589–1599.Crossref, Google Scholar
Tsitsiklis JN, Van Roy B (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1–3):59–94.Crossref, Google Scholar
Uehara M, Imaizumi M, Jiang N, Kallus N, Sun W, Xie T (2021) Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. Preprint, submitted February 5, https://arxiv.org/abs/ 2102.02981.Google Scholar
Van Roy B (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.Link, Google Scholar
Wang R, Foster D, Kakade SM (2021) What are the statistical limits of offline RL with linear function approximation? Bengio Y, LeCun Y, eds. Proc. 9th Internat. Conf. Learn. Representations.Google Scholar
Wang L, Cai Q, Yang Z, Wang Z (2019) Neural policy gradient methods: Global optimality and rates of convergence. Bengio Y, LeCun Y, eds. Proc. 7th Internat. Conf. Learn. Representations (Curran Associates, Red Hook, NY).Google Scholar
Weisz G, Amortila P, Szepesvári C (2021) Exponential lower bounds for planning in MDPs with linearly realizable optimal action-value functions. Proc. Machine Learn. Res. (PMLR, New York), 1237–1264.Google Scholar
Xiao L, Zhang T (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4):2057–2075.Crossref, Google Scholar

Volume 72, Issue 5

September-October 2024

Pages iii-vii, 1751-2261, C2-C3

Article Information

Supplemental Material

Metrics

Information

Received:January 08, 2021
Accepted:December 01, 2022
Published Online:January 05, 2024

Cite as

Jalaj Bhandari, Daniel Russo (2024) Global Optimality Guarantees for Policy Gradient Methods. Operations Research 72(5):1906-1927.

https://doi.org/10.1287/opre.2021.0014

Keywords

Acknowledgments

The authors thank the anonymous referees for feedback that helped improve some aspects of the paper. A part of this work was completed when J. Bhandari was a research fellow at the Theory of Reinforcement learning program at the Simons Institute for the Theory of Computing, University of California, Berkeley. J. Bhandari thanks Peter Bartlett and Simons Institute for that opportunity and Gaurd Iyengar for love and support during the PhD program at Columbia University.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Global Optimality Guarantees for Policy Gradient Methods

References

Volume 72, Issue 5

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News