Global Optimality Guarantees for Policy Gradient Methods
References
- (2020) Optimality and approximation with policy gradient methods in Markov decision processes. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR, New York), 64–66.Google Scholar
- (2017) Finding approximate local minima faster than gradient descent. Hatami PM, King V, eds. Proc. 49th Annual ACM SIGACT Sympos. on Theory of Comput. (ACM, New York), 1195–1199.Google Scholar
- (2008) Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learn. 71(1):89–129.Crossref, Google Scholar
- (2007) Stochastic Simulation: Algorithms and Analysis, vol. 57 (Springer Science & Business Media, Berlin).Crossref, Google Scholar
- (2004) Policy search by dynamic programming, vol. 16. Thrun S, Saul L, Schölkopf B, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 831–838.Google Scholar
- (2001) Infinite-horizon policy-gradient estimation. J. Artificial Intelligence Res. 15(1):319–350.Crossref, Google Scholar
- (2002) Convergence rate analysis of gradient based algorithms. PhD thesis, Tel-Aviv University, Tel Aviv, Israel.Google Scholar
- (2017) First-Order Methods in Optimization, vol. 25 (SIAM, Philadelphia).Crossref, Google Scholar
- (2003) Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3):167–175.Crossref, Google Scholar
- (1995) Dynamic Programming and Optimal Control (Athena Scientific, Belmont, MA).Google Scholar
- (1997) Nonlinear programming. J. Oper. Res. Soc. 48(3):334–334.Crossref, Google Scholar
- (2011) Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 9(3):310–335.Crossref, Google Scholar
- (2019) Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA J. Automated Sinica 6(1):1–31.Crossref, Google Scholar
- (1978) Stochastic Optimal Control: The Discrete-Time Case (Athena Scientific, Belmont, MA).Google Scholar
- (1996) Neuro-Dynamic Programming, vol. 5 (Athena Scientific, Belmont, MA).Google Scholar
- (2021) Global optimiality guarantees for policy gradient methods: Technical report with supplementary materials. Accessed January 12, 2022, https://djrusso.github.io/docs/policy_grad_optimality_EC.pdf.Google Scholar
- (2016) Global optimality of local search for low rank matrix recovery. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 3873–3881.Google Scholar
- (1965) Discounted dynamic programming. Ann. Math. Statist. 36(1):226–235.Crossref, Google Scholar
- (2009) Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48 (Springer, Berlin).Google Scholar
- (2004) Convex Optimization (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- (1992) Perturbation analysis for the design of flexible manufacturing system flow controllers. Oper. Res. 40(6):1107–1125.Link, Google Scholar
- (2018) Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2):1751–1772.Crossref, Google Scholar
- (2019) Information-theoretic considerations in batch reinforcement learning. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 1042–1051.Google Scholar
- (2020) Stochastic subgradient method converges on tame functions. Foundation Comput. Math. 20(1):119–154.Crossref, Google Scholar
- (2014) SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 1646–1654.Google Scholar
- (2018) On the power of over-parametrization in neural networks with quadratic activation. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR, New York), 1329–1338.Google Scholar
- (2020) Is a good representation sufficient for sample efficient reinforcement learning? Bengio Y, LeCun Y, eds. Proc. 8th Internat. Conf. Learn. Representations (Curran Associates, Red Hook, NY), 11427–11446.Google Scholar
- (2005) An Introduction to Mathematical Optimal Control Theory. Lecture Notes (University of California, Department of Mathematics, Berkeley, CA).Google Scholar
- (2010) Error propagation for approximate policy and value iteration. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 568–576.Google Scholar
- (2018) Global convergence of policy gradient methods for the linear quadratic regulator. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR, New York), 1467–1476.Google Scholar
- (2004) Metrics for finite Markov decision processes. Meek C, Chickering M, Halpern J, eds. Proc. 20th Conf. Uncertainty in Artificial Intelligence (AUAI Press, Charlottesville, VA), 162–169.Google Scholar
- (2006) Gradient estimation. Handbook Oper. Res. Management Sci. 13:575–616.Google Scholar
- (2016) Matrix completion has no spurious local minimum. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 2973–2981.Google Scholar
- (2015) Escaping from saddle points-online stochastic gradient for tensor decomposition. Grünwald P, Hazan E, Kale S, eds. Proc. 28th Conf. Learn. Theory, vol. 40 (PMLR, New York), 797–842.Google Scholar
- (2017) Is the Bellman residual a bad proxy? Guyon I, Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 3208–3217.Google Scholar
- (2019) A theory of regularized Markov decision processes. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 2160–2169.Google Scholar
- (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4):2341–2368.Crossref, Google Scholar
- (2016) Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Programming 156(1):59–99.Crossref, Google Scholar
- (1991) Gradient Estimation via Perturbation Analysis, vol. 116 (Springer Science & Business Media, Berlin).Google Scholar
- (1995) Sensitivity analysis for base-stock levels in multiechelon production-inventory systems. Management Sci. 41(2):263–281.Link, Google Scholar
- (2012) Discrete-Time Markov Control Processes: Basic Optimality Criteria, vol. 30 (Springer Science & Business Media, Berlin).Google Scholar
- (1971) An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Trans. Automated Control 16(4):382–384.Crossref, Google Scholar
- (2013) Online sequential optimization with biased gradients: Theory and applications to censored demand. INFORMS J. Comput. 26(1):150–159.Link, Google Scholar
- (2018) Neural tangent kernel: Convergence and generalization in neural networks. Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 8571–8580.Google Scholar
- (2020) Provably efficient reinforcement learning with linear function approximation. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR, New York), 2137–2143.Google Scholar
- (2017) How to escape saddle points efficiently. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR, New York), 1724–1732.Google Scholar
- (2002) A natural policy gradient. Dietterich TG, Becker S, Ghahramani Z, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 1531–1538.Google Scholar
- (2002) Approximately optimal approximate reinforcement learning. Sammut C, Hoffmann AG, eds. Proc. 19th Internat. Conf. Machine Learn., vol. 2 (Morgan Kaufmann Publishers, Burlington, MA), 267–274.Google Scholar
- (2020a) Doubly robust off-policy value and gradient estimation for deterministic policies. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 10420–10430.Google Scholar
- (2020b) Statistically efficient off-policy policy gradients. III HD, Singh A, eds., Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, New York), 5089–5100.Google Scholar
- (2016) Linear convergence of gradient and proximal-gradient methods under the Polyak-łojasiewicz condition. Proc. Joint Eur. Conf. Machine Learn. and Knowledge Discovery in Databases (Springer, Berlin), 795–811.Google Scholar
- (2016) Deep learning without poor local minima. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 586–594.Google Scholar
- (1968) On an iterative technique for Riccati equation computations. IEEE Trans. Automated Control 13(1):114–115.Crossref, Google Scholar
- (2008) Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Oper. Res. 56(3):646–664.Link, Google Scholar
- (1994) Stochastic optimization by simulation: Convergence proofs for the gi/g/1 queue in steady-state. Management Sci. 40(11):1562–1578.Link, Google Scholar
- (2016) Gradient descent only converges to minimizers. Feldman V, Rakhlin A, Shamir O, eds. Proc. 29th Annual Conf. Learn. Theory, vol. 49 (PMLR, New York), 1246–1257.Google Scholar
- (2014) On the computational efficiency of training neural networks. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 855–863.Google Scholar
- (2001) Simulation-based optimization of Markov reward processes. IEEE Trans. Automated Control 46(2):191–209.Crossref, Google Scholar
- (2013) Tuning approximate dynamic programming policies for ambulance redeployment via direct search. Stochastic Systems 3(2):322–361.Link, Google Scholar
- (2020) Monte Carlo gradient estimation in machine learning. J. Machine Learn. Res. 21(132):1–62.Google Scholar
- (2003) Error bounds for approximate policy iteration. Fawcett T, Mishra N, eds. Proc. 20th Internat. Conf. Machine Learn. (AAAI Press, Palo Alto, CA), 560–567.Google Scholar
- (2005) Error bounds for approximate value iteration. Proc. National Conf. Artificial Intelligence, vol. 20 (AAAI Press, Palo Alto, CA), 1006.Google Scholar
- (2007) Performance bounds in L_p-norm for approximate value iteration. SIAM J. Control Optim. 46(2):541–561.Crossref, Google Scholar
- (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(27):815–857.Google Scholar
- (2006) Cubic regularization of Newton method and its global performance. Math. Programming 108(1):177–205.Crossref, Google Scholar
- (2019) Deep exploration via randomized value functions. J. Maching Learn. Res. 20(124):1–62.Google Scholar
- (2006) Policy gradient methods for robotics. Proc. IEEE/RSJ Internat. Conf. Intelligent Robots and Systems (IEEE, Piscataway, NJ), 2219–2225.Google Scholar
- (1988) Derivatives of probability measures-concepts and applications to the optimization of stochastic systems. Varaiya P, Kurzhanski A, eds. Discrete Event Systems: Models and Applications. Lecture Notes in Control and Information Sciences, vol. 103 (Springer, Berlin), 252–274.Google Scholar
- (1990) On-line optimization of simulated Markovian processes. Math. Oper. Res. 15(3):381–395.Link, Google Scholar
- (1963) Gradient methods for the minimisation of functionals. USSR Comput. Math. Math. Physics 3(4):864–878.Crossref, Google Scholar
- (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
- (2017) Toward generalization and simplicity in continuous control. Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 6550–6561.Google Scholar
- (1997) Computational design of optimal output feedback controllers. SIAM J. Optim. 7(3):837–852.Crossref, Google Scholar
- (2016a) Stochastic Frank-Wolfe methods for nonconvex optimization. Proc. 54th Annual Allerton Conf. Comm., Control, and Comput. (Allerton) (IEEE, Piscataway, NJ), 1244–1251.Google Scholar
- (2016b) Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 1145–1153.Google Scholar
- (2016c) Stochastic variance reduction for nonconvex optimization. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., vol. 48 (PMLR, New York), 314–323.Google Scholar
- (2017) Lyapunov conditions for differentiability of Markov chain expectations: The absolutely continuous case. Preprint, submitted July 12, https://arxiv.org/abs/ 1707.03870.Google Scholar
- (2007) Evaluation of policy gradient methods and variants on the cart-pole benchmark. Proc. IEEE Internat. Sympos. Approximate Dynamic Programming and Reinforcement Learn. (IEEE, Piscataway, NJ), 254–261.Google Scholar
- (2017) Evolution strategies as a scalable alternative to reinforcement learning. Preprint, submitted September 7, https://arxiv.org/abs/ 1703.03864.Google Scholar
- (2014) Local policy search in a convex space and conservative policy iteration as boosted policy search. Proc. Joint Eur. Conf. Machine Learn. and Knowledge Discovery in Databases (Springer, Berlin), 35–50.Google Scholar
- (2015) Trust region policy optimization. Bach F, Blei D, eds. Proc. 32nd Internat. Conf. Machine Learn., vol. 37 (PMLR, New York), 1889–1897.Google Scholar
- (2016) High-dimensional continuous control using generalized advantage estimation. Bengio Y, LeCun Y, eds. Proc. 4th Internat. Conf. Learn. Representations.Google Scholar
- (2017) Proximal policy optimization algorithms. Preprint, submitted August 28, https://arxiv.org/abs/ 1707.06347.Google Scholar
- (2020) Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPS. Rossi F, Conitzer V, Sha F, eds. Proc. 34th AAAI Conf. Artificial Intelligence, vol. 34 (AAAI Press, Palo Alto, CA).Google Scholar
- (2014) Deterministic policy gradient algorithms. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn., vol. 32 (PMLR, New York), 387–395.Google Scholar
- (1994) Reinforcement learning with soft state aggregation. Tesauro G, Touretzky D, Leen T, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 361–368.Google Scholar
- (2017) Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Trans. Inform. Theory 63(2):853–884.Crossref, Google Scholar
- (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
- (1999) Policy gradient methods for reinforcement learning with function approximation. Solla S, Leen T, Müller K, eds. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 1057–1063.Google Scholar
- (1985) A globally convergent algorithm for the optimal constant output feedback problem. Internat. J. Control 41(6):1589–1599.Crossref, Google Scholar
- (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1–3):59–94.Crossref, Google Scholar
- (2021) Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. Preprint, submitted February 5, https://arxiv.org/abs/ 2102.02981.Google Scholar
- (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.Link, Google Scholar
- (2021) What are the statistical limits of offline RL with linear function approximation? Bengio Y, LeCun Y, eds. Proc. 9th Internat. Conf. Learn. Representations.Google Scholar
- (2019) Neural policy gradient methods: Global optimality and rates of convergence. Bengio Y, LeCun Y, eds. Proc. 7th Internat. Conf. Learn. Representations (Curran Associates, Red Hook, NY).Google Scholar
- (2021) Exponential lower bounds for planning in MDPs with linearly realizable optimal action-value functions. Proc. Machine Learn. Res. (PMLR, New York), 1237–1264.Google Scholar
- (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4):2057–2075.Crossref, Google Scholar

