Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization
Published Online:2 Dec 2021https://doi.org/10.1287/opre.2021.2151
References
- (2019) Reinforcement Learn.: Theory and algorithms. Technical report, University of Washington Seattle, Seattle, WA. Google Scholar
- (2020a) Model-based reinforcement Learn. with a generative model is minimax optimal. Proc. 33rd Conf. Learn. Theory, 67–83.Google Scholar
- (2020b) Optimality and approximation with policy gradient methods in Markov decision processes. Proc. 33rd Conf. Learn. Theory, 64–66.Google Scholar
- (2019) Understanding the impact of entropy on policy optimization. Proc. 36th Internat. Conf. Machine Learn., 151–160.Google Scholar
- (1998) Natural gradient works efficiently in Learn.. Neural Comput. 10(2):251–276.Crossref, Google Scholar
- (2013) Minimax PAC bounds on the sample complexity of reinforcement Learn. with a generative model. Machine Learn. 91(3):325–349.Crossref, Google Scholar
- (1952) On the theory of dynamic programming. Proc. Natl. Acad. Sci. USA 38(8):716.Crossref, Google Scholar
- (2017) Dynamic Programming and Optimal Control, 4th ed. (Athena Scientific, Belmont, MA).Google Scholar
- (2019) Global optimality guarantees for policy gradient methods. Preprint, submitted June 5, https://arxiv.org/abs/1906.01786.Google Scholar
- (2020) A note on the linear convergence of policy gradient methods. Preprint, submitted July 21, https://arxiv.org/abs/2007.11120.Google Scholar
- (2009) Natural actor-critic algorithms. Automatica J. IFAC. 45(11):2471–2482.Crossref, Google Scholar
- (2019) Provably efficient exploration in policy optimization. Proc. 37th Conf. Machine Learn., PMLR 119:1283–1294.Google Scholar
- Cen S, Wei Y, Chi Y (2021) Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization. arXiv preprint arXiv:2105.15186.Google Scholar
- (2018) SBEED: Convergent reinforcement Learn. with nonlinear function approximation. Proc. 35th Internat. Conf. Machine Learn., PMLR, 80:1125–1134.Google Scholar
- (2016) Benchmarking deep reinforcement Learn. for continuous control. Proc. 33rd Internat. Conf. Machine Learn., PMLR, 48:1329–1338.Google Scholar
- (2009) Online Markov decision processes. Math. Oper. Res. 34(3):726–736.Link, Google Scholar
- (2018) Global convergence of policy gradient methods for the linear quadratic regulator. Proc. 35th Internat. Conf. Machine Learn., PMLR, 80:1467–1476.Google Scholar
- (2019) A theory of regularized Markov decision processes. Internat. Conf. Machine Learn., 2160–2169.Google Scholar
- (2019) Planning in entropy-regularized markov decision processes and games. Advances in Neural Information Processing Systems 32:12404–12413.Google Scholar
- (2017) Reinforcement Learn. with deep energy-based policies. Proc. 34th Internat. Conf. Machine Learn., PMLR, 70:1352–1361.Google Scholar
- (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement Learn. with a stochastic actor. Proc. 35th Internat. Conf. Machine Learn., PMLR, 80:1861–1870.Google Scholar
- (2019) Provably efficient maximum entropy exploration. Proc. 36th Internat. Conf. Machine Learn., PMLR, 97:2681–2691.Google Scholar
- (2020) Convergence guarantees of policy optimization methods for Markovian jump linear systems. arXiv preprint arXiv:2002.04090.Google Scholar
- (2002) A natural policy gradient. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 1531–1538.Google Scholar
- (2002) Approximately optimal approximate reinforcement Learn. Proc. 19th Internat. Conf. Machine Learn., 267–274.Google Scholar
- (2019) Non-asymptotic analysis of biased stochastic approximation scheme. Proc. Thirty-Second Conf. Learn. Theory, 99:1944–1974.Google Scholar
- (2000) Actor-critic algorithms. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA) 1008–1014.Google Scholar
- (2020a) Breaking the sample size barrier in model-based reinforcement Learn. with a generative model. 34th Conf. Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 12861–12872.Google Scholar
- (2020b) Sample complexity of asynchronous Q-Learn.: Sharper analysis and variance reduction. Preprint, submitted Jun 4, https://arxiv.org/abs/2006.03041.Google Scholar
- (2021b) Softmax policy gradient methods can take exponential time to converge. Belkin M, Kpotufe S, eds. Proc. 34th Conf. Learning Theory (PMLR), 134:3107–3110.Google Scholar
- (2021a) Is Q-Learn. minimax optimal? a tight sample complexity analysis. arXiv preprint arXiv:2102.06548.Google Scholar
- (2019). Neural proximal trust region policy optimization attains globally optimal policy. 33rd Conf. Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada.Google Scholar
- (2020) On the global convergence rates of softmaz policy gradient methods. Proc. 37th Internat. Conf. Machine Learn., PMLR, 119:6820–6829.Google Scholar
- (2016). Asynchronous methods for deep reinforcement Learn. Proc. 33rd Internat. Conf. Machine Learn., PMLR, 48:1928–1937.Google Scholar
- (2015) Human-level control through deep reinforcement Learn. Nature 518(7540):529–533.Crossref, Google Scholar
- (2019). Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem. Preprint, submitted December 26, https://arxiv.org/abs/1912.11899.Google Scholar
- (2017). Bridging the gap between value and policy based reinforcement Learn.. Preprint, submitted November 22, https://arxiv.org/abs/1702.08892.Google Scholar
- (1983) Problem complexity and method efficiency in optimization. (J. Wiley & Sons).Google Scholar
- (2009) Primal-dual subgradient methods for convex problems. Math. Programming 120(1):221–259.Crossref, Google Scholar
- (2017). A unified view of entropy-regularized Markov decision processes. Preprint, submitted May 22, https://arxiv.org/abs/1705.07798.Google Scholar
- (2008) Natural actor-critic. Neurocomputing 71(7-9):1180–1190.Crossref, Google Scholar
- (2010) Relative entropy policy search. Proc. AAAI Conf. Artificial Intelligence, 24(1):1607–1612.Google Scholar
- (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming. (John Wiley & Sons, Hoboken, NJ).Google Scholar
- (2017a). Equivalence between policy gradients and soft Q-Learn. Preprint, submitted April 21, https://arxiv.org/abs/1704.06440.Google Scholar
- (2015) Trust region policy optimization. Proc. 32nd Conf. Machine Learn., PMLR, 37:1889–1897.Google Scholar
- (2017b) Proximal policy optimization algorithms. Preprint, submitted July 20, https://arxiv.org/abs/1707.06347.Google Scholar
- (2019) Adaptive trust region policy optimization: global convergence and faster rates for regularized MDPs. Proc. AAAI Conf. Artificial Intelligence 34(4):5668–5675.Google Scholar
- (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.Google Scholar
- (2000) Policy gradient methods for reinforcement Learn. with function approximation. NIPS’99, 1057–1063.Google Scholar
- (2019) The gap between model-based and model-free methods on the linear quadratic regulator: an asymptotic viewpoint. Proc. Thirty-Second Conf. Learn. Theory, PMLR, 99:3036–3083.Google Scholar
- (2020) Leverage the average: an analysis of KL regularization in RL. Preprint, submitted March 31, https://arxiv.org/abs/2003.14089.Google Scholar
- (2019) Neural policy gradient methods: Global optimality and rates of convergence. Preprint, submitted August 29, https://arxiv.org/abs/1909.01150.Google Scholar
- (1992) Simple statistical gradient-following algorithms for connectionist reinforcement Learn. Machine Learn. 8(3-4):229–256.Crossref, Google Scholar
- (1991) Function optimization using connectionist reinforcement Learn. algorithms. Connect. Sci. 3(3):241–268.Crossref, Google Scholar
- (2020) A finite time analysis of two time-scale actor critic methods. Preprint, submitted May 4, https://arxiv.org/abs/2005.01350.Google Scholar
- (2019) Maximum entropy Monte-Carlo planning. Advances in Neural Information Processing Systems (Curran Associates, Inc., Red Hook, NY), 9520–9528.Google Scholar
- (2020) Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms. Preprint, submitted May 7, https://arxiv.org/abs/2005.03557.Google Scholar
- (2019a) Policy optimization for ℋ2 linear control with ℋ∞ robustness guarantee: implicit regularization and global convergence. Proc. 2nd Conf. Learn. Dynam. Control, PMLR, 120:179–190.Google Scholar
- (2019b). Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM J. Control Optim. 58(6):3586–3612.Crossref, Google Scholar

