Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

Published Online:https://doi.org/10.1287/opre.2021.2151

References

  • Agarwal A, Jiang N, Kakade SM (2019) Reinforcement Learn.: Theory and algorithms. Technical report, University of Washington Seattle, Seattle, WA. Google Scholar
  • Agarwal A, Kakade S, Yang LF (2020a) Model-based reinforcement Learn. with a generative model is minimax optimal. Proc. 33rd Conf. Learn. Theory, 67–83.Google Scholar
  • Agarwal A, Kakade SM, Lee JD, Mahajan G (2020b) Optimality and approximation with policy gradient methods in Markov decision processes. Proc. 33rd Conf. Learn. Theory, 64–66.Google Scholar
  • Ahmed Z, Le Roux N, Norouzi M, Schuurmans D (2019) Understanding the impact of entropy on policy optimization. Proc. 36th Internat. Conf. Machine Learn., 151–160.Google Scholar
  • Amari SI (1998) Natural gradient works efficiently in Learn.. Neural Comput. 10(2):251–276.CrossrefGoogle Scholar
  • Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement Learn. with a generative model. Machine Learn. 91(3):325–349.CrossrefGoogle Scholar
  • Bellman R (1952) On the theory of dynamic programming. Proc. Natl. Acad. Sci. USA 38(8):716.CrossrefGoogle Scholar
  • Bertsekas DP (2017) Dynamic Programming and Optimal Control, 4th ed. (Athena Scientific, Belmont, MA).Google Scholar
  • Bhandari J, Russo D (2019) Global optimality guarantees for policy gradient methods. Preprint, submitted June 5, https://arxiv.org/abs/1906.01786.Google Scholar
  • Bhandari J, Russo D (2020) A note on the linear convergence of policy gradient methods. Preprint, submitted July 21, https://arxiv.org/abs/2007.11120.Google Scholar
  • Bhatnagar S, Sutton RS, Ghavamzadeh M, Lee M (2009) Natural actor-critic algorithms. Automatica J. IFAC. 45(11):2471–2482.CrossrefGoogle Scholar
  • Cai Q, Yang Z, Jin C, Wang Z (2019) Provably efficient exploration in policy optimization. Proc. 37th Conf. Machine Learn., PMLR 119:1283–1294.Google Scholar
  • Cen S, Wei Y, Chi Y (2021) Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization. arXiv preprint arXiv:2105.15186.Google Scholar
  • Dai B, Shaw A, Li L, Xiao L, He N, Liu Z, Chen J, Song L (2018) SBEED: Convergent reinforcement Learn. with nonlinear function approximation. Proc. 35th Internat. Conf. Machine Learn., PMLR, 80:1125–1134.Google Scholar
  • Duan Y, Chen X, Houthooft R, Schulman J, Abbeel P (2016) Benchmarking deep reinforcement Learn. for continuous control. Proc. 33rd Internat. Conf. Machine Learn., PMLR, 48:1329–1338.Google Scholar
  • Even-Dar E, Kakade SM, Mansour Y (2009) Online Markov decision processes. Math. Oper. Res. 34(3):726–736.LinkGoogle Scholar
  • Fazel M, Ge R, Kakade S, Mesbahi M (2018) Global convergence of policy gradient methods for the linear quadratic regulator. Proc. 35th Internat. Conf. Machine Learn., PMLR, 80:1467–1476.Google Scholar
  • Geist M, Scherrer B, Pietquin O (2019) A theory of regularized Markov decision processes. Internat. Conf. Machine Learn., 2160–2169.Google Scholar
  • Grill JB, Darwiche Domingues O, Menard P, Munos R, Valko M (2019) Planning in entropy-regularized markov decision processes and games. Advances in Neural Information Processing Systems 32:12404–12413.Google Scholar
  • Haarnoja T, Tang H, Abbeel P, Levine S (2017) Reinforcement Learn. with deep energy-based policies. Proc. 34th Internat. Conf. Machine Learn., PMLR, 70:1352–1361.Google Scholar
  • Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement Learn. with a stochastic actor. Proc. 35th Internat. Conf. Machine Learn., PMLR, 80:1861–1870.Google Scholar
  • Hazan E, Kakade S, Singh K, Van Soest A (2019) Provably efficient maximum entropy exploration. Proc. 36th Internat. Conf. Machine Learn., PMLR, 97:2681–2691.Google Scholar
  • Jansch-Porto JP, Hu B, Dullerud G (2020) Convergence guarantees of policy optimization methods for Markovian jump linear systems. arXiv preprint arXiv:2002.04090.Google Scholar
  • Kakade SM (2002) A natural policy gradient. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 1531–1538.Google Scholar
  • Kakade S, Langford J (2002) Approximately optimal approximate reinforcement Learn. Proc. 19th Internat. Conf. Machine Learn., 267–274.Google Scholar
  • Karimi B, Miasojedow B, Moulines É, Wai HT (2019) Non-asymptotic analysis of biased stochastic approximation scheme. Proc. Thirty-Second Conf. Learn. Theory, 99:1944–1974.Google Scholar
  • Konda VR, Tsitsiklis JN (2000) Actor-critic algorithms. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA) 1008–1014.Google Scholar
  • Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020a) Breaking the sample size barrier in model-based reinforcement Learn. with a generative model. 34th Conf. Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 12861–12872.Google Scholar
  • Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020b) Sample complexity of asynchronous Q-Learn.: Sharper analysis and variance reduction. Preprint, submitted Jun 4, https://arxiv.org/abs/2006.03041.Google Scholar
  • Li G, Wei Y, Chi Y, Gu Y, Chen Y (2021b) Softmax policy gradient methods can take exponential time to converge. Belkin M, Kpotufe S, eds. Proc. 34th Conf. Learning Theory (PMLR), 134:3107–3110.Google Scholar
  • Li G, Cai C, Chen Y, Gu Y, Wei Y, Chi Y (2021a) Is Q-Learn. minimax optimal? a tight sample complexity analysis. arXiv preprint arXiv:2102.06548.Google Scholar
  • Liu B, Cai Q, Yang Z, Wang Z (2019). Neural proximal trust region policy optimization attains globally optimal policy. 33rd Conf. Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada.Google Scholar
  • Mei J, Xiao C, Szepesvari C, Schuurmans D, (2020) On the global convergence rates of softmaz policy gradient methods. Proc. 37th Internat. Conf. Machine Learn., PMLR, 119:6820–6829.Google Scholar
  • Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016). Asynchronous methods for deep reinforcement Learn. Proc. 33rd Internat. Conf. Machine Learn., PMLR, 48:1928–1937.Google Scholar
  • Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, et al. (2015) Human-level control through deep reinforcement Learn. Nature 518(7540):529–533.CrossrefGoogle Scholar
  • Mohammadi H, Zare A, Soltanolkotabi M, Jovanović MR (2019). Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem. Preprint, submitted December 26, https://arxiv.org/abs/1912.11899.Google Scholar
  • Nachum O, Norouzi M, Xu K, Schuurmans D (2017). Bridging the gap between value and policy based reinforcement Learn.. Preprint, submitted November 22, https://arxiv.org/abs/1702.08892.Google Scholar
  • Nemirovsky AS, Yudin DB (1983) Problem complexity and method efficiency in optimization. (J. Wiley & Sons).Google Scholar
  • Nesterov Y (2009) Primal-dual subgradient methods for convex problems. Math. Programming 120(1):221–259.CrossrefGoogle Scholar
  • Neu G, Jonsson A, Gómez V (2017). A unified view of entropy-regularized Markov decision processes. Preprint, submitted May 22, https://arxiv.org/abs/1705.07798.Google Scholar
  • Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing 71(7-9):1180–1190.CrossrefGoogle Scholar
  • Peters J, Mulling K, Altun Y (2010) Relative entropy policy search. Proc. AAAI Conf. Artificial Intelligence, 24(1):1607–1612.Google Scholar
  • Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming. (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • Schulman J, Chen X, Abbeel P (2017a). Equivalence between policy gradients and soft Q-Learn. Preprint, submitted April 21, https://arxiv.org/abs/1704.06440.Google Scholar
  • Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. Proc. 32nd Conf. Machine Learn., PMLR, 37:1889–1897.Google Scholar
  • Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017b) Proximal policy optimization algorithms. Preprint, submitted July 20, https://arxiv.org/abs/1707.06347.Google Scholar
  • Shani L, Efroni Y, Mannor S (2019) Adaptive trust region policy optimization: global convergence and faster rates for regularized MDPs. Proc. AAAI Conf. Artificial Intelligence 34(4):5668–5675.Google Scholar
  • Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.Google Scholar
  • Sutton RS, McAllester DA, Singh SP, Mansour Y (2000) Policy gradient methods for reinforcement Learn. with function approximation. NIPS’99, 1057–1063.Google Scholar
  • Tu S, Recht B (2019) The gap between model-based and model-free methods on the linear quadratic regulator: an asymptotic viewpoint. Proc. Thirty-Second Conf. Learn. Theory, PMLR, 99:3036–3083.Google Scholar
  • Vieillard N, Kozuno T, Scherrer B, Pietquin O, Munos R, Geist M (2020) Leverage the average: an analysis of KL regularization in RL. Preprint, submitted March 31, https://arxiv.org/abs/2003.14089.Google Scholar
  • Wang L, Cai Q, Yang Z, Wang Z (2019) Neural policy gradient methods: Global optimality and rates of convergence. Preprint, submitted August 29, https://arxiv.org/abs/1909.01150.Google Scholar
  • Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement Learn. Machine Learn. 8(3-4):229–256.CrossrefGoogle Scholar
  • Williams RJ, Peng J (1991) Function optimization using connectionist reinforcement Learn. algorithms. Connect. Sci. 3(3):241–268.CrossrefGoogle Scholar
  • Wu Y, Zhang W, Xu P, Gu Q (2020) A finite time analysis of two time-scale actor critic methods. Preprint, submitted May 4, https://arxiv.org/abs/2005.01350.Google Scholar
  • Xiao C, Huang R, Mei J, Schuurmans D, Müller M (2019) Maximum entropy Monte-Carlo planning. Advances in Neural Information Processing Systems (Curran Associates, Inc., Red Hook, NY), 9520–9528.Google Scholar
  • Xu T, Wang Z, Liang Y (2020) Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms. Preprint, submitted May 7, https://arxiv.org/abs/2005.03557.Google Scholar
  • Zhang K, Hu B, Basar T (2019a) Policy optimization for ℋ2 linear control with ℋ∞ robustness guarantee: implicit regularization and global convergence. Proc. 2nd Conf. Learn. Dynam. Control, PMLR, 120:179–190.Google Scholar
  • Zhang K, Koppel A, Zhu H, Başar T (2019b). Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM J. Control Optim. 58(6):3586–3612.CrossrefGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.