Open Access

Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

Shicong Cen
Corresponding Author
Shicong Cen
[email protected]
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213;
Search for more papers by this author
,
Chen Cheng
Chen Cheng
[email protected]
Department of Statistics, Stanford University, Stanford, California 94305;
Search for more papers by this author
,
Yuxin Chen
Yuxin Chen
[email protected]
https://orcid.org/0000-0001-9256-5815
Department of Electrical and Computer Engineering, Princeton University, Princeton, New Jersey 08544;
Search for more papers by this author
,
Yuting Wei
Yuting Wei
[email protected]
https://orcid.org/0000-0002-3041-3434
Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104
Search for more papers by this author
,
Yuejie Chi
Yuejie Chi
[email protected]
https://orcid.org/0000-0002-6766-5459
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213;
Search for more papers by this author

Shicong Cen

Corresponding Author

Shicong Cen

[email protected]

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213;

Search for more papers by this author

Chen Cheng

[email protected]

Department of Statistics, Stanford University, Stanford, California 94305;

Search for more papers by this author

Yuxin Chen

[email protected]

https://orcid.org/0000-0001-9256-5815

Department of Electrical and Computer Engineering, Princeton University, Princeton, New Jersey 08544;

Search for more papers by this author

Yuting Wei

[email protected]

https://orcid.org/0000-0002-3041-3434

Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104

Search for more papers by this author

Yuejie Chi

[email protected]

https://orcid.org/0000-0002-6766-5459

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213;

Search for more papers by this author

Published Online:2 Dec 2021https://doi.org/10.1287/opre.2021.2151

References

Agarwal A, Jiang N, Kakade SM (2019) Reinforcement Learn.: Theory and algorithms. Technical report, University of Washington Seattle, Seattle, WA. Google Scholar
Agarwal A, Kakade S, Yang LF (2020a) Model-based reinforcement Learn. with a generative model is minimax optimal. Proc. 33rd Conf. Learn. Theory, 67–83.Google Scholar
Agarwal A, Kakade SM, Lee JD, Mahajan G (2020b) Optimality and approximation with policy gradient methods in Markov decision processes. Proc. 33rd Conf. Learn. Theory, 64–66.Google Scholar
Ahmed Z, Le Roux N, Norouzi M, Schuurmans D (2019) Understanding the impact of entropy on policy optimization. Proc. 36th Internat. Conf. Machine Learn., 151–160.Google Scholar
Amari SI (1998) Natural gradient works efficiently in Learn.. Neural Comput. 10(2):251–276.Crossref, Google Scholar
Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement Learn. with a generative model. Machine Learn. 91(3):325–349.Crossref, Google Scholar
Bellman R (1952) On the theory of dynamic programming. Proc. Natl. Acad. Sci. USA 38(8):716.Crossref, Google Scholar
Bertsekas DP (2017) Dynamic Programming and Optimal Control, 4th ed. (Athena Scientific, Belmont, MA).Google Scholar
Bhandari J, Russo D (2019) Global optimality guarantees for policy gradient methods. Preprint, submitted June 5, https://arxiv.org/abs/1906.01786.Google Scholar
Bhandari J, Russo D (2020) A note on the linear convergence of policy gradient methods. Preprint, submitted July 21, https://arxiv.org/abs/2007.11120.Google Scholar
Bhatnagar S, Sutton RS, Ghavamzadeh M, Lee M (2009) Natural actor-critic algorithms. Automatica J. IFAC. 45(11):2471–2482.Crossref, Google Scholar
Cai Q, Yang Z, Jin C, Wang Z (2019) Provably efficient exploration in policy optimization. Proc. 37th Conf. Machine Learn., PMLR 119:1283–1294.Google Scholar
Cen S, Wei Y, Chi Y (2021) Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization. arXiv preprint arXiv:2105.15186.Google Scholar
Dai B, Shaw A, Li L, Xiao L, He N, Liu Z, Chen J, Song L (2018) SBEED: Convergent reinforcement Learn. with nonlinear function approximation. Proc. 35th Internat. Conf. Machine Learn., PMLR, 80:1125–1134.Google Scholar
Duan Y, Chen X, Houthooft R, Schulman J, Abbeel P (2016) Benchmarking deep reinforcement Learn. for continuous control. Proc. 33rd Internat. Conf. Machine Learn., PMLR, 48:1329–1338.Google Scholar
Even-Dar E, Kakade SM, Mansour Y (2009) Online Markov decision processes. Math. Oper. Res. 34(3):726–736.Link, Google Scholar
Fazel M, Ge R, Kakade S, Mesbahi M (2018) Global convergence of policy gradient methods for the linear quadratic regulator. Proc. 35th Internat. Conf. Machine Learn., PMLR, 80:1467–1476.Google Scholar
Geist M, Scherrer B, Pietquin O (2019) A theory of regularized Markov decision processes. Internat. Conf. Machine Learn., 2160–2169.Google Scholar
Grill JB, Darwiche Domingues O, Menard P, Munos R, Valko M (2019) Planning in entropy-regularized markov decision processes and games. Advances in Neural Information Processing Systems 32:12404–12413.Google Scholar
Haarnoja T, Tang H, Abbeel P, Levine S (2017) Reinforcement Learn. with deep energy-based policies. Proc. 34th Internat. Conf. Machine Learn., PMLR, 70:1352–1361.Google Scholar
Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement Learn. with a stochastic actor. Proc. 35th Internat. Conf. Machine Learn., PMLR, 80:1861–1870.Google Scholar
Hazan E, Kakade S, Singh K, Van Soest A (2019) Provably efficient maximum entropy exploration. Proc. 36th Internat. Conf. Machine Learn., PMLR, 97:2681–2691.Google Scholar
Jansch-Porto JP, Hu B, Dullerud G (2020) Convergence guarantees of policy optimization methods for Markovian jump linear systems. arXiv preprint arXiv:2002.04090.Google Scholar
Kakade SM (2002) A natural policy gradient. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA), 1531–1538.Google Scholar
Kakade S, Langford J (2002) Approximately optimal approximate reinforcement Learn. Proc. 19th Internat. Conf. Machine Learn., 267–274.Google Scholar
Karimi B, Miasojedow B, Moulines É, Wai HT (2019) Non-asymptotic analysis of biased stochastic approximation scheme. Proc. Thirty-Second Conf. Learn. Theory, 99:1944–1974.Google Scholar
Konda VR, Tsitsiklis JN (2000) Actor-critic algorithms. Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA) 1008–1014.Google Scholar
Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020a) Breaking the sample size barrier in model-based reinforcement Learn. with a generative model. 34th Conf. Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 12861–12872.Google Scholar
Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020b) Sample complexity of asynchronous Q-Learn.: Sharper analysis and variance reduction. Preprint, submitted Jun 4, https://arxiv.org/abs/2006.03041.Google Scholar
Li G, Wei Y, Chi Y, Gu Y, Chen Y (2021b) Softmax policy gradient methods can take exponential time to converge. Belkin M, Kpotufe S, eds. Proc. 34th Conf. Learning Theory (PMLR), 134:3107–3110.Google Scholar
Li G, Cai C, Chen Y, Gu Y, Wei Y, Chi Y (2021a) Is Q-Learn. minimax optimal? a tight sample complexity analysis. arXiv preprint arXiv:2102.06548.Google Scholar
Liu B, Cai Q, Yang Z, Wang Z (2019). Neural proximal trust region policy optimization attains globally optimal policy. 33rd Conf. Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada.Google Scholar
Mei J, Xiao C, Szepesvari C, Schuurmans D, (2020) On the global convergence rates of softmaz policy gradient methods. Proc. 37th Internat. Conf. Machine Learn., PMLR, 119:6820–6829.Google Scholar
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016). Asynchronous methods for deep reinforcement Learn. Proc. 33rd Internat. Conf. Machine Learn., PMLR, 48:1928–1937.Google Scholar
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, et al. (2015) Human-level control through deep reinforcement Learn. Nature 518(7540):529–533.Crossref, Google Scholar
Mohammadi H, Zare A, Soltanolkotabi M, Jovanović MR (2019). Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem. Preprint, submitted December 26, https://arxiv.org/abs/1912.11899.Google Scholar
Nachum O, Norouzi M, Xu K, Schuurmans D (2017). Bridging the gap between value and policy based reinforcement Learn.. Preprint, submitted November 22, https://arxiv.org/abs/1702.08892.Google Scholar
Nemirovsky AS, Yudin DB (1983) Problem complexity and method efficiency in optimization. (J. Wiley & Sons).Google Scholar
Nesterov Y (2009) Primal-dual subgradient methods for convex problems. Math. Programming 120(1):221–259.Crossref, Google Scholar
Neu G, Jonsson A, Gómez V (2017). A unified view of entropy-regularized Markov decision processes. Preprint, submitted May 22, https://arxiv.org/abs/1705.07798.Google Scholar
Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing 71(7-9):1180–1190.Crossref, Google Scholar
Peters J, Mulling K, Altun Y (2010) Relative entropy policy search. Proc. AAAI Conf. Artificial Intelligence, 24(1):1607–1612.Google Scholar
Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming. (John Wiley & Sons, Hoboken, NJ).Google Scholar
Schulman J, Chen X, Abbeel P (2017a). Equivalence between policy gradients and soft Q-Learn. Preprint, submitted April 21, https://arxiv.org/abs/1704.06440.Google Scholar
Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. Proc. 32nd Conf. Machine Learn., PMLR, 37:1889–1897.Google Scholar
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017b) Proximal policy optimization algorithms. Preprint, submitted July 20, https://arxiv.org/abs/1707.06347.Google Scholar
Shani L, Efroni Y, Mannor S (2019) Adaptive trust region policy optimization: global convergence and faster rates for regularized MDPs. Proc. AAAI Conf. Artificial Intelligence 34(4):5668–5675.Google Scholar
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.Google Scholar
Sutton RS, McAllester DA, Singh SP, Mansour Y (2000) Policy gradient methods for reinforcement Learn. with function approximation. NIPS’99, 1057–1063.Google Scholar
Tu S, Recht B (2019) The gap between model-based and model-free methods on the linear quadratic regulator: an asymptotic viewpoint. Proc. Thirty-Second Conf. Learn. Theory, PMLR, 99:3036–3083.Google Scholar
Vieillard N, Kozuno T, Scherrer B, Pietquin O, Munos R, Geist M (2020) Leverage the average: an analysis of KL regularization in RL. Preprint, submitted March 31, https://arxiv.org/abs/2003.14089.Google Scholar
Wang L, Cai Q, Yang Z, Wang Z (2019) Neural policy gradient methods: Global optimality and rates of convergence. Preprint, submitted August 29, https://arxiv.org/abs/1909.01150.Google Scholar
Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement Learn. Machine Learn. 8(3-4):229–256.Crossref, Google Scholar
Williams RJ, Peng J (1991) Function optimization using connectionist reinforcement Learn. algorithms. Connect. Sci. 3(3):241–268.Crossref, Google Scholar
Wu Y, Zhang W, Xu P, Gu Q (2020) A finite time analysis of two time-scale actor critic methods. Preprint, submitted May 4, https://arxiv.org/abs/2005.01350.Google Scholar
Xiao C, Huang R, Mei J, Schuurmans D, Müller M (2019) Maximum entropy Monte-Carlo planning. Advances in Neural Information Processing Systems (Curran Associates, Inc., Red Hook, NY), 9520–9528.Google Scholar
Xu T, Wang Z, Liang Y (2020) Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms. Preprint, submitted May 7, https://arxiv.org/abs/2005.03557.Google Scholar
Zhang K, Hu B, Basar T (2019a) Policy optimization for ℋ2 linear control with ℋ∞ robustness guarantee: implicit regularization and global convergence. Proc. 2nd Conf. Learn. Dynam. Control, PMLR, 120:179–190.Google Scholar
Zhang K, Koppel A, Zhu H, Başar T (2019b). Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM J. Control Optim. 58(6):3586–3612.Crossref, Google Scholar

Volume 70, Issue 4

July-August 2022

Pages iii-vii, 1953-2596, C2-C3

Article Information

Supplemental Material

Metrics

Information

Received:August 05, 2020
Accepted:April 06, 2021
Published Online:December 02, 2021

Cite as

Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei, Yuejie Chi (2021) Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization. Operations Research 70(4):2563-2578.

https://doi.org/10.1287/opre.2021.2151

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

References

Volume 70, Issue 4

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News