Approximation Benefits of Policy Gradient Methods with Aggregated States

Daniel Russo
Daniel Russo
[email protected]
https://orcid.org/0000-0001-5926-8624
Graduate School of Business, Columbia University, New York, New York 10027
Search for more papers by this author

Graduate School of Business, Columbia University, New York, New York 10027

Search for more papers by this author

Published Online:28 Jun 2023https://doi.org/10.1287/mnsc.2023.4788

References

Abel D, Hershkowitz DE, Littman ML (2016) Near optimal behavior via approximate state abstraction. Internat. Conf. Machine Learn. (JMLR), 2915–2923.Google Scholar
Agarwal A, Jiang N, Kakade SM (2019a) Reinforcement learning: Theory and algorithms. Technical report. Accessed April 21, 2023, https://rltheorybook.github.io.Google Scholar
Agarwal A, Henaff M, Kakade S, Sun W (2020) PC-PG: Policy cover directed exploration for provable policy gradient learning. Adv. Neural Inform. Processing Systems 33:13399–13412.Google Scholar
Agarwal A, Kakade SM, Lee JD, Mahajan G (2019b) Optimality and approximation with policy gradient methods in Markov decision processes. Preprint, submitted August 1, https://arxiv.org/abs/1908.00261.Google Scholar
Antos A, Szepesvári C, Munos R (2008) Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learn. 71(1):89–129.Crossref, Google Scholar
Bean JC, Birge JR, Smith RL (1987) Aggregation in dynamic programming. Oper. Res. 35(2):215–220.Link, Google Scholar
Beck A (2017) First-Order Methods in Optimization (SIAM-Society for Industrial and Applied Mathematics, Philadelphia).Crossref, Google Scholar
Bertsekas D, Castanon D (1989) Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Trans. Automatic Control 34(6):589–598.Crossref, Google Scholar
Bertsekas DP (1995) Dynamic Programming and Optimal Control, vol. 2 (Athena Scientific, Belmont, MA).Google Scholar
Bertsekas DP (1997) Nonlinear programming. J. Oper. Res. Soc. 48(3):334.Crossref, Google Scholar
Bertsekas DP (2011) Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 9(3):310–335.Crossref, Google Scholar
Bertsekas DP, Ioffe S (1996) Temporal differences-based policy iteration and applications in neuro-dynamic programming. Laboratory for Information and Decision Systems Report LIDS-P-2349, MIT, Cambridge, MA.Google Scholar
Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
Bhandari J, Russo D (2019) Global optimality guarantees for policy gradient methods. Preprint, submitted June 5, https://arxiv.org/abs/1906.01786.Google Scholar
Bhandari J, Russo D (2021) On the linear convergence of policy gradient methods for finite MDPs. Internat. Conf. Artificial Intelligence Statist. (PMLR), 2386–2394.Google Scholar
Cen S, Cheng C, Chen Y, Wei Y, Chi Y (2021) Fast global convergence of natural policy gradient methods with entropy regularization. Oper. Res. 70(4):2563–2578.Link, Google Scholar
Davis D, Grimmer B (2019) Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM J. Optim. 29(3):1908–1930.Crossref, Google Scholar
Dean T, Givan R (1997) Model minimization in Markov decision processes. Proc. AAAI Conf. Artificial Intelligence (AAAI, Washington, DC), 106–111.Google Scholar
Defazio A, Bach F, Lacoste-Julien S (2014) SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Adv. Neural Inform. Processing Systems 27:1646–1654.Google Scholar
Dong S, Van Roy B, Zhou Z (2019) Provably efficient reinforcement learning with aggregated states. Preprint, submitted December 13, https://arxiv.org/abs/1912.06366.Google Scholar
Duan Y, Ke T, Wang M (2019) State aggregation learning from Markov transition data. Adv. Neural Inform. Processing Systems 33:4486–4495.Google Scholar
Duchi J, Shalev-Shwartz S, Singer Y, Chandra T (2008) Efficient projections onto the ℓ1-ball for learning in high dimensions. Internat. Conf. Machine Learn. (JMLR), 272–279.Google Scholar
Frank M, Wolfe P (1956) An algorithm for quadratic programming. Naval Res. Logist. Quart. 3(1–2):95–110.Crossref, Google Scholar
Furmston T, Barber D (2012) A unifying perspective of parametric policy search methods for Markov decision processes. Adv. Neural Inform. Processing Systems 25:2717–2725.Google Scholar
Gabillon V, Ghavamzadeh M, Scherrer B (2013) Approximate dynamic programming finally performs well in the game of Tetris. Adv. Neural Inform. Processing Systems 26:1754–1762.Google Scholar
Ghadimi S, Lan G (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4):2341–2368.Crossref, Google Scholar
Ghadimi S, Lan G (2016) Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Programming 156(1–2):59–99.Crossref, Google Scholar
Gordon GJ (1995) Stable function approximation in dynamic programming. Machine Learn. Proc. (Morgan Kaufmann, San Francisco), 261–268.Crossref, Google Scholar
Jaggi M (2013) Revisiting Frank-Wolfe: Projection-free sparse convex optimization. Internat. Conf. Machine Learn. (PMLR), 427–435.Google Scholar
Jiang N, Kulesza A, Singh S (2015) Abstraction selection in model-based reinforcement learning. Internat. Conf. Machine Learn. (PMLR), 179–188.Google Scholar
Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. ICML, vol. 2, 267–274.Google Scholar
Kakade SM (2002) A natural policy gradient. Adv. Neural Inform. Processing Systems 14:1531–1538.Google Scholar
Khodadadian S, Jhunjhunwala PR, Varma SM, Maguluri ST (2021) On the linear convergence of natural policy gradient algorithm. 60th IEEE Conf. Decision Control (IEEE, Piscataway, NJ), 3794–3799.Google Scholar
Konda VR, Tsitsiklis JN (2000) Actor-critic algorithms. Adv. Neural Inform. Processing Systems 1008–1014.Google Scholar
Lazaric A, Ghavamzadeh M, Munos R (2012) Finite-sample analysis of least-squares policy iteration. J. Machine Learn. Res. 13:3041–3074.Google Scholar
Li L, Walsh TJ, Littman ML (2006) Toward a unified theory of state abstraction for MDPs. Proc. Ninth Intern. Sympos. Artificial Intelligence and Math. (ISAIM), 1–10.Google Scholar
Maxwell MS, Henderson SG, Topaloglu H (2013) Tuning approximate dynamic programming policies for ambulance redeployment via direct search. Stochastic Systems 3(2):322–361.Link, Google Scholar
Mei J, Xiao C, Szepesvari C, Schuurmans D (2020) On the global convergence rates of softmax policy gradient methods. Internat. Conf. Machine Learn. (PMLR), 6820–6829.Google Scholar
Misra D, Henaff M, Krishnamurthy A, Langford J (2020) Kinematic state abstraction and provably efficient rich-observation reinforcement learning. Internat. Conf. Machine Learn. (PMLR), 6961–6971.Google Scholar
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, et al. (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533.Crossref, Google Scholar
Munos R (2003) Error bounds for approximate policy iteration. ICML, vol. 3, 560–567.Google Scholar
O’Donoghue B, Munos R, Kavukcuoglu K, Mnih V (2017) Combining policy gradient and q-learning. Fifth Internat. Conf. Learn. Representations (Appleton, WI), 1–13.Google Scholar
Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, New York).Google Scholar
Reddi SJ, Sra S, Póczos B, Smola A (2016a) Stochastic Frank-Wolfe methods for nonconvex optimization. 54th Annual Allerton Conf. Comm. Control. Comput. (IEEE, Piscataway, NJ), 1244–1251.Google Scholar
Reddi SJ, Sra S, Poczos B, Smola AJ (2016b) Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Adv. Neural Inform. Processing Systems 29:1145–1153.Google Scholar
Reddi SJ, Hefny A, Sra S, Póczos B, Smola A (2016c) Stochastic variance reduction for nonconvex optimization. Internat. Conf. Machine Learn. (JMLR), 314–323.Google Scholar
Rust J (1996) Numerical dynamic programming in economics. Amman H, Kendrik D, Rust J, eds. Handbook of Computational Economics, vol. 1 (Amsterdam), 619–729.Google Scholar
Rust J (1997) Using randomization to break the curse of dimensionality. Econometrica 65(3):487–516.Crossref, Google Scholar
Scherrer B (2014) Approximate policy iteration schemes: A comparison. Internat. Conf. Machine Learn. (JMLR), 1314–1322.Google Scholar
Scherrer B, Geist M (2014) Local policy search in a convex space and conservative policy iteration as boosted policy search. Joint Eur. Conf. Machine Learn. Knowledge Discovery Databases (Springer Berlin, Heidelberg), 35–50.Crossref, Google Scholar
Scherrer B, Lesner B (2012) On the use of non-stationary policies for stationary infinite-horizon Markov decision processes. Adv. Neural Inform. Processing Systems 25:1826–1834.Google Scholar
Schulman J, Chen X, Abbeel P (2017a) Equivalence between policy gradients and soft q-learning. Preprint, submitted April 21, https://arxiv.org/abs/1704.06440.Google Scholar
Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. Internat. Conf. Machine Learn. (JMLR), 1889–1897.Google Scholar
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017b) Proximal policy optimization algorithms. Preprint, submitted July 20, https://arxiv.org/abs/1707.06347.Google Scholar
Scott WR, Powell WB, Moazehi S (2014) Least squares policy iteration with instrumental variables vs. direct policy search: Comparison against optimal benchmarks using energy storage. Preprint, submitted January 4, https://arxiv.org/abs/1401.0843.Google Scholar
Shani L, Efroni Y, Mannor S (2020) Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. Proc. Conf. AAAI Artificial Intelligence, vol. 34, 5668–5675.Google Scholar
Singh SP, Jaakkola T, Jordan MI (1995) Reinforcement learning with soft state aggregation. Adv. Neural Inform. Processing Systems 7:361–368.Google Scholar
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
Sutton RS, McAllester DA, Singh SP, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inform. Processing Systems 12:1057–1063.Google Scholar
Szita I, Lörincz A (2006) Learning Tetris using the noisy cross-entropy method. Neural Comput. 18(12):2936–2941.Crossref, Google Scholar
Tsitsiklis JN, Van Roy B (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1–3):59–94.Crossref, Google Scholar
Van Roy B (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.Link, Google Scholar
Vieillard N, Pietquin O, Geist M (2019) On connections between constrained optimization and reinforcement learning. Preprint, submitted October 18, https://arxiv.org/abs/1910.08476.Google Scholar
Whitt W (1978) Approximations of dynamic programs, I. Math. Oper. Res. 3(3):231–243.Link, Google Scholar
Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learn. 8(3):229–256.Crossref, Google Scholar
Xiao L, Zhang T (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4):2057–2075.Crossref, Google Scholar
Zhang J, Kim J, O’Donoghue B, Boyd S (2021) Sample efficient reinforcement learning with reinforce. Proc. Conf. AAAI Artificial Intelligence, vol. 35, 10887–10895.Google Scholar

Volume 69, Issue 11

November 2023

Pages 6417-7150, iii-iv

Article Information

Supplemental Material

Metrics

Information

Received:January 07, 2021
Accepted:August 23, 2022
Published Online:June 28, 2023

Cite as

Daniel Russo (2023) Approximation Benefits of Policy Gradient Methods with Aggregated States. Management Science 69(11):6898-6911.

https://doi.org/10.1287/mnsc.2023.4788

Keywords

Acknowledgments

The author thanks Jalaj Bhandari for an earlier research collaboration that shaped the author’s thinking around policy gradient methods.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Approximation Benefits of Policy Gradient Methods with Aggregated States

References

Volume 69, Issue 11

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News