Approximation Benefits of Policy Gradient Methods with Aggregated States

Published Online:https://doi.org/10.1287/mnsc.2023.4788

References

  • Abel D, Hershkowitz DE, Littman ML (2016) Near optimal behavior via approximate state abstraction. Internat. Conf. Machine Learn. (JMLR), 2915–2923.Google Scholar
  • Agarwal A, Jiang N, Kakade SM (2019a) Reinforcement learning: Theory and algorithms. Technical report. Accessed April 21, 2023, https://rltheorybook.github.io.Google Scholar
  • Agarwal A, Henaff M, Kakade S, Sun W (2020) PC-PG: Policy cover directed exploration for provable policy gradient learning. Adv. Neural Inform. Processing Systems 33:13399–13412.Google Scholar
  • Agarwal A, Kakade SM, Lee JD, Mahajan G (2019b) Optimality and approximation with policy gradient methods in Markov decision processes. Preprint, submitted August 1, https://arxiv.org/abs/1908.00261.Google Scholar
  • Antos A, Szepesvári C, Munos R (2008) Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learn. 71(1):89–129.CrossrefGoogle Scholar
  • Bean JC, Birge JR, Smith RL (1987) Aggregation in dynamic programming. Oper. Res. 35(2):215–220.LinkGoogle Scholar
  • Beck A (2017) First-Order Methods in Optimization (SIAM-Society for Industrial and Applied Mathematics, Philadelphia).CrossrefGoogle Scholar
  • Bertsekas D, Castanon D (1989) Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Trans. Automatic Control 34(6):589–598.CrossrefGoogle Scholar
  • Bertsekas DP (1995) Dynamic Programming and Optimal Control, vol. 2 (Athena Scientific, Belmont, MA).Google Scholar
  • Bertsekas DP (1997) Nonlinear programming. J. Oper. Res. Soc. 48(3):334.CrossrefGoogle Scholar
  • Bertsekas DP (2011) Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 9(3):310–335.CrossrefGoogle Scholar
  • Bertsekas DP, Ioffe S (1996) Temporal differences-based policy iteration and applications in neuro-dynamic programming. Laboratory for Information and Decision Systems Report LIDS-P-2349, MIT, Cambridge, MA.Google Scholar
  • Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
  • Bhandari J, Russo D (2019) Global optimality guarantees for policy gradient methods. Preprint, submitted June 5, https://arxiv.org/abs/1906.01786.Google Scholar
  • Bhandari J, Russo D (2021) On the linear convergence of policy gradient methods for finite MDPs. Internat. Conf. Artificial Intelligence Statist. (PMLR), 2386–2394.Google Scholar
  • Cen S, Cheng C, Chen Y, Wei Y, Chi Y (2021) Fast global convergence of natural policy gradient methods with entropy regularization. Oper. Res. 70(4):2563–2578.LinkGoogle Scholar
  • Davis D, Grimmer B (2019) Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM J. Optim. 29(3):1908–1930.CrossrefGoogle Scholar
  • Dean T, Givan R (1997) Model minimization in Markov decision processes. Proc. AAAI Conf. Artificial Intelligence (AAAI, Washington, DC), 106–111.Google Scholar
  • Defazio A, Bach F, Lacoste-Julien S (2014) SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Adv. Neural Inform. Processing Systems 27:1646–1654.Google Scholar
  • Dong S, Van Roy B, Zhou Z (2019) Provably efficient reinforcement learning with aggregated states. Preprint, submitted December 13, https://arxiv.org/abs/1912.06366.Google Scholar
  • Duan Y, Ke T, Wang M (2019) State aggregation learning from Markov transition data. Adv. Neural Inform. Processing Systems 33:4486–4495.Google Scholar
  • Duchi J, Shalev-Shwartz S, Singer Y, Chandra T (2008) Efficient projections onto the ℓ1-ball for learning in high dimensions. Internat. Conf. Machine Learn. (JMLR), 272–279.Google Scholar
  • Frank M, Wolfe P (1956) An algorithm for quadratic programming. Naval Res. Logist. Quart. 3(1–2):95–110.CrossrefGoogle Scholar
  • Furmston T, Barber D (2012) A unifying perspective of parametric policy search methods for Markov decision processes. Adv. Neural Inform. Processing Systems 25:2717–2725.Google Scholar
  • Gabillon V, Ghavamzadeh M, Scherrer B (2013) Approximate dynamic programming finally performs well in the game of Tetris. Adv. Neural Inform. Processing Systems 26:1754–1762.Google Scholar
  • Ghadimi S, Lan G (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4):2341–2368.CrossrefGoogle Scholar
  • Ghadimi S, Lan G (2016) Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Programming 156(1–2):59–99.CrossrefGoogle Scholar
  • Gordon GJ (1995) Stable function approximation in dynamic programming. Machine Learn. Proc. (Morgan Kaufmann, San Francisco), 261–268.CrossrefGoogle Scholar
  • Jaggi M (2013) Revisiting Frank-Wolfe: Projection-free sparse convex optimization. Internat. Conf. Machine Learn. (PMLR), 427–435.Google Scholar
  • Jiang N, Kulesza A, Singh S (2015) Abstraction selection in model-based reinforcement learning. Internat. Conf. Machine Learn. (PMLR), 179–188.Google Scholar
  • Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. ICML, vol. 2, 267–274.Google Scholar
  • Kakade SM (2002) A natural policy gradient. Adv. Neural Inform. Processing Systems 14:1531–1538.Google Scholar
  • Khodadadian S, Jhunjhunwala PR, Varma SM, Maguluri ST (2021) On the linear convergence of natural policy gradient algorithm. 60th IEEE Conf. Decision Control (IEEE, Piscataway, NJ), 3794–3799.Google Scholar
  • Konda VR, Tsitsiklis JN (2000) Actor-critic algorithms. Adv. Neural Inform. Processing Systems 1008–1014.Google Scholar
  • Lazaric A, Ghavamzadeh M, Munos R (2012) Finite-sample analysis of least-squares policy iteration. J. Machine Learn. Res. 13:3041–3074.Google Scholar
  • Li L, Walsh TJ, Littman ML (2006) Toward a unified theory of state abstraction for MDPs. Proc. Ninth Intern. Sympos. Artificial Intelligence and Math. (ISAIM), 1–10.Google Scholar
  • Maxwell MS, Henderson SG, Topaloglu H (2013) Tuning approximate dynamic programming policies for ambulance redeployment via direct search. Stochastic Systems 3(2):322–361.LinkGoogle Scholar
  • Mei J, Xiao C, Szepesvari C, Schuurmans D (2020) On the global convergence rates of softmax policy gradient methods. Internat. Conf. Machine Learn. (PMLR), 6820–6829.Google Scholar
  • Misra D, Henaff M, Krishnamurthy A, Langford J (2020) Kinematic state abstraction and provably efficient rich-observation reinforcement learning. Internat. Conf. Machine Learn. (PMLR), 6961–6971.Google Scholar
  • Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, et al. (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533.CrossrefGoogle Scholar
  • Munos R (2003) Error bounds for approximate policy iteration. ICML, vol. 3, 560–567.Google Scholar
  • O’Donoghue B, Munos R, Kavukcuoglu K, Mnih V (2017) Combining policy gradient and q-learning. Fifth Internat. Conf. Learn. Representations (Appleton, WI), 1–13.Google Scholar
  • Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, New York).Google Scholar
  • Reddi SJ, Sra S, Póczos B, Smola A (2016a) Stochastic Frank-Wolfe methods for nonconvex optimization. 54th Annual Allerton Conf. Comm. Control. Comput. (IEEE, Piscataway, NJ), 1244–1251.Google Scholar
  • Reddi SJ, Sra S, Poczos B, Smola AJ (2016b) Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Adv. Neural Inform. Processing Systems 29:1145–1153.Google Scholar
  • Reddi SJ, Hefny A, Sra S, Póczos B, Smola A (2016c) Stochastic variance reduction for nonconvex optimization. Internat. Conf. Machine Learn. (JMLR), 314–323.Google Scholar
  • Rust J (1996) Numerical dynamic programming in economics. Amman H, Kendrik D, Rust J, eds. Handbook of Computational Economics, vol. 1 (Amsterdam), 619–729.Google Scholar
  • Rust J (1997) Using randomization to break the curse of dimensionality. Econometrica 65(3):487–516.CrossrefGoogle Scholar
  • Scherrer B (2014) Approximate policy iteration schemes: A comparison. Internat. Conf. Machine Learn. (JMLR), 1314–1322.Google Scholar
  • Scherrer B, Geist M (2014) Local policy search in a convex space and conservative policy iteration as boosted policy search. Joint Eur. Conf. Machine Learn. Knowledge Discovery Databases (Springer Berlin, Heidelberg), 35–50.CrossrefGoogle Scholar
  • Scherrer B, Lesner B (2012) On the use of non-stationary policies for stationary infinite-horizon Markov decision processes. Adv. Neural Inform. Processing Systems 25:1826–1834.Google Scholar
  • Schulman J, Chen X, Abbeel P (2017a) Equivalence between policy gradients and soft q-learning. Preprint, submitted April 21, https://arxiv.org/abs/1704.06440.Google Scholar
  • Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. Internat. Conf. Machine Learn. (JMLR), 1889–1897.Google Scholar
  • Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017b) Proximal policy optimization algorithms. Preprint, submitted July 20, https://arxiv.org/abs/1707.06347.Google Scholar
  • Scott WR, Powell WB, Moazehi S (2014) Least squares policy iteration with instrumental variables vs. direct policy search: Comparison against optimal benchmarks using energy storage. Preprint, submitted January 4, https://arxiv.org/abs/1401.0843.Google Scholar
  • Shani L, Efroni Y, Mannor S (2020) Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. Proc. Conf. AAAI Artificial Intelligence, vol. 34, 5668–5675.Google Scholar
  • Singh SP, Jaakkola T, Jordan MI (1995) Reinforcement learning with soft state aggregation. Adv. Neural Inform. Processing Systems 7:361–368.Google Scholar
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • Sutton RS, McAllester DA, Singh SP, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inform. Processing Systems 12:1057–1063.Google Scholar
  • Szita I, Lörincz A (2006) Learning Tetris using the noisy cross-entropy method. Neural Comput. 18(12):2936–2941.CrossrefGoogle Scholar
  • Tsitsiklis JN, Van Roy B (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1–3):59–94.CrossrefGoogle Scholar
  • Van Roy B (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.LinkGoogle Scholar
  • Vieillard N, Pietquin O, Geist M (2019) On connections between constrained optimization and reinforcement learning. Preprint, submitted October 18, https://arxiv.org/abs/1910.08476.Google Scholar
  • Whitt W (1978) Approximations of dynamic programs, I. Math. Oper. Res. 3(3):231–243.LinkGoogle Scholar
  • Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learn. 8(3):229–256.CrossrefGoogle Scholar
  • Xiao L, Zhang T (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4):2057–2075.CrossrefGoogle Scholar
  • Zhang J, Kim J, O’Donoghue B, Boyd S (2021) Sample efficient reinforcement learning with reinforce. Proc. Conf. AAAI Artificial Intelligence, vol. 35, 10887–10895.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.