Approximation Benefits of Policy Gradient Methods with Aggregated States
References
- (2016) Near optimal behavior via approximate state abstraction. Internat. Conf. Machine Learn. (JMLR), 2915–2923.Google Scholar
- (2019a) Reinforcement learning: Theory and algorithms. Technical report. Accessed April 21, 2023, https://rltheorybook.github.io.Google Scholar
- (2020) PC-PG: Policy cover directed exploration for provable policy gradient learning. Adv. Neural Inform. Processing Systems 33:13399–13412.Google Scholar
- (2019b) Optimality and approximation with policy gradient methods in Markov decision processes. Preprint, submitted August 1, https://arxiv.org/abs/1908.00261.Google Scholar
- (2008) Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learn. 71(1):89–129.Crossref, Google Scholar
- (1987) Aggregation in dynamic programming. Oper. Res. 35(2):215–220.Link, Google Scholar
- (2017) First-Order Methods in Optimization (SIAM-Society for Industrial and Applied Mathematics, Philadelphia).Crossref, Google Scholar
- (1989) Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Trans. Automatic Control 34(6):589–598.Crossref, Google Scholar
- (1995) Dynamic Programming and Optimal Control, vol. 2 (Athena Scientific, Belmont, MA).Google Scholar
- (1997) Nonlinear programming. J. Oper. Res. Soc. 48(3):334.Crossref, Google Scholar
- (2011) Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 9(3):310–335.Crossref, Google Scholar
- (1996) Temporal differences-based policy iteration and applications in neuro-dynamic programming. Laboratory for Information and Decision Systems Report LIDS-P-2349, MIT, Cambridge, MA.Google Scholar
- (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
- (2019) Global optimality guarantees for policy gradient methods. Preprint, submitted June 5, https://arxiv.org/abs/1906.01786.Google Scholar
- (2021) On the linear convergence of policy gradient methods for finite MDPs. Internat. Conf. Artificial Intelligence Statist. (PMLR), 2386–2394.Google Scholar
- (2021) Fast global convergence of natural policy gradient methods with entropy regularization. Oper. Res. 70(4):2563–2578.Link, Google Scholar
- (2019) Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM J. Optim. 29(3):1908–1930.Crossref, Google Scholar
- (1997) Model minimization in Markov decision processes. Proc. AAAI Conf. Artificial Intelligence (AAAI, Washington, DC), 106–111.Google Scholar
- (2014) SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Adv. Neural Inform. Processing Systems 27:1646–1654.Google Scholar
- (2019) Provably efficient reinforcement learning with aggregated states. Preprint, submitted December 13, https://arxiv.org/abs/1912.06366.Google Scholar
- (2019) State aggregation learning from Markov transition data. Adv. Neural Inform. Processing Systems 33:4486–4495.Google Scholar
- (2008) Efficient projections onto the ℓ1-ball for learning in high dimensions. Internat. Conf. Machine Learn. (JMLR), 272–279.Google Scholar
- (1956) An algorithm for quadratic programming. Naval Res. Logist. Quart. 3(1–2):95–110.Crossref, Google Scholar
- (2012) A unifying perspective of parametric policy search methods for Markov decision processes. Adv. Neural Inform. Processing Systems 25:2717–2725.Google Scholar
- (2013) Approximate dynamic programming finally performs well in the game of Tetris. Adv. Neural Inform. Processing Systems 26:1754–1762.Google Scholar
- (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4):2341–2368.Crossref, Google Scholar
- (2016) Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Programming 156(1–2):59–99.Crossref, Google Scholar
- (1995) Stable function approximation in dynamic programming. Machine Learn. Proc. (Morgan Kaufmann, San Francisco), 261–268.Crossref, Google Scholar
- (2013) Revisiting Frank-Wolfe: Projection-free sparse convex optimization. Internat. Conf. Machine Learn. (PMLR), 427–435.Google Scholar
- (2015) Abstraction selection in model-based reinforcement learning. Internat. Conf. Machine Learn. (PMLR), 179–188.Google Scholar
- (2002) Approximately optimal approximate reinforcement learning. ICML, vol. 2, 267–274.Google Scholar
- (2002) A natural policy gradient. Adv. Neural Inform. Processing Systems 14:1531–1538.Google Scholar
- (2021) On the linear convergence of natural policy gradient algorithm. 60th IEEE Conf. Decision Control (IEEE, Piscataway, NJ), 3794–3799.Google Scholar
- (2000) Actor-critic algorithms. Adv. Neural Inform. Processing Systems 1008–1014.Google Scholar
- (2012) Finite-sample analysis of least-squares policy iteration. J. Machine Learn. Res. 13:3041–3074.Google Scholar
- (2006) Toward a unified theory of state abstraction for MDPs. Proc. Ninth Intern. Sympos. Artificial Intelligence and Math. (ISAIM), 1–10.Google Scholar
- (2013) Tuning approximate dynamic programming policies for ambulance redeployment via direct search. Stochastic Systems 3(2):322–361.Link, Google Scholar
- (2020) On the global convergence rates of softmax policy gradient methods. Internat. Conf. Machine Learn. (PMLR), 6820–6829.Google Scholar
- (2020) Kinematic state abstraction and provably efficient rich-observation reinforcement learning. Internat. Conf. Machine Learn. (PMLR), 6961–6971.Google Scholar
- (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533.Crossref, Google Scholar
- (2003) Error bounds for approximate policy iteration. ICML, vol. 3, 560–567.Google Scholar
- (2017) Combining policy gradient and q-learning. Fifth Internat. Conf. Learn. Representations (Appleton, WI), 1–13.Google Scholar
- (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, New York).Google Scholar
- (2016a) Stochastic Frank-Wolfe methods for nonconvex optimization. 54th Annual Allerton Conf. Comm. Control. Comput. (IEEE, Piscataway, NJ), 1244–1251.Google Scholar
- (2016b) Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Adv. Neural Inform. Processing Systems 29:1145–1153.Google Scholar
- (2016c) Stochastic variance reduction for nonconvex optimization. Internat. Conf. Machine Learn. (JMLR), 314–323.Google Scholar
- (1996) Numerical dynamic programming in economics. Amman H, Kendrik D, Rust J, eds. Handbook of Computational Economics, vol. 1 (Amsterdam), 619–729.Google Scholar
- (1997) Using randomization to break the curse of dimensionality. Econometrica 65(3):487–516.Crossref, Google Scholar
- (2014) Approximate policy iteration schemes: A comparison. Internat. Conf. Machine Learn. (JMLR), 1314–1322.Google Scholar
- (2014) Local policy search in a convex space and conservative policy iteration as boosted policy search. Joint Eur. Conf. Machine Learn. Knowledge Discovery Databases (Springer Berlin, Heidelberg), 35–50.Crossref, Google Scholar
- (2012) On the use of non-stationary policies for stationary infinite-horizon Markov decision processes. Adv. Neural Inform. Processing Systems 25:1826–1834.Google Scholar
- (2017a) Equivalence between policy gradients and soft q-learning. Preprint, submitted April 21, https://arxiv.org/abs/1704.06440.Google Scholar
- (2015) Trust region policy optimization. Internat. Conf. Machine Learn. (JMLR), 1889–1897.Google Scholar
- (2017b) Proximal policy optimization algorithms. Preprint, submitted July 20, https://arxiv.org/abs/1707.06347.Google Scholar
- (2014) Least squares policy iteration with instrumental variables vs. direct policy search: Comparison against optimal benchmarks using energy storage. Preprint, submitted January 4, https://arxiv.org/abs/1401.0843.Google Scholar
- (2020) Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. Proc. Conf. AAAI Artificial Intelligence, vol. 34, 5668–5675.Google Scholar
- (1995) Reinforcement learning with soft state aggregation. Adv. Neural Inform. Processing Systems 7:361–368.Google Scholar
- (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
- (2000) Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inform. Processing Systems 12:1057–1063.Google Scholar
- (2006) Learning Tetris using the noisy cross-entropy method. Neural Comput. 18(12):2936–2941.Crossref, Google Scholar
- (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1–3):59–94.Crossref, Google Scholar
- (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.Link, Google Scholar
- (2019) On connections between constrained optimization and reinforcement learning. Preprint, submitted October 18, https://arxiv.org/abs/1910.08476.Google Scholar
- (1978) Approximations of dynamic programs, I. Math. Oper. Res. 3(3):231–243.Link, Google Scholar
- (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learn. 8(3):229–256.Crossref, Google Scholar
- (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4):2057–2075.Crossref, Google Scholar
- (2021) Sample efficient reinforcement learning with reinforce. Proc. Conf. AAAI Artificial Intelligence, vol. 35, 10887–10895.Google Scholar

