The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation
References
- (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. 36th Internat. Conf. Machine Learning, ICML 2019 (International Machine Learning Society), 477–502.Google Scholar
- (1999) TDleaf(lambda): Combining temporal difference learning with game-tree search. Clinical Orthopaedics Related Res.Google Scholar
- (2011) Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 9(3):310–335.Crossref, Google Scholar
- (2019) Reinforcement Learning and Optimal Control (Athena Scientific, Belmont, MA).Google Scholar
- (2021) Lessons from alphazero for optimal, model predictive, and adaptive control. Working paper, MIT, Cambridge, MA.Google Scholar
- (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
- (2012) A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intelligence AI Games 4(1):1–43.Crossref, Google Scholar
- (2012) Least-squares methods for policy iteration. Wiering M, van Otterlo M, eds. Reinforcement Learning. Adaptation, Learning, and Optimization, vol. 12 (Springer, Berlin), 75–109.Google Scholar
- (2019) Generalization bounds of stochastic gradient descent for wide and deep neural networks. Adv. Neural Inform. Processing Systems 32:10836–10846.Google Scholar
- (2020) Value-based algorithms optimization with discounted multiple-step learning method in deep reinforcement learning. 2020 IEEE 22nd Internat. Conf. High Performance Comput. Comm. IEEE 18th Internat. Conf. Smart City IEEE 6th Internat. Conf. Data Sci. Systems (HPCC/SmartCity/DSS) (IEEE, Piscataway, NJ), 979–984.Google Scholar
- (2019) Gradient descent provably optimizes over-parameterized neural networks. Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
- (2020) Online planning with lookahead policies. Adv. Neural Inform. Processing Systems 33.Google Scholar
- (2018a) Beyond the one step greedy approach in reinforcement learning. Preprint, submitted July 30, https://arxiv.org/abs/1802.03654.Google Scholar
- (2018b) Multiple-step greedy policies in online and approximate reinforcement learning. NIPS’18: Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 5244–5253.Google Scholar
- (2019) How to combine tree-search methods in reinforcement learning. Thirty-Third AAAI Conf. Artificial Intelligence (AAAI-19) (AAAI Press, Palo Alto, CA), 3494–3501.Google Scholar
- (2019) Model-based lookahead reinforcement learning. Preprint, submitted August 15, https://arxiv.org/pdf/1908.06012.Google Scholar
- (2018) Neural tangent kernel: Convergence and generalization in neural networks. Preprint, submitted June 20, https://arxiv.org/abs/1806.07572.Google Scholar
- (2019) Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLu networks. Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
- (2006) Bandit based Monte-Carlo planning. Fürnkranz J, Scheffer T, Spiliopoulou M, eds. Machine Learning: ECML 2006, Lecture Notes in Computer Science, vol. 4212 (Springer, Berlin), 282–293.Crossref, Google Scholar
- (2001) Model-free least-squares policy iteration. Adv. Neural Inform. Processing Systems, vol. 14 (MIT Press, Cambridge, MA).Google Scholar
- (2003) Least-squares policy iteration. J. Machine Learn. Res. 4:1107–1149.Google Scholar
- (2014) Monte Carlo tree search with heuristic evaluations using implicit minimax backups. 2014 IEEE Conf. Comput. Intelligence Games (IEEE, Piscataway, NJ), 1–8.Google Scholar
- (2016) Asynchronous methods for deep reinforcement learning. Preprint, submitted June 16, https://arxiv.org/abs/1602.01783.Google Scholar
- (2020) A framework for reinforcement learning and planning. Working paper, AAAI Press, Palo Alto, CA.Google Scholar
- (2014) From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations Trends Machine Learn. 7(1):1–129.Crossref, Google Scholar
- (1978) Modified policy iteration algorithms for discounted Markov decision problems. Management Sci. 24(11):1127–1137.Link, Google Scholar
- (2020a) Non-asymptotic analysis of Monte Carlo tree search. ACM SIGMETRICS Performance Evaluation Review, vol. 48 (ACM, New York), 31–32.Google Scholar
- (2020b) On reinforcement learning for turn-based zero-sum Markov games. Preprint, submitted February 25, https://arxiv.org/abs/2002.10620.Google Scholar
- (2017a) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. Preprint, submitted December 5, https://arxiv.org/abs/1712.01815.Google Scholar
- (2017b) Mastering the game of go without human knowledge. Nature 550(7676):354–359.Crossref, Google Scholar
- (2020) Local search for policy iteration in continuous control. Preprint, submitted October 12, https://arxiv.org/abs/2010.05545.Google Scholar
- (2023) Monte Carlo tree search: A review of recent modifications and applications. Artificial Intelligence Rev. 56(3):2497–2562.Crossref, Google Scholar
- (2020) Multi-step greedy reinforcement learning algorithms. Internat. Conf. Machine Learning (PMLR, New York), 9504–9513.Google Scholar
- (1994) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1):59–94.Crossref, Google Scholar
- (2009) Bootstrapping from game tree search. Bengio Y, Schuurmans D, Lafferty J, Williams C, Culotta A, eds. Adv. Neural Inform. Processing Systems, vol. 22 (Curran Associates, Inc., Red Hook, NY), 1937–1945.Google Scholar
- (2022) Reinforcement learning with unbiased policy evaluation and linear function approximation. 2022 IEEE 61st Conf. Decision Control (CDC) (IEEE, Piscataway, NJ), 801–806.Google Scholar
- (2023) A new policy iteration algorithm for reinforcement learning in zero-sum Markov games. Preprint, submitted March 17, https://arxiv.org/abs/2303.09716.Google Scholar

