The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation

Published Online:https://doi.org/10.1287/opre.2022.0357

References

  • Arora S, Du S, Hu W, Li Z, Wang R (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. 36th Internat. Conf. Machine Learning, ICML 2019 (International Machine Learning Society), 477–502.Google Scholar
  • Baxter J, Tridgell A, Weaver L (1999) TDleaf(lambda): Combining temporal difference learning with game-tree search. Clinical Orthopaedics Related Res.Google Scholar
  • Bertsekas D (2011) Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 9(3):310–335.CrossrefGoogle Scholar
  • Bertsekas DP (2019) Reinforcement Learning and Optimal Control (Athena Scientific, Belmont, MA).Google Scholar
  • Bertsekas D (2021) Lessons from alphazero for optimal, model predictive, and adaptive control. Working paper, MIT, Cambridge, MA.Google Scholar
  • Bertsekas D, Tsitsiklis J (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
  • Browne C, Powley E, Whitehouse D, Lucas S, Cowling P, Rohlfshagen P, Tavener S, Perez Liebana D, Samothrakis S, Colton S (2012) A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intelligence AI Games 4(1):1–43.CrossrefGoogle Scholar
  • Buşoniu L, Lazaric A, Ghavamzadeh M, Munos R, Babuška R, Schutter BD (2012) Least-squares methods for policy iteration. Wiering M, van Otterlo M, eds. Reinforcement Learning. Adaptation, Learning, and Optimization, vol. 12 (Springer, Berlin), 75–109.Google Scholar
  • Cao Y, Gu Q (2019) Generalization bounds of stochastic gradient descent for wide and deep neural networks. Adv. Neural Inform. Processing Systems 32:10836–10846.Google Scholar
  • Deng H, Yin S, Deng X, Li S (2020) Value-based algorithms optimization with discounted multiple-step learning method in deep reinforcement learning. 2020 IEEE 22nd Internat. Conf. High Performance Comput. Comm. IEEE 18th Internat. Conf. Smart City IEEE 6th Internat. Conf. Data Sci. Systems (HPCC/SmartCity/DSS) (IEEE, Piscataway, NJ), 979–984.Google Scholar
  • Du SS, Zhai X, Poczos B, Singh A (2019) Gradient descent provably optimizes over-parameterized neural networks. Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
  • Efroni Y, Ghavamzadeh M, Mannor S (2020) Online planning with lookahead policies. Adv. Neural Inform. Processing Systems 33.Google Scholar
  • Efroni Y, Dalal G, Scherrer B, Mannor S (2018a) Beyond the one step greedy approach in reinforcement learning. Preprint, submitted July 30, https://arxiv.org/abs/1802.03654.Google Scholar
  • Efroni Y, Dalal G, Scherrer B, Mannor S (2018b) Multiple-step greedy policies in online and approximate reinforcement learning. NIPS’18: Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 5244–5253.Google Scholar
  • Efroni Y, Dalal G, Scherrer B, Mannor S (2019) How to combine tree-search methods in reinforcement learning. Thirty-Third AAAI Conf. Artificial Intelligence (AAAI-19) (AAAI Press, Palo Alto, CA), 3494–3501.Google Scholar
  • Hong ZW, Pajarinen J, Peters J (2019) Model-based lookahead reinforcement learning. Preprint, submitted August 15, https://arxiv.org/pdf/1908.06012.Google Scholar
  • Jacot A, Gabriel F, Hongler C (2018) Neural tangent kernel: Convergence and generalization in neural networks. Preprint, submitted June 20, https://arxiv.org/abs/1806.07572.Google Scholar
  • Ji Z, Telgarsky M (2019) Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLu networks. Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
  • Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. Fürnkranz J, Scheffer T, Spiliopoulou M, eds. Machine Learning: ECML 2006, Lecture Notes in Computer Science, vol. 4212 (Springer, Berlin), 282–293.CrossrefGoogle Scholar
  • Lagoudakis MG, Parr R (2001) Model-free least-squares policy iteration. Adv. Neural Inform. Processing Systems, vol. 14 (MIT Press, Cambridge, MA).Google Scholar
  • Lagoudakis MG, Parr R (2003) Least-squares policy iteration. J. Machine Learn. Res. 4:1107–1149.Google Scholar
  • Lanctot M, Winands MHM, Pepels T, Sturtevant NR (2014) Monte Carlo tree search with heuristic evaluations using implicit minimax backups. 2014 IEEE Conf. Comput. Intelligence Games (IEEE, Piscataway, NJ), 1–8.Google Scholar
  • Mnih V, Badia AP, Mirza M, Graves A, Lillicrap TP, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. Preprint, submitted June 16, https://arxiv.org/abs/1602.01783.Google Scholar
  • Moerland TM, Broekens J, Jonker CM (2020) A framework for reinforcement learning and planning. Working paper, AAAI Press, Palo Alto, CA.Google Scholar
  • Munos R (2014) From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations Trends Machine Learn. 7(1):1–129.CrossrefGoogle Scholar
  • Puterman M, Shin MC (1978) Modified policy iteration algorithms for discounted Markov decision problems. Management Sci. 24(11):1127–1137.LinkGoogle Scholar
  • Shah D, Xie Q, Xu Z (2020a) Non-asymptotic analysis of Monte Carlo tree search. ACM SIGMETRICS Performance Evaluation Review, vol. 48 (ACM, New York), 31–32.Google Scholar
  • Shah D, Somani V, Xie Q, Xu Z (2020b) On reinforcement learning for turn-based zero-sum Markov games. Preprint, submitted February 25, https://arxiv.org/abs/2002.10620.Google Scholar
  • Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, et al. (2017a) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. Preprint, submitted December 5, https://arxiv.org/abs/1712.01815.Google Scholar
  • Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, et al. (2017b) Mastering the game of go without human knowledge. Nature 550(7676):354–359.CrossrefGoogle Scholar
  • Springenberg JT, Heess N, Mankowitz D, Merel J, Byravan A, Abdolmaleki A, Kay J, et al. (2020) Local search for policy iteration in continuous control. Preprint, submitted October 12, https://arxiv.org/abs/2010.05545.Google Scholar
  • Świechowski M, Godlewski K, Sawicki B, Mańdziuk J (2023) Monte Carlo tree search: A review of recent modifications and applications. Artificial Intelligence Rev. 56(3):2497–2562.CrossrefGoogle Scholar
  • Tomar M, Efroni Y, Ghavamzadeh M (2020) Multi-step greedy reinforcement learning algorithms. Internat. Conf. Machine Learning (PMLR, New York), 9504–9513.Google Scholar
  • Tsitsiklis JN, Roy BV (1994) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1):59–94.CrossrefGoogle Scholar
  • Veness J, Silver D, Blair A, Uther W (2009) Bootstrapping from game tree search. Bengio Y, Schuurmans D, Lafferty J, Williams C, Culotta A, eds. Adv. Neural Inform. Processing Systems, vol. 22 (Curran Associates, Inc., Red Hook, NY), 1937–1945.Google Scholar
  • Winnicki A, Srikant R (2022) Reinforcement learning with unbiased policy evaluation and linear function approximation. 2022 IEEE 61st Conf. Decision Control (CDC) (IEEE, Piscataway, NJ), 801–806.Google Scholar
  • Winnicki A, Srikant R (2023) A new policy iteration algorithm for reinforcement learning in zero-sum Markov games. Preprint, submitted March 17, https://arxiv.org/abs/2303.09716.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.