The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation

Anna Winnicki
Corresponding Author
Anna Winnicki
[email protected]
https://orcid.org/0000-0001-9880-2340
Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, Illinois 61801; and Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Urbana, Illinois 61801
Search for more papers by this author
,
Joseph Lubars
Joseph Lubars
[email protected]
https://orcid.org/0000-0001-9273-8456
Sandia National Laboratories, Albuquerque, New Mexico 87123
Search for more papers by this author
,
Michael Livesay
Michael Livesay
[email protected]
https://orcid.org/0000-0002-2594-3772
Sandia National Laboratories, Albuquerque, New Mexico 87123
Search for more papers by this author
,
R. Srikant
R. Srikant
[email protected]
https://orcid.org/0000-0003-1483-5204
Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, Illinois 61801; and Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Urbana, Illinois 61801; and c3.ai Digital Transformation Institute, University of Illinois Urbana-Champaign, Urbana, Illinois 61801
Search for more papers by this author

Anna Winnicki

Corresponding Author

Anna Winnicki

[email protected]

https://orcid.org/0000-0001-9880-2340

Search for more papers by this author

Joseph Lubars

[email protected]

https://orcid.org/0000-0001-9273-8456

Sandia National Laboratories, Albuquerque, New Mexico 87123

Search for more papers by this author

Michael Livesay

[email protected]

https://orcid.org/0000-0002-2594-3772

Sandia National Laboratories, Albuquerque, New Mexico 87123

Search for more papers by this author

R. Srikant

[email protected]

https://orcid.org/0000-0003-1483-5204

Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, Illinois 61801; and Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Urbana, Illinois 61801; and c3.ai Digital Transformation Institute, University of Illinois Urbana-Champaign, Urbana, Illinois 61801

Search for more papers by this author

Published Online:30 May 2024https://doi.org/10.1287/opre.2022.0357

References

Arora S, Du S, Hu W, Li Z, Wang R (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. 36th Internat. Conf. Machine Learning, ICML 2019 (International Machine Learning Society), 477–502.Google Scholar
Baxter J, Tridgell A, Weaver L (1999) TDleaf(lambda): Combining temporal difference learning with game-tree search. Clinical Orthopaedics Related Res.Google Scholar
Bertsekas D (2011) Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 9(3):310–335.Crossref, Google Scholar
Bertsekas DP (2019) Reinforcement Learning and Optimal Control (Athena Scientific, Belmont, MA).Google Scholar
Bertsekas D (2021) Lessons from alphazero for optimal, model predictive, and adaptive control. Working paper, MIT, Cambridge, MA.Google Scholar
Bertsekas D, Tsitsiklis J (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
Browne C, Powley E, Whitehouse D, Lucas S, Cowling P, Rohlfshagen P, Tavener S, Perez Liebana D, Samothrakis S, Colton S (2012) A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intelligence AI Games 4(1):1–43.Crossref, Google Scholar
Buşoniu L, Lazaric A, Ghavamzadeh M, Munos R, Babuška R, Schutter BD (2012) Least-squares methods for policy iteration. Wiering M, van Otterlo M, eds. Reinforcement Learning. Adaptation, Learning, and Optimization, vol. 12 (Springer, Berlin), 75–109.Google Scholar
Cao Y, Gu Q (2019) Generalization bounds of stochastic gradient descent for wide and deep neural networks. Adv. Neural Inform. Processing Systems 32:10836–10846.Google Scholar
Deng H, Yin S, Deng X, Li S (2020) Value-based algorithms optimization with discounted multiple-step learning method in deep reinforcement learning. 2020 IEEE 22nd Internat. Conf. High Performance Comput. Comm. IEEE 18th Internat. Conf. Smart City IEEE 6th Internat. Conf. Data Sci. Systems (HPCC/SmartCity/DSS) (IEEE, Piscataway, NJ), 979–984.Google Scholar
Du SS, Zhai X, Poczos B, Singh A (2019) Gradient descent provably optimizes over-parameterized neural networks. Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
Efroni Y, Ghavamzadeh M, Mannor S (2020) Online planning with lookahead policies. Adv. Neural Inform. Processing Systems 33.Google Scholar
Efroni Y, Dalal G, Scherrer B, Mannor S (2018a) Beyond the one step greedy approach in reinforcement learning. Preprint, submitted July 30, https://arxiv.org/abs/1802.03654.Google Scholar
Efroni Y, Dalal G, Scherrer B, Mannor S (2018b) Multiple-step greedy policies in online and approximate reinforcement learning. NIPS’18: Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 5244–5253.Google Scholar
Efroni Y, Dalal G, Scherrer B, Mannor S (2019) How to combine tree-search methods in reinforcement learning. Thirty-Third AAAI Conf. Artificial Intelligence (AAAI-19) (AAAI Press, Palo Alto, CA), 3494–3501.Google Scholar
Hong ZW, Pajarinen J, Peters J (2019) Model-based lookahead reinforcement learning. Preprint, submitted August 15, https://arxiv.org/pdf/1908.06012.Google Scholar
Jacot A, Gabriel F, Hongler C (2018) Neural tangent kernel: Convergence and generalization in neural networks. Preprint, submitted June 20, https://arxiv.org/abs/1806.07572.Google Scholar
Ji Z, Telgarsky M (2019) Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLu networks. Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. Fürnkranz J, Scheffer T, Spiliopoulou M, eds. Machine Learning: ECML 2006, Lecture Notes in Computer Science, vol. 4212 (Springer, Berlin), 282–293.Crossref, Google Scholar
Lagoudakis MG, Parr R (2001) Model-free least-squares policy iteration. Adv. Neural Inform. Processing Systems, vol. 14 (MIT Press, Cambridge, MA).Google Scholar
Lagoudakis MG, Parr R (2003) Least-squares policy iteration. J. Machine Learn. Res. 4:1107–1149.Google Scholar
Lanctot M, Winands MHM, Pepels T, Sturtevant NR (2014) Monte Carlo tree search with heuristic evaluations using implicit minimax backups. 2014 IEEE Conf. Comput. Intelligence Games (IEEE, Piscataway, NJ), 1–8.Google Scholar
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap TP, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. Preprint, submitted June 16, https://arxiv.org/abs/1602.01783.Google Scholar
Moerland TM, Broekens J, Jonker CM (2020) A framework for reinforcement learning and planning. Working paper, AAAI Press, Palo Alto, CA.Google Scholar
Munos R (2014) From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations Trends Machine Learn. 7(1):1–129.Crossref, Google Scholar
Puterman M, Shin MC (1978) Modified policy iteration algorithms for discounted Markov decision problems. Management Sci. 24(11):1127–1137.Link, Google Scholar
Shah D, Xie Q, Xu Z (2020a) Non-asymptotic analysis of Monte Carlo tree search. ACM SIGMETRICS Performance Evaluation Review, vol. 48 (ACM, New York), 31–32.Google Scholar
Shah D, Somani V, Xie Q, Xu Z (2020b) On reinforcement learning for turn-based zero-sum Markov games. Preprint, submitted February 25, https://arxiv.org/abs/2002.10620.Google Scholar
Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, et al. (2017a) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. Preprint, submitted December 5, https://arxiv.org/abs/1712.01815.Google Scholar
Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, et al. (2017b) Mastering the game of go without human knowledge. Nature 550(7676):354–359.Crossref, Google Scholar
Springenberg JT, Heess N, Mankowitz D, Merel J, Byravan A, Abdolmaleki A, Kay J, et al. (2020) Local search for policy iteration in continuous control. Preprint, submitted October 12, https://arxiv.org/abs/2010.05545.Google Scholar
Świechowski M, Godlewski K, Sawicki B, Mańdziuk J (2023) Monte Carlo tree search: A review of recent modifications and applications. Artificial Intelligence Rev. 56(3):2497–2562.Crossref, Google Scholar
Tomar M, Efroni Y, Ghavamzadeh M (2020) Multi-step greedy reinforcement learning algorithms. Internat. Conf. Machine Learning (PMLR, New York), 9504–9513.Google Scholar
Tsitsiklis JN, Roy BV (1994) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1):59–94.Crossref, Google Scholar
Veness J, Silver D, Blair A, Uther W (2009) Bootstrapping from game tree search. Bengio Y, Schuurmans D, Lafferty J, Williams C, Culotta A, eds. Adv. Neural Inform. Processing Systems, vol. 22 (Curran Associates, Inc., Red Hook, NY), 1937–1945.Google Scholar
Winnicki A, Srikant R (2022) Reinforcement learning with unbiased policy evaluation and linear function approximation. 2022 IEEE 61st Conf. Decision Control (CDC) (IEEE, Piscataway, NJ), 801–806.Google Scholar
Winnicki A, Srikant R (2023) A new policy iteration algorithm for reinforcement learning in zero-sum Markov games. Preprint, submitted March 17, https://arxiv.org/abs/2303.09716.Google Scholar

Volume 73, Issue 1

January-February 2025

Pages iii-vii, 1-582, C2-C3

Article Information

Metrics

Information

Received:July 12, 2022
Accepted:February 21, 2024
Published Online:May 30, 2024

Cite as

Anna Winnicki; , Joseph Lubars; , Michael Livesay; , R. Srikant (2024) The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation. Operations Research 73(1):139-156.

https://doi.org/10.1287/opre.2022.0357

Keywords

Acknowledgments

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration [Contract DE-NA0003525]. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the U.S. Government.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation

References

Volume 73, Issue 1

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News