An Efficient Node Selection Policy for Monte Carlo Tree Search with Neural Networks

INFORMS Journal on Computing Meritorious Paper Awardee

Xiaotian Liu
Xiaotian Liu
[email protected]
https://orcid.org/0009-0009-7018-8030
PKU-Wuhan Institute for Artificial Intelligence, Guanghua School of Management, Peking University, Beijing 100871, China; and Xiangjiang Laboratory, Changsha 410000, China
Search for more papers by this author
,
Yijie Peng
Corresponding Author
Yijie Peng
[email protected]
https://orcid.org/0000-0003-2584-8131
PKU-Wuhan Institute for Artificial Intelligence, Guanghua School of Management, Peking University, Beijing 100871, China; and Xiangjiang Laboratory, Changsha 410000, China
Search for more papers by this author
,
Gongbo Zhang
Corresponding Author
Gongbo Zhang
[email protected]
https://orcid.org/0000-0002-7517-7666
PKU-Wuhan Institute for Artificial Intelligence, Guanghua School of Management, Peking University, Beijing 100871, China; and Xiangjiang Laboratory, Changsha 410000, China
Search for more papers by this author
,
Ruihan Zhou
Ruihan Zhou
[email protected]
https://orcid.org/0009-0001-1498-7702
PKU-Wuhan Institute for Artificial Intelligence, Guanghua School of Management, Peking University, Beijing 100871, China; and Xiangjiang Laboratory, Changsha 410000, China
Search for more papers by this author

PKU-Wuhan Institute for Artificial Intelligence, Guanghua School of Management, Peking University, Beijing 100871, China; and Xiangjiang Laboratory, Changsha 410000, China

Search for more papers by this author

Yijie Peng

Corresponding Author

Yijie Peng

[email protected]

https://orcid.org/0000-0003-2584-8131

PKU-Wuhan Institute for Artificial Intelligence, Guanghua School of Management, Peking University, Beijing 100871, China; and Xiangjiang Laboratory, Changsha 410000, China

Search for more papers by this author

Gongbo Zhang

Corresponding Author

Gongbo Zhang

[email protected]

https://orcid.org/0000-0002-7517-7666

PKU-Wuhan Institute for Artificial Intelligence, Guanghua School of Management, Peking University, Beijing 100871, China; and Xiangjiang Laboratory, Changsha 410000, China

Search for more papers by this author

Ruihan Zhou

[email protected]

https://orcid.org/0009-0001-1498-7702

PKU-Wuhan Institute for Artificial Intelligence, Guanghua School of Management, Peking University, Beijing 100871, China; and Xiangjiang Laboratory, Changsha 410000, China

Search for more papers by this author

Published Online:18 Sep 2024https://doi.org/10.1287/ijoc.2023.0307

References

Audibert JY, Bubeck S, Munos R (2010) Best arm identification in multi-armed bandits. Tauman Kalai A, Mohri M, eds. Proc. 23rd Annual Conf. Learn. Theory (COLT 2010), vol. 13 (OmniPress, Madison, WI), 41–53.Google Scholar
Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47:235–256.Crossref, Google Scholar
Bellman R (1954) The theory of dynamic programming. Bull. Amer. Math. Soc. 60(6):503–515.Crossref, Google Scholar
Bubeck S, Munos R, Stoltz G (2009) Pure exploration in multi-armed bandits problems. Gavaldà R, Lugosi G, Zeugmann T, Zilles S, eds. Algorithmic Learn. Theory ALT 2009, Lecture Notes in Computer Science, vol. 5809 (Springer, Berlin, Heidelberg), 23–37.Crossref, Google Scholar
Chang HS, Fu MC, Hu J, Marcus SI (2005) An adaptive sampling algorithm for solving Markov decision processes. Oper. Res. 53(1):126–139.Link, Google Scholar
Chen CH, Lin J, Yücesan E, Chick SE (2000) Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynam. Systems 10:251–270.Crossref, Google Scholar
Glynn P, Juneja S (2004) A large deviations perspective on ordinal optimization. Ingalls RG, Rossetti MD, Smith JS, Peter BA, eds. Proc. 2004 Winter Simulation Conf. (WSC), vol. 1 (IEEE, Piscataway, NJ), 577–585.Google Scholar
Kaufmann E, Koolen WM (2017) Monte-Carlo tree search by best arm identification. Adv. Neural Inform. Processing Systems 30:1–10.Google Scholar
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. Preprint, submitted December 22, https://arxiv.org/abs/1412.6980.Google Scholar
Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. Fürnkranz J, Scheffer T, Spiliopoulou M, eds. Machine Learn. ECML 2006, Lecture Notes in Computer Science, vol. 4212 (Springer, Berlin, Heidelberg), 282–293.Google Scholar
Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, MA).Crossref, Google Scholar
Li Y, Fu MC, Xu J (2021) An optimal computing budget allocation tree policy for Monte Carlo tree search. IEEE Trans. Automatic Control 67(6):2685–2699.Crossref, Google Scholar
Li H, Lam H, Peng Y (2024) Efficient learning for clustering and optimizing context-dependent designs. Oper. Res. 72(2):617–638.Link, Google Scholar
Liu X, Lam H, Peng Y (2024a) Training deep Q-network via Monte Carlo tree search for adaptive bitrate control in video delivery. Preprint, submitted March 13, http://dx.doi.org/10.2139/ssrn.4757662.Google Scholar
Liu X, Peng Y, Zhang G, Zhou R (2024b) An efficient node selection policy for Monte Carlo tree search with neural networks. http://dx.doi.org/10.1287/ijoc.2023.0307.cd, https://github.com/INFORMSJoC/2023.0307.Google Scholar
Loeffler TD, Banik S, Patra TK, Sternberg M, Sankaranarayanan SK (2021) Reinforcement learning in discrete action space applied to inverse defect design. J. Phys. Comm. 5(3):031001.Crossref, Google Scholar
Mansley C, Weinstein A, Littman M (2011) Sample-based planning for continuous action Markov decision processes. Proc. Internat. Conf. Automated Planning Scheduling, vol. 21 (Association for the Advancement of Artificial Intelligence, Washington, DC), 335–338.Google Scholar
OpenAI Gym (2021) OpenAI gym cartpole-v0. Accessed September 29, 2021, https://github.com/openai/gym/wiki/CartPole-v0.Google Scholar
Peng Y, Zhang G (2022) Thompson sampling meets ranking and selection. Feng B, Pedrielli G, Peng Y, Shashaani S, Song E, Corlu CG, Lee LH, Chew EP, Roeder T, Lendermann P, eds. Proc. 2022 Winter Simulation Conf. (WSC) (IEEE, Piscataway, NJ), 3075–3086.Google Scholar
Peng Y, Chong EK, Chen CH, Fu MC (2018) Ranking and selection as stochastic control. IEEE Trans. Automatic Control 63(8):2359–2373.Crossref, Google Scholar
Russo D (2020) Simple Bayesian algorithms for best-arm identification. Oper. Res. 68(6):1625–1647.Link, Google Scholar
Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z (2018) A Tutorial on Thompson Sampling, Foundations and Trends in Machine Learning, vol. 11 (now Publishers, Hanover, MA), 1–96.Crossref, Google Scholar
Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, et al. (2020) Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588(7839):604–609.Crossref, Google Scholar
Shin D, Broadie M, Zeevi A (2018) Tractable sampling strategies for ordinal optimization. Oper. Res. 66(6):1693–1712.Link, Google Scholar
Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, et al. (2017) Mastering the game of Go without human knowledge. Nature 550(7676):354–359.Crossref, Google Scholar
Teraoka K, Hatano K, Takimoto E (2014) Efficient sampling method for Monte Carlo tree search problem. IEICE Trans. Inform. Systems 97(3):392–398.Crossref, Google Scholar
Tesauro G, Rajan V, Segal R (2012) Bayesian inference in Monte-Carlo tree search. Preprint, submitted March 15, https://arxiv.org/abs/1203.3519.Google Scholar
Yin ZH, Ye W, Chen Q, Gao Y (2022) Planning for sample efficient imitation learning. Adv. Neural Inform. Processing Systems 35:2577–2589.Google Scholar
Zhang G, Li H, Peng Y (2020) Sequential sampling for a ranking and selection problem with exponential sampling distributions. Bae K-H, Feng B, Kim S, Lazarova-Molnar S, Zheng Z, Roeder T, Thiesing R, eds. Proc. 2020 Winter Simulation Conf. (WSC) (IEEE, Piscataway, NJ), 2984–2995.Google Scholar
Zhang G, Liu X, Peng Y (2023a) A simulation optimization method for scheduling automated guided vehicles in a stochastic warehouse management system. Corlu CG, Hunter SR, Lam H, Onggo BS, Shortle J, Biller B, eds. Proc. 2023 Winter Simulation Conf. (WSC) (IEEE, Piscataway, NJ), 3238–3249.Google Scholar
Zhang G, Peng Y, Xu Y (2022) An efficient dynamic sampling policy for Monte Carlo tree search. Feng B, Pedrielli G, Peng Y, Shashaani S, Song E, Corlu CG, Lee LH, Chew EP, Roeder T, Lendermann P, eds. Proc. 2022 Winter Simulation Conf. (WSC) (IEEE, Piscataway, NJ) 2760–2771.Google Scholar
Zhang G, Chen S, Huang K, Peng Y (2024) Efficient learning for selecting top-m context-dependent designs. IEEE Trans. Automation Sci. Engrg., ePub ahead of print April 29, https://doi.org/10.1109/TASE.2024.3391020.Google Scholar
Zhang G, Chen B, Jia QS, Peng Y (2023b) Efficient sampling policy for selecting a subset with the best. IEEE Trans. Automatic Control 68(8):4904–4911.Crossref, Google Scholar
Zhang G, Peng Y, Zhang J, Zhou E (2023c) Asymptotically optimal sampling policy for selecting top-m alternatives. INFORMS J. Comput. 35(6):1261–1285.Link, Google Scholar

cover image INFORMS Journal on Computing

Volume 37, Issue 4

July-August 2025

Pages iv-viii, 785-1141, iii

Article Information

Supplemental Material

Metrics

Information

Received:August 29, 2023
Accepted:August 08, 2024
Published Online:September 18, 2024

Cite as

Xiaotian Liu, Yijie Peng, Gongbo Zhang, Ruihan Zhou (2024) An Efficient Node Selection Policy for Monte Carlo Tree Search with Neural Networks. INFORMS Journal on Computing 37(4):785-807.

https://doi.org/10.1287/ijoc.2023.0307

Keywords

Acknowledgments

An earlier version of this paper was presented in part at the Proceedings of 2022 Winter Simulation Conference (2022 WSC) (Zhang et al. 2022).

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

An Efficient Node Selection Policy for Monte Carlo Tree Search with Neural Networks

References

Volume 37, Issue 4

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News