An Efficient Node Selection Policy for Monte Carlo Tree Search with Neural Networks

INFORMS Journal on Computing Meritorious Paper Awardee
Published Online:https://doi.org/10.1287/ijoc.2023.0307

References

  • Audibert JY, Bubeck S, Munos R (2010) Best arm identification in multi-armed bandits. Tauman Kalai A, Mohri M, eds. Proc. 23rd Annual Conf. Learn. Theory (COLT 2010), vol. 13 (OmniPress, Madison, WI), 41–53.Google Scholar
  • Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47:235–256.CrossrefGoogle Scholar
  • Bellman R (1954) The theory of dynamic programming. Bull. Amer. Math. Soc. 60(6):503–515.CrossrefGoogle Scholar
  • Bubeck S, Munos R, Stoltz G (2009) Pure exploration in multi-armed bandits problems. Gavaldà R, Lugosi G, Zeugmann T, Zilles S, eds. Algorithmic Learn. Theory ALT 2009, Lecture Notes in Computer Science, vol. 5809 (Springer, Berlin, Heidelberg), 23–37.CrossrefGoogle Scholar
  • Chang HS, Fu MC, Hu J, Marcus SI (2005) An adaptive sampling algorithm for solving Markov decision processes. Oper. Res. 53(1):126–139.LinkGoogle Scholar
  • Chen CH, Lin J, Yücesan E, Chick SE (2000) Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynam. Systems 10:251–270.CrossrefGoogle Scholar
  • Glynn P, Juneja S (2004) A large deviations perspective on ordinal optimization. Ingalls RG, Rossetti MD, Smith JS, Peter BA, eds. Proc. 2004 Winter Simulation Conf. (WSC), vol. 1 (IEEE, Piscataway, NJ), 577–585.Google Scholar
  • Kaufmann E, Koolen WM (2017) Monte-Carlo tree search by best arm identification. Adv. Neural Inform. Processing Systems 30:1–10.Google Scholar
  • Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. Preprint, submitted December 22, https://arxiv.org/abs/1412.6980.Google Scholar
  • Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. Fürnkranz J, Scheffer T, Spiliopoulou M, eds. Machine Learn. ECML 2006, Lecture Notes in Computer Science, vol. 4212 (Springer, Berlin, Heidelberg), 282–293.Google Scholar
  • Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, MA).CrossrefGoogle Scholar
  • Li Y, Fu MC, Xu J (2021) An optimal computing budget allocation tree policy for Monte Carlo tree search. IEEE Trans. Automatic Control 67(6):2685–2699.CrossrefGoogle Scholar
  • Li H, Lam H, Peng Y (2024) Efficient learning for clustering and optimizing context-dependent designs. Oper. Res. 72(2):617–638.LinkGoogle Scholar
  • Liu X, Lam H, Peng Y (2024a) Training deep Q-network via Monte Carlo tree search for adaptive bitrate control in video delivery. Preprint, submitted March 13, http://dx.doi.org/10.2139/ssrn.4757662.Google Scholar
  • Liu X, Peng Y, Zhang G, Zhou R (2024b) An efficient node selection policy for Monte Carlo tree search with neural networks. http://dx.doi.org/10.1287/ijoc.2023.0307.cd, https://github.com/INFORMSJoC/2023.0307.Google Scholar
  • Loeffler TD, Banik S, Patra TK, Sternberg M, Sankaranarayanan SK (2021) Reinforcement learning in discrete action space applied to inverse defect design. J. Phys. Comm. 5(3):031001.CrossrefGoogle Scholar
  • Mansley C, Weinstein A, Littman M (2011) Sample-based planning for continuous action Markov decision processes. Proc. Internat. Conf. Automated Planning Scheduling, vol. 21 (Association for the Advancement of Artificial Intelligence, Washington, DC), 335–338.Google Scholar
  • OpenAI Gym (2021) OpenAI gym cartpole-v0. Accessed September 29, 2021, https://github.com/openai/gym/wiki/CartPole-v0.Google Scholar
  • Peng Y, Zhang G (2022) Thompson sampling meets ranking and selection. Feng B, Pedrielli G, Peng Y, Shashaani S, Song E, Corlu CG, Lee LH, Chew EP, Roeder T, Lendermann P, eds. Proc. 2022 Winter Simulation Conf. (WSC) (IEEE, Piscataway, NJ), 3075–3086.Google Scholar
  • Peng Y, Chong EK, Chen CH, Fu MC (2018) Ranking and selection as stochastic control. IEEE Trans. Automatic Control 63(8):2359–2373.CrossrefGoogle Scholar
  • Russo D (2020) Simple Bayesian algorithms for best-arm identification. Oper. Res. 68(6):1625–1647.LinkGoogle Scholar
  • Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z (2018) A Tutorial on Thompson Sampling, Foundations and Trends in Machine Learning, vol. 11 (now Publishers, Hanover, MA), 1–96.CrossrefGoogle Scholar
  • Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, et al. (2020) Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588(7839):604–609.CrossrefGoogle Scholar
  • Shin D, Broadie M, Zeevi A (2018) Tractable sampling strategies for ordinal optimization. Oper. Res. 66(6):1693–1712.LinkGoogle Scholar
  • Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, et al. (2017) Mastering the game of Go without human knowledge. Nature 550(7676):354–359.CrossrefGoogle Scholar
  • Teraoka K, Hatano K, Takimoto E (2014) Efficient sampling method for Monte Carlo tree search problem. IEICE Trans. Inform. Systems 97(3):392–398.CrossrefGoogle Scholar
  • Tesauro G, Rajan V, Segal R (2012) Bayesian inference in Monte-Carlo tree search. Preprint, submitted March 15, https://arxiv.org/abs/1203.3519.Google Scholar
  • Yin ZH, Ye W, Chen Q, Gao Y (2022) Planning for sample efficient imitation learning. Adv. Neural Inform. Processing Systems 35:2577–2589.Google Scholar
  • Zhang G, Li H, Peng Y (2020) Sequential sampling for a ranking and selection problem with exponential sampling distributions. Bae K-H, Feng B, Kim S, Lazarova-Molnar S, Zheng Z, Roeder T, Thiesing R, eds. Proc. 2020 Winter Simulation Conf. (WSC) (IEEE, Piscataway, NJ), 2984–2995.Google Scholar
  • Zhang G, Liu X, Peng Y (2023a) A simulation optimization method for scheduling automated guided vehicles in a stochastic warehouse management system. Corlu CG, Hunter SR, Lam H, Onggo BS, Shortle J, Biller B, eds. Proc. 2023 Winter Simulation Conf. (WSC) (IEEE, Piscataway, NJ), 3238–3249.Google Scholar
  • Zhang G, Peng Y, Xu Y (2022) An efficient dynamic sampling policy for Monte Carlo tree search. Feng B, Pedrielli G, Peng Y, Shashaani S, Song E, Corlu CG, Lee LH, Chew EP, Roeder T, Lendermann P, eds. Proc. 2022 Winter Simulation Conf. (WSC) (IEEE, Piscataway, NJ) 2760–2771.Google Scholar
  • Zhang G, Chen S, Huang K, Peng Y (2024) Efficient learning for selecting top-m context-dependent designs. IEEE Trans. Automation Sci. Engrg., ePub ahead of print April 29, https://doi.org/10.1109/TASE.2024.3391020.Google Scholar
  • Zhang G, Chen B, Jia QS, Peng Y (2023b) Efficient sampling policy for selecting a subset with the best. IEEE Trans. Automatic Control 68(8):4904–4911.CrossrefGoogle Scholar
  • Zhang G, Peng Y, Zhang J, Zhou E (2023c) Asymptotically optimal sampling policy for selecting top-m alternatives. INFORMS J. Comput. 35(6):1261–1285.LinkGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.