Markov Decision Processes, AlphaGo, and Monte Carlo Tree Search: Back to the Future

    Published Online:https://doi.org/10.1287/educ.2017.0166

    Abstract

    In 2016, a computer Go-playing program called AlphaGo stunned the (human) world by winning a match (4 games to 1) against the reigning human world champion, a feat more impressive than previous victories by computer programs in chess (Deep Blue) and the TV game show Jeopardy! (Watson). The main engine behind AlphaGo combines machine learning approaches in the form of deep neural networks with a technique called Monte Carlo tree search, whose roots can be traced back to an adaptive multistage sampling simulation-based algorithm for Markov decision processes (MDPs) published in Operations Research in 2005 [H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus. An adaptive sampling algorithm for solving Markov decision processes. Operations Research 53(1):126–139, 2005] (and introduced even earlier in 2002). This tutorial describes AlphaGo and the simulation-based MDP algorithm, as well as providing contextual and historical background material for both, and uses simple examples to illustrate the main ideas behind Monte Carlo tree search.

    Your Access Options

    Download PDF
    INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.