Bayesian Exploration for Approximate Dynamic Programming

Ilya O. Ryzhov
Corresponding Author
Ilya O. Ryzhov
http://orcid.org/0000-0002-4191-084X
Robert H. Smith School of Business, University of Maryland, College Park, Maryland 20742;Institute for Systems Research, A. James Clark School of Engineering, University of Maryland, College Park, Maryland 20742;
Search for more papers by this author
,
Martijn R. K. Mes
Martijn R. K. Mes
Industrial Engineering and Business Information Systems, University of Twente, 7500 AE Enschede, Netherlands;
Search for more papers by this author
,
Warren B. Powell
Warren B. Powell
Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08540
Search for more papers by this author
,
Gerald van den Berg
Gerald van den Berg
Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08540
Search for more papers by this author

Corresponding Author

Ilya O. Ryzhov

Robert H. Smith School of Business, University of Maryland, College Park, Maryland 20742;Institute for Systems Research, A. James Clark School of Engineering, University of Maryland, College Park, Maryland 20742;

Search for more papers by this author

Martijn R. K. Mes

Industrial Engineering and Business Information Systems, University of Twente, 7500 AE Enschede, Netherlands;

Search for more papers by this author

Warren B. Powell

Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08540

Search for more papers by this author

Gerald van den Berg

Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08540

Search for more papers by this author

Published Online:18 Jan 2019https://doi.org/10.1287/opre.2018.1772

References

Arlotto A, Gans N, Steele JM (2014) Markov decision problems where means bound variances. Oper. Res. 62(4):864–875.Link, Google Scholar
Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2/3):235–256.Crossref, Google Scholar
Baird LC (1995) Residual algorithms: Reinforcement learning with function approximation. Prieditis A, Russell SJ, eds. Proc. 12th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers Inc., San Francisco, CA), 30–37.Crossref, Google Scholar
Bellman R (1957) Dynamic Programming (Princeton University Press, Princeton, NJ).Google Scholar
Bellman R, Kalaba R (1959) On adaptive control processes. IRE Trans. Automatic Control 4(2):1–9.Crossref, Google Scholar
Bertsekas DP (2012) Dynamic Programming and Optimal Control: Approximate Dynamic Programming, vol. 2, 4th ed. (Athena Scientific, Belmont, MA).Google Scholar
Bertsekas DP, Tsitsiklis J (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
Brafman RI, Tennenholtz M (2003) R-max—A general polynomial time algorithm for near-optimal reinforcement learning. J. Machine Learn. Res. 3(October):213–231.Google Scholar
Brezzi M, Lai T (2000) Incomplete learning from endogenous data in dynamic allocation. Econometrica 68(6):1511–1516.Crossref, Google Scholar
Brown DB, Smith JE (2013) Optimal sequential exploration: Bandits, clairvoyants, and wildcats. Oper. Res. 61(3):644–665.Link, Google Scholar
Busoniu L, Babuska R, De Schutter B, Ernst D (2010) Reinforcement Learning and Dynamic Programming Using Function Approximators (CRC Press, Boca Raton, FL).Crossref, Google Scholar
Castanon DA (1997) Approximate dynamic programming for sensor management. Proc. 36th IEEE Conf. Decision Control, vol. 2 (IEEE, Piscataway, NJ), 1202–1207.Crossref, Google Scholar
Cervellera C, Chen VCP, Wen A (2006) Optimization of a large-scale water reservoir network by stochastic dynamic programming with efficient state space discretization. Eur. J. Oper. Res. 171(3):1139–1151.Crossref, Google Scholar
Chick SE (2006) Subjective probability and Bayesian methodology. Henderson SG, Nelson BL, eds. Handbooks of Operations Research and Management Science: Simulation, vol. 13 (North-Holland Publishing, Amsterdam), 225–258.Google Scholar
Dearden R, Friedman N, Russell S (1998) Bayesian Q-learning. Mostow J, Rich C, eds. Proc. 15th Natl. Conf. Artificial Intelligence (AAAI Press, Menlo Park, CA), 761–768.Google Scholar
Deisenroth MP, Rasmussen CE, Peters J (2009) Gaussian process dynamic programming. Neurocomput. 72(7):1508–1524.Crossref, Google Scholar
Duff M, Barto A (1996) Local bandit approximation for optimal learning problems. Mozer M, Jordan M, Pesche T, eds. Advances in Neural Information Processing Systems, vol. 9 (MIT Press, Cambridge, MA), 1019–1025.Google Scholar
Engel Y, Mannor S, Meir R (2003) Bayes meets Bellman: The Gaussian process approach to temporal difference learning. Fawcett T, Mishra N, eds. Proc. 20th Internat. Conf. Machine Learn. (AAAI Press, Menlo Park, CA), 154–161.Google Scholar
Engel Y, Mannor S, Meir R (2005) Reinforcement learning with Gaussian processes. De Raedt L, Wrobel S, eds. Proc. 22nd Internat. Conf. Machine Learn. (ACM Press, New York, NY), 208–215.Crossref, Google Scholar
Frazier PI, Powell WB, Dayanik S (2009) The knowledge-gradient policy for correlated normal rewards. INFORMS J. Comput. 21(4):599–613.Link, Google Scholar
George A, Powell WB (2006) Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learn. 65(1):167–198.Crossref, Google Scholar
George A, Powell WB, Kulkarni SR (2008) Value function approximation using multiple aggregation for multiattribute resource management. J. Machine Learn. Res. 9:2079–2111.Google Scholar
Gittins JC, Glazebrook KD, Weber R (2011) Multi-Armed Bandit Allocation Indices, 2nd ed. (John Wiley & Sons, Hoboken, NJ).Crossref, Google Scholar
He D, Chick S, Chen CH (2007) Opportunity cost and OCBA selection procedures in ordinal optimization for a fixed number of alternative systems. IEEE Trans. Systems Man Cybernetics 37(5):951–961.Crossref, Google Scholar
He M, Zhao L, Powell WB (2012) Approximate dynamic programming algorithms for optimal dosage decisions in controlled ovarian hyperstimulation. Eur. J. Oper. Res. 222(2):328–340.Crossref, Google Scholar
Hong LJ, Nelson BL (2009) A brief introduction to optimization via simulation. Rosetti M, Hill R, Johansson B, Dunkin A, Ingalls R, eds. Proc. 2009 Winter Simulation Conf. (IEEE, Piscataway, NJ), 75–85.Crossref, Google Scholar
Jaakkola T, Jordan M, Singh S (1994) Convergence of stochastic iterative dynamic programming algorithms. Cowan J, Tesauro G, Alspector J, eds. Advances in Neural Information Processing Systems, vol. 6 (Morgan Kaufmann Publishers, San Francisco), 703–710.Crossref, Google Scholar
Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: A survey. J. Artificial Intelligence Res. 4:237–285.Crossref, Google Scholar
Kearns M, Singh S (2002) Near-optimal reinforcement learning in polynomial time. Machine Learn. 49(2):209–232.Crossref, Google Scholar
Kim SH, Nelson BL (2006) Selecting the best system. Henderson SG, Nelson BL, eds. Handbooks of Operations Research and Management Science: Simulation, vol. 13 (North-Holland Publishing, Amsterdam), 501–534.Google Scholar
Koole G, Pot A (2005) Approximate dynamic programming in multi-skill call centers. Kuhl ME, Steiger NM, Armstrong FB, Joines JA, eds. Proc. 2005 Winter Simulation Conf. (IEEE, Piscataway, NJ), 576–583.Crossref, Google Scholar
Kushner HJ, Yin G (2003) Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. (Springer-Verlag, New York).Google Scholar
Lagoudakis MG, Parr R, Littman ML (2002) Least-squares methods in reinforcement learning for control. Vlahavas I, Spyropoulos C, eds. Methods and Applications of Artificial Intelligence: Second Hellenic Conference on AI (Springer-Verlag, New York), 249–260.Crossref, Google Scholar
Lai G, Margot F, Secomandi N (2010) An approximate dynamic programming approach to benchmark practice-based heuristics for natural gas storage valuation. Oper. Res. 58(3):564–582.Link, Google Scholar
Löhndorf N, Minner S (2010) Optimal day-ahead trading and storage of renewable energies—An approximate dynamic programming approach. Energy Systems 1(1):61–77.Crossref, Google Scholar
Mes MRK, Frazier PI, Powell WB (2011) Hierarchical knowledge gradient for sequential sampling. J. Machine Learn. Res. 12(10):2931–2974.Google Scholar
Minkoff AS (1993) A Markov decision model and decomposition heuristic for dynamic vehicle dispatching. Oper. Res. 41(1):77–90.Link, Google Scholar
Nascimento JM, Powell WB (2009) An optimal approximate dynamic programming algorithm for the lagged asset acquisition problem. Math. Oper. Res. 34(1):210–237.Link, Google Scholar
Nascimento JM, Powell WB (2010) Dynamic programming models and algorithms for the mutual fund cash balance problem. Management Sci. 56(5):801–815.Link, Google Scholar
Negoescu DM, Frazier PI, Powell WB (2011) The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS J. Comput. 23(3):346–363.Link, Google Scholar
Neu G, György A, Szepesvári C (2012) The adversarial stochastic shortest path problem with unknown transition probabilities. Lawrence ND, Girolami M, eds. Proc. 15th Internat. Conf. Artificial Intelligence Statist., La Palma, Canary Islands, 805–813.Google Scholar
Pazis J, Parr R (2013) PAC optimal exploration in continuous space Markov decision processes. Proc. 27th AAAI Conf. Artificial Intelligence (AAAI Press, Menlo Park, CA), 774–781.Google Scholar
Pérez Rivera AE, Mes MRK (2017) Anticipatory freight selection in intermodal long-haul round-trips. Transportation Res. Part E: Logist. Transportation Rev. 105:176–194.Crossref, Google Scholar
Powell WB (2011) Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. (John Wiley & Sons, New York).Crossref, Google Scholar
Powell WB, Ryzhov IO (2012) Optimal Learning (John Wiley & Sons, Hoboken, NJ).Crossref, Google Scholar
Puterman ML (1994) Markov Decision Processes (John Wiley & Sons, New York).Crossref, Google Scholar
Rasmussen CE, Williams CKI (2006) Gaussian Processes for Machine Learning (MIT Press, Cambridge, MA).Google Scholar
Ribeiro CHC, Szepesvári C (1996) Q-learning combined with spreading: Convergence and results. Proc. ISRF-IEE Internat. Conf. Intelligent Cognitive Systems (Neural Networks Symposium, Tehran, Iran), 32–36.Google Scholar
Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
Ryzhov IO (2015) Approximate Bayesian inference for simulation and optimization. Defourny B, Terlaky T, eds. Modeling and Optimization: Theory and Applications (Springer, New York), 1–28.Crossref, Google Scholar
Ryzhov IO, Chen Y (2017) Bayesian belief models in simulation-based decision-making. Tolk A, Fowler J, Shao G, Yücesan E, eds. Advances in Modeling and Simulation: Seminal Research from 50 Years of Winter Simulation Conferences (Springer, New York), 181–218.Crossref, Google Scholar
Ryzhov IO, Powell WB (2010) Approximate dynamic programming with correlated Bayesian beliefs. Viswanath P, Meyn S, eds. Proc. 48th Allerton Conf. Comm. Control Comput. (IEEE, Piscataway, NJ), 1360–1367.Crossref, Google Scholar
Ryzhov IO, Powell WB (2011) Bayesian active learning with basis functions. Sarangapani J, Zhang H, Wiering M, eds. Proc. 2011 IEEE Sympos. Adaptive Dynamic Programming Reinforcement Learn. (IEEE, Piscataway, NJ), 143–150.Crossref, Google Scholar
Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180–195.Link, Google Scholar
Sabes P (1993) Approximating Q-values with basis function representations. Mozer M, Touretzky D, Smolensky P, eds. Proc. 4th Connectionist Models Summer School (Morgan Kaufmann Publishers, San Mateo, CA), 264–271.Google Scholar
Salas DF, Powell WB (2018) Benchmarking a scalable approximate dynamic programming algorithm for stochastic control of multidimensional energy storage problems. INFORMS J. Comput. 30(1):106–123.Link, Google Scholar
Scott WR, Frazier PI, Powell WB (2011) The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM J. Optim. 21(3):996–1026.Crossref, Google Scholar
Secomandi N (2010) Optimal commodity trading with a capacitated storage asset. Management Sci. 56(3):449–467.Link, Google Scholar
Simão HP, Day J, George AP, Gifford T, Nienow J, Powell WB (2009) An approximate dynamic programming algorithm for large-scale fleet management: A case application. Transportation Sci. 43(2):178–197.Link, Google Scholar
Simão HP, George A, Powell WB, Gifford T, Nienow J, Day J (2010) Approximate dynamic programming captures fleet operations for Schneider National. Interfaces 40(5):342–352.Link, Google Scholar
Srinivas N, Krause A, Kakade SM, Seeger M (2010) Gaussian process optimization in the bandit setting: No regret and experimental design. Fürnkranz J, Joachims T, eds. Proc. 27th Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 1015–1022.Google Scholar
Sutton R, Barto A (1998) Reinforcement Learning (MIT Press, Cambridge, MA).Google Scholar
Sutton R, Szepesvári C, Maei H (2008) A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Koller D, Bengio Y, Schuurmans D, Bottou L, Culotta R, eds. Advances in Neural Information Processing Systems, vol. 21 (Curran Associates, Red Hook, NY), 1609–1616.Google Scholar
Sutton R, Maei H, Precup D, Bhatnagar S, Silver D, Szepesvári C, Wiewiora E (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. Bottou L, Littman M, eds. Proc. 26th Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 993–1000.Crossref, Google Scholar
Szepesvári C (2010) Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4 (Morgan & Claypool Publishers, San Rafael, CA).Crossref, Google Scholar
Szepesvári C, Littman M (1999) A unified analysis of value-function-based reinforcement-learning algorithms. Neural Comput. 11(8):2017–2060.Crossref, Google Scholar
Tesauro G (1992) Practical issues in temporal difference learning. Machine Learn. 8(3/4):257–277.Crossref, Google Scholar
Topaloglu H, Powell WB (2006) Dynamic programming approximations for stochastic, time-staged integer multicommodity flow problems. INFORMS J. Comput. 18(1):31–42.Link, Google Scholar
Tsitsiklis J (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.Crossref, Google Scholar
Tsitsiklis J, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Crossref, Google Scholar
Watkins CJCH, Dayan P (1992) Q-learning. Machine Learn. 8(3):279–292.Crossref, Google Scholar
Wen Z, Van Roy B (2013) Efficient exploration and value function generalization in deterministic systems. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Red Hook, NY), 3021–3029.Google Scholar
Zhou Y, Scheller-Wolf A, Secomandi N, Smith S (2018) Managing wind-based electricity generation in the presence of storage and transmission capacity. Working Paper, Singapore Management University, Singapore.Google Scholar

Volume 67, Issue 1

January-February 2019

Pages ii-iv, 1-294

Article Information

Supplemental Material

Metrics

Information

Received:July 22, 2015
Accepted:May 10, 2018
Published Online:January 18, 2019

Cite as

Ilya O. Ryzhov, Martijn R. K. Mes, Warren B. Powell, Gerald van den Berg (2019) Bayesian Exploration for Approximate Dynamic Programming. Operations Research 67(1):198-214.

https://doi.org/10.1287/opre.2018.1772

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Bayesian Exploration for Approximate Dynamic Programming

References

Volume 67, Issue 1

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News