Bayesian Exploration for Approximate Dynamic Programming

Published Online:https://doi.org/10.1287/opre.2018.1772

References

  • Arlotto A, Gans N, Steele JM (2014) Markov decision problems where means bound variances. Oper. Res. 62(4):864–875.LinkGoogle Scholar
  • Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2/3):235–256.CrossrefGoogle Scholar
  • Baird LC (1995) Residual algorithms: Reinforcement learning with function approximation. Prieditis A, Russell SJ, eds. Proc. 12th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers Inc., San Francisco, CA), 30–37.CrossrefGoogle Scholar
  • Bellman R (1957) Dynamic Programming (Princeton University Press, Princeton, NJ).Google Scholar
  • Bellman R, Kalaba R (1959) On adaptive control processes. IRE Trans. Automatic Control 4(2):1–9.CrossrefGoogle Scholar
  • Bertsekas DP (2012) Dynamic Programming and Optimal Control: Approximate Dynamic Programming, vol. 2, 4th ed. (Athena Scientific, Belmont, MA).Google Scholar
  • Bertsekas DP, Tsitsiklis J (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
  • Brafman RI, Tennenholtz M (2003) R-max—A general polynomial time algorithm for near-optimal reinforcement learning. J. Machine Learn. Res. 3(October):213–231.Google Scholar
  • Brezzi M, Lai T (2000) Incomplete learning from endogenous data in dynamic allocation. Econometrica 68(6):1511–1516.CrossrefGoogle Scholar
  • Brown DB, Smith JE (2013) Optimal sequential exploration: Bandits, clairvoyants, and wildcats. Oper. Res. 61(3):644–665.LinkGoogle Scholar
  • Busoniu L, Babuska R, De Schutter B, Ernst D (2010) Reinforcement Learning and Dynamic Programming Using Function Approximators (CRC Press, Boca Raton, FL).CrossrefGoogle Scholar
  • Castanon DA (1997) Approximate dynamic programming for sensor management. Proc. 36th IEEE Conf. Decision Control, vol. 2 (IEEE, Piscataway, NJ), 1202–1207.CrossrefGoogle Scholar
  • Cervellera C, Chen VCP, Wen A (2006) Optimization of a large-scale water reservoir network by stochastic dynamic programming with efficient state space discretization. Eur. J. Oper. Res. 171(3):1139–1151.CrossrefGoogle Scholar
  • Chick SE (2006) Subjective probability and Bayesian methodology. Henderson SG, Nelson BL, eds. Handbooks of Operations Research and Management Science: Simulation, vol. 13 (North-Holland Publishing, Amsterdam), 225–258.Google Scholar
  • Dearden R, Friedman N, Russell S (1998) Bayesian Q-learning. Mostow J, Rich C, eds. Proc. 15th Natl. Conf. Artificial Intelligence (AAAI Press, Menlo Park, CA), 761–768.Google Scholar
  • Deisenroth MP, Rasmussen CE, Peters J (2009) Gaussian process dynamic programming. Neurocomput. 72(7):1508–1524.CrossrefGoogle Scholar
  • Duff M, Barto A (1996) Local bandit approximation for optimal learning problems. Mozer M, Jordan M, Pesche T, eds. Advances in Neural Information Processing Systems, vol. 9 (MIT Press, Cambridge, MA), 1019–1025.Google Scholar
  • Engel Y, Mannor S, Meir R (2003) Bayes meets Bellman: The Gaussian process approach to temporal difference learning. Fawcett T, Mishra N, eds. Proc. 20th Internat. Conf. Machine Learn. (AAAI Press, Menlo Park, CA), 154–161.Google Scholar
  • Engel Y, Mannor S, Meir R (2005) Reinforcement learning with Gaussian processes. De Raedt L, Wrobel S, eds. Proc. 22nd Internat. Conf. Machine Learn. (ACM Press, New York, NY), 208–215.CrossrefGoogle Scholar
  • Frazier PI, Powell WB, Dayanik S (2009) The knowledge-gradient policy for correlated normal rewards. INFORMS J. Comput. 21(4):599–613.LinkGoogle Scholar
  • George A, Powell WB (2006) Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learn. 65(1):167–198.CrossrefGoogle Scholar
  • George A, Powell WB, Kulkarni SR (2008) Value function approximation using multiple aggregation for multiattribute resource management. J. Machine Learn. Res. 9:2079–2111.Google Scholar
  • Gittins JC, Glazebrook KD, Weber R (2011) Multi-Armed Bandit Allocation Indices, 2nd ed. (John Wiley & Sons, Hoboken, NJ).CrossrefGoogle Scholar
  • He D, Chick S, Chen CH (2007) Opportunity cost and OCBA selection procedures in ordinal optimization for a fixed number of alternative systems. IEEE Trans. Systems Man Cybernetics 37(5):951–961.CrossrefGoogle Scholar
  • He M, Zhao L, Powell WB (2012) Approximate dynamic programming algorithms for optimal dosage decisions in controlled ovarian hyperstimulation. Eur. J. Oper. Res. 222(2):328–340.CrossrefGoogle Scholar
  • Hong LJ, Nelson BL (2009) A brief introduction to optimization via simulation. Rosetti M, Hill R, Johansson B, Dunkin A, Ingalls R, eds. Proc. 2009 Winter Simulation Conf. (IEEE, Piscataway, NJ), 75–85.CrossrefGoogle Scholar
  • Jaakkola T, Jordan M, Singh S (1994) Convergence of stochastic iterative dynamic programming algorithms. Cowan J, Tesauro G, Alspector J, eds. Advances in Neural Information Processing Systems, vol. 6 (Morgan Kaufmann Publishers, San Francisco), 703–710.CrossrefGoogle Scholar
  • Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: A survey. J. Artificial Intelligence Res. 4:237–285.CrossrefGoogle Scholar
  • Kearns M, Singh S (2002) Near-optimal reinforcement learning in polynomial time. Machine Learn. 49(2):209–232.CrossrefGoogle Scholar
  • Kim SH, Nelson BL (2006) Selecting the best system. Henderson SG, Nelson BL, eds. Handbooks of Operations Research and Management Science: Simulation, vol. 13 (North-Holland Publishing, Amsterdam), 501–534.Google Scholar
  • Koole G, Pot A (2005) Approximate dynamic programming in multi-skill call centers. Kuhl ME, Steiger NM, Armstrong FB, Joines JA, eds. Proc. 2005 Winter Simulation Conf. (IEEE, Piscataway, NJ), 576–583.CrossrefGoogle Scholar
  • Kushner HJ, Yin G (2003) Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. (Springer-Verlag, New York).Google Scholar
  • Lagoudakis MG, Parr R, Littman ML (2002) Least-squares methods in reinforcement learning for control. Vlahavas I, Spyropoulos C, eds. Methods and Applications of Artificial Intelligence: Second Hellenic Conference on AI (Springer-Verlag, New York), 249–260.CrossrefGoogle Scholar
  • Lai G, Margot F, Secomandi N (2010) An approximate dynamic programming approach to benchmark practice-based heuristics for natural gas storage valuation. Oper. Res. 58(3):564–582.LinkGoogle Scholar
  • Löhndorf N, Minner S (2010) Optimal day-ahead trading and storage of renewable energies—An approximate dynamic programming approach. Energy Systems 1(1):61–77.CrossrefGoogle Scholar
  • Mes MRK, Frazier PI, Powell WB (2011) Hierarchical knowledge gradient for sequential sampling. J. Machine Learn. Res. 12(10):2931–2974.Google Scholar
  • Minkoff AS (1993) A Markov decision model and decomposition heuristic for dynamic vehicle dispatching. Oper. Res. 41(1):77–90.LinkGoogle Scholar
  • Nascimento JM, Powell WB (2009) An optimal approximate dynamic programming algorithm for the lagged asset acquisition problem. Math. Oper. Res. 34(1):210–237.LinkGoogle Scholar
  • Nascimento JM, Powell WB (2010) Dynamic programming models and algorithms for the mutual fund cash balance problem. Management Sci. 56(5):801–815.LinkGoogle Scholar
  • Negoescu DM, Frazier PI, Powell WB (2011) The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS J. Comput. 23(3):346–363.LinkGoogle Scholar
  • Neu G, György A, Szepesvári C (2012) The adversarial stochastic shortest path problem with unknown transition probabilities. Lawrence ND, Girolami M, eds. Proc. 15th Internat. Conf. Artificial Intelligence Statist., La Palma, Canary Islands, 805–813.Google Scholar
  • Pazis J, Parr R (2013) PAC optimal exploration in continuous space Markov decision processes. Proc. 27th AAAI Conf. Artificial Intelligence (AAAI Press, Menlo Park, CA), 774–781.Google Scholar
  • Pérez Rivera AE, Mes MRK (2017) Anticipatory freight selection in intermodal long-haul round-trips. Transportation Res. Part E: Logist. Transportation Rev. 105:176–194.CrossrefGoogle Scholar
  • Powell WB (2011) Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. (John Wiley & Sons, New York).CrossrefGoogle Scholar
  • Powell WB, Ryzhov IO (2012) Optimal Learning (John Wiley & Sons, Hoboken, NJ).CrossrefGoogle Scholar
  • Puterman ML (1994) Markov Decision Processes (John Wiley & Sons, New York).CrossrefGoogle Scholar
  • Rasmussen CE, Williams CKI (2006) Gaussian Processes for Machine Learning (MIT Press, Cambridge, MA).Google Scholar
  • Ribeiro CHC, Szepesvári C (1996) Q-learning combined with spreading: Convergence and results. Proc. ISRF-IEE Internat. Conf. Intelligent Cognitive Systems (Neural Networks Symposium, Tehran, Iran), 32–36.Google Scholar
  • Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.LinkGoogle Scholar
  • Ryzhov IO (2015) Approximate Bayesian inference for simulation and optimization. Defourny B, Terlaky T, eds. Modeling and Optimization: Theory and Applications (Springer, New York), 1–28.CrossrefGoogle Scholar
  • Ryzhov IO, Chen Y (2017) Bayesian belief models in simulation-based decision-making. Tolk A, Fowler J, Shao G, Yücesan E, eds. Advances in Modeling and Simulation: Seminal Research from 50 Years of Winter Simulation Conferences (Springer, New York), 181–218.CrossrefGoogle Scholar
  • Ryzhov IO, Powell WB (2010) Approximate dynamic programming with correlated Bayesian beliefs. Viswanath P, Meyn S, eds. Proc. 48th Allerton Conf. Comm. Control Comput. (IEEE, Piscataway, NJ), 1360–1367.CrossrefGoogle Scholar
  • Ryzhov IO, Powell WB (2011) Bayesian active learning with basis functions. Sarangapani J, Zhang H, Wiering M, eds. Proc. 2011 IEEE Sympos. Adaptive Dynamic Programming Reinforcement Learn. (IEEE, Piscataway, NJ), 143–150.CrossrefGoogle Scholar
  • Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180–195.LinkGoogle Scholar
  • Sabes P (1993) Approximating Q-values with basis function representations. Mozer M, Touretzky D, Smolensky P, eds. Proc. 4th Connectionist Models Summer School (Morgan Kaufmann Publishers, San Mateo, CA), 264–271.Google Scholar
  • Salas DF, Powell WB (2018) Benchmarking a scalable approximate dynamic programming algorithm for stochastic control of multidimensional energy storage problems. INFORMS J. Comput. 30(1):106–123.LinkGoogle Scholar
  • Scott WR, Frazier PI, Powell WB (2011) The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM J. Optim. 21(3):996–1026.CrossrefGoogle Scholar
  • Secomandi N (2010) Optimal commodity trading with a capacitated storage asset. Management Sci. 56(3):449–467.LinkGoogle Scholar
  • Simão HP, Day J, George AP, Gifford T, Nienow J, Powell WB (2009) An approximate dynamic programming algorithm for large-scale fleet management: A case application. Transportation Sci. 43(2):178–197.LinkGoogle Scholar
  • Simão HP, George A, Powell WB, Gifford T, Nienow J, Day J (2010) Approximate dynamic programming captures fleet operations for Schneider National. Interfaces 40(5):342–352.LinkGoogle Scholar
  • Srinivas N, Krause A, Kakade SM, Seeger M (2010) Gaussian process optimization in the bandit setting: No regret and experimental design. Fürnkranz J, Joachims T, eds. Proc. 27th Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 1015–1022.Google Scholar
  • Sutton R, Barto A (1998) Reinforcement Learning (MIT Press, Cambridge, MA).Google Scholar
  • Sutton R, Szepesvári C, Maei H (2008) A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Koller D, Bengio Y, Schuurmans D, Bottou L, Culotta R, eds. Advances in Neural Information Processing Systems, vol. 21 (Curran Associates, Red Hook, NY), 1609–1616.Google Scholar
  • Sutton R, Maei H, Precup D, Bhatnagar S, Silver D, Szepesvári C, Wiewiora E (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. Bottou L, Littman M, eds. Proc. 26th Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 993–1000.CrossrefGoogle Scholar
  • Szepesvári C (2010) Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4 (Morgan & Claypool Publishers, San Rafael, CA).CrossrefGoogle Scholar
  • Szepesvári C, Littman M (1999) A unified analysis of value-function-based reinforcement-learning algorithms. Neural Comput. 11(8):2017–2060.CrossrefGoogle Scholar
  • Tesauro G (1992) Practical issues in temporal difference learning. Machine Learn. 8(3/4):257–277.CrossrefGoogle Scholar
  • Topaloglu H, Powell WB (2006) Dynamic programming approximations for stochastic, time-staged integer multicommodity flow problems. INFORMS J. Comput. 18(1):31–42.LinkGoogle Scholar
  • Tsitsiklis J (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.CrossrefGoogle Scholar
  • Tsitsiklis J, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.CrossrefGoogle Scholar
  • Watkins CJCH, Dayan P (1992) Q-learning. Machine Learn. 8(3):279–292.CrossrefGoogle Scholar
  • Wen Z, Van Roy B (2013) Efficient exploration and value function generalization in deterministic systems. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Red Hook, NY), 3021–3029.Google Scholar
  • Zhou Y, Scheller-Wolf A, Secomandi N, Smith S (2018) Managing wind-based electricity generation in the presence of storage and transmission capacity. Working Paper, Singapore Management University, Singapore.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.