Bayesian Exploration for Approximate Dynamic Programming
Published Online:18 Jan 2019https://doi.org/10.1287/opre.2018.1772
References
- (2014) Markov decision problems where means bound variances. Oper. Res. 62(4):864–875.Link, Google Scholar
- (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2/3):235–256.Crossref, Google Scholar
- (1995) Residual algorithms: Reinforcement learning with function approximation. Prieditis A, Russell SJ, eds. Proc. 12th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers Inc., San Francisco, CA), 30–37.Crossref, Google Scholar
- (1957) Dynamic Programming (Princeton University Press, Princeton, NJ).Google Scholar
- (1959) On adaptive control processes. IRE Trans. Automatic Control 4(2):1–9.Crossref, Google Scholar
- (2012) Dynamic Programming and Optimal Control: Approximate Dynamic Programming, vol. 2, 4th ed. (Athena Scientific, Belmont, MA).Google Scholar
- (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
- (2003) R-max—A general polynomial time algorithm for near-optimal reinforcement learning. J. Machine Learn. Res. 3(October):213–231.Google Scholar
- (2000) Incomplete learning from endogenous data in dynamic allocation. Econometrica 68(6):1511–1516.Crossref, Google Scholar
- (2013) Optimal sequential exploration: Bandits, clairvoyants, and wildcats. Oper. Res. 61(3):644–665.Link, Google Scholar
- (2010) Reinforcement Learning and Dynamic Programming Using Function Approximators (CRC Press, Boca Raton, FL).Crossref, Google Scholar
- (1997) Approximate dynamic programming for sensor management. Proc. 36th IEEE Conf. Decision Control, vol. 2 (IEEE, Piscataway, NJ), 1202–1207.Crossref, Google Scholar
- (2006) Optimization of a large-scale water reservoir network by stochastic dynamic programming with efficient state space discretization. Eur. J. Oper. Res. 171(3):1139–1151.Crossref, Google Scholar
- (2006) Subjective probability and Bayesian methodology. Henderson SG, Nelson BL, eds. Handbooks of Operations Research and Management Science: Simulation, vol. 13 (North-Holland Publishing, Amsterdam), 225–258.Google Scholar
- (1998) Bayesian Q-learning. Mostow J, Rich C, eds. Proc. 15th Natl. Conf. Artificial Intelligence (AAAI Press, Menlo Park, CA), 761–768.Google Scholar
- (2009) Gaussian process dynamic programming. Neurocomput. 72(7):1508–1524.Crossref, Google Scholar
- (1996) Local bandit approximation for optimal learning problems. Mozer M, Jordan M, Pesche T, eds. Advances in Neural Information Processing Systems, vol. 9 (MIT Press, Cambridge, MA), 1019–1025.Google Scholar
- (2003) Bayes meets Bellman: The Gaussian process approach to temporal difference learning. Fawcett T, Mishra N, eds. Proc. 20th Internat. Conf. Machine Learn. (AAAI Press, Menlo Park, CA), 154–161.Google Scholar
- (2005) Reinforcement learning with Gaussian processes. De Raedt L, Wrobel S, eds. Proc. 22nd Internat. Conf. Machine Learn. (ACM Press, New York, NY), 208–215.Crossref, Google Scholar
- (2009) The knowledge-gradient policy for correlated normal rewards. INFORMS J. Comput. 21(4):599–613.Link, Google Scholar
- (2006) Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learn. 65(1):167–198.Crossref, Google Scholar
- (2008) Value function approximation using multiple aggregation for multiattribute resource management. J. Machine Learn. Res. 9:2079–2111.Google Scholar
- (2011) Multi-Armed Bandit Allocation Indices, 2nd ed. (John Wiley & Sons, Hoboken, NJ).Crossref, Google Scholar
- (2007) Opportunity cost and OCBA selection procedures in ordinal optimization for a fixed number of alternative systems. IEEE Trans. Systems Man Cybernetics 37(5):951–961.Crossref, Google Scholar
- (2012) Approximate dynamic programming algorithms for optimal dosage decisions in controlled ovarian hyperstimulation. Eur. J. Oper. Res. 222(2):328–340.Crossref, Google Scholar
- (2009) A brief introduction to optimization via simulation. Rosetti M, Hill R, Johansson B, Dunkin A, Ingalls R, eds. Proc. 2009 Winter Simulation Conf. (IEEE, Piscataway, NJ), 75–85.Crossref, Google Scholar
- (1994) Convergence of stochastic iterative dynamic programming algorithms. Cowan J, Tesauro G, Alspector J, eds. Advances in Neural Information Processing Systems, vol. 6 (Morgan Kaufmann Publishers, San Francisco), 703–710.Crossref, Google Scholar
- (1996) Reinforcement learning: A survey. J. Artificial Intelligence Res. 4:237–285.Crossref, Google Scholar
- (2002) Near-optimal reinforcement learning in polynomial time. Machine Learn. 49(2):209–232.Crossref, Google Scholar
- (2006) Selecting the best system. Henderson SG, Nelson BL, eds. Handbooks of Operations Research and Management Science: Simulation, vol. 13 (North-Holland Publishing, Amsterdam), 501–534.Google Scholar
- (2005) Approximate dynamic programming in multi-skill call centers. Kuhl ME, Steiger NM, Armstrong FB, Joines JA, eds. Proc. 2005 Winter Simulation Conf. (IEEE, Piscataway, NJ), 576–583.Crossref, Google Scholar
- (2003) Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. (Springer-Verlag, New York).Google Scholar
- (2002) Least-squares methods in reinforcement learning for control. Vlahavas I, Spyropoulos C, eds. Methods and Applications of Artificial Intelligence: Second Hellenic Conference on AI (Springer-Verlag, New York), 249–260.Crossref, Google Scholar
- (2010) An approximate dynamic programming approach to benchmark practice-based heuristics for natural gas storage valuation. Oper. Res. 58(3):564–582.Link, Google Scholar
- (2010) Optimal day-ahead trading and storage of renewable energies—An approximate dynamic programming approach. Energy Systems 1(1):61–77.Crossref, Google Scholar
- (2011) Hierarchical knowledge gradient for sequential sampling. J. Machine Learn. Res. 12(10):2931–2974.Google Scholar
- (1993) A Markov decision model and decomposition heuristic for dynamic vehicle dispatching. Oper. Res. 41(1):77–90.Link, Google Scholar
- (2009) An optimal approximate dynamic programming algorithm for the lagged asset acquisition problem. Math. Oper. Res. 34(1):210–237.Link, Google Scholar
- (2010) Dynamic programming models and algorithms for the mutual fund cash balance problem. Management Sci. 56(5):801–815.Link, Google Scholar
- (2011) The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS J. Comput. 23(3):346–363.Link, Google Scholar
- (2012) The adversarial stochastic shortest path problem with unknown transition probabilities. Lawrence ND, Girolami M, eds. Proc. 15th Internat. Conf. Artificial Intelligence Statist., La Palma, Canary Islands, 805–813.Google Scholar
- (2013) PAC optimal exploration in continuous space Markov decision processes. Proc. 27th AAAI Conf. Artificial Intelligence (AAAI Press, Menlo Park, CA), 774–781.Google Scholar
- (2017) Anticipatory freight selection in intermodal long-haul round-trips. Transportation Res. Part E: Logist. Transportation Rev. 105:176–194.Crossref, Google Scholar
- (2011) Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. (John Wiley & Sons, New York).Crossref, Google Scholar
- (2012) Optimal Learning (John Wiley & Sons, Hoboken, NJ).Crossref, Google Scholar
- (1994) Markov Decision Processes (John Wiley & Sons, New York).Crossref, Google Scholar
- (2006) Gaussian Processes for Machine Learning (MIT Press, Cambridge, MA).Google Scholar
- (1996) Q-learning combined with spreading: Convergence and results. Proc. ISRF-IEE Internat. Conf. Intelligent Cognitive Systems (Neural Networks Symposium, Tehran, Iran), 32–36.Google Scholar
- (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
- (2015) Approximate Bayesian inference for simulation and optimization. Defourny B, Terlaky T, eds. Modeling and Optimization: Theory and Applications (Springer, New York), 1–28.Crossref, Google Scholar
- (2017) Bayesian belief models in simulation-based decision-making. Tolk A, Fowler J, Shao G, Yücesan E, eds. Advances in Modeling and Simulation: Seminal Research from 50 Years of Winter Simulation Conferences (Springer, New York), 181–218.Crossref, Google Scholar
- (2010) Approximate dynamic programming with correlated Bayesian beliefs. Viswanath P, Meyn S, eds. Proc. 48th Allerton Conf. Comm. Control Comput. (IEEE, Piscataway, NJ), 1360–1367.Crossref, Google Scholar
- (2011) Bayesian active learning with basis functions. Sarangapani J, Zhang H, Wiering M, eds. Proc. 2011 IEEE Sympos. Adaptive Dynamic Programming Reinforcement Learn. (IEEE, Piscataway, NJ), 143–150.Crossref, Google Scholar
- (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180–195.Link, Google Scholar
- (1993) Approximating Q-values with basis function representations. Mozer M, Touretzky D, Smolensky P, eds. Proc. 4th Connectionist Models Summer School (Morgan Kaufmann Publishers, San Mateo, CA), 264–271.Google Scholar
- (2018) Benchmarking a scalable approximate dynamic programming algorithm for stochastic control of multidimensional energy storage problems. INFORMS J. Comput. 30(1):106–123.Link, Google Scholar
- (2011) The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM J. Optim. 21(3):996–1026.Crossref, Google Scholar
- (2010) Optimal commodity trading with a capacitated storage asset. Management Sci. 56(3):449–467.Link, Google Scholar
- (2009) An approximate dynamic programming algorithm for large-scale fleet management: A case application. Transportation Sci. 43(2):178–197.Link, Google Scholar
- (2010) Approximate dynamic programming captures fleet operations for Schneider National. Interfaces 40(5):342–352.Link, Google Scholar
- (2010) Gaussian process optimization in the bandit setting: No regret and experimental design. Fürnkranz J, Joachims T, eds. Proc. 27th Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 1015–1022.Google Scholar
- (1998) Reinforcement Learning (MIT Press, Cambridge, MA).Google Scholar
- (2008) A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Koller D, Bengio Y, Schuurmans D, Bottou L, Culotta R, eds. Advances in Neural Information Processing Systems, vol. 21 (Curran Associates, Red Hook, NY), 1609–1616.Google Scholar
- (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. Bottou L, Littman M, eds. Proc. 26th Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 993–1000.Crossref, Google Scholar
- (2010) Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4 (Morgan & Claypool Publishers, San Rafael, CA).Crossref, Google Scholar
- (1999) A unified analysis of value-function-based reinforcement-learning algorithms. Neural Comput. 11(8):2017–2060.Crossref, Google Scholar
- (1992) Practical issues in temporal difference learning. Machine Learn. 8(3/4):257–277.Crossref, Google Scholar
- (2006) Dynamic programming approximations for stochastic, time-staged integer multicommodity flow problems. INFORMS J. Comput. 18(1):31–42.Link, Google Scholar
- (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.Crossref, Google Scholar
- (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Crossref, Google Scholar
- (1992) Q-learning. Machine Learn. 8(3):279–292.Crossref, Google Scholar
- (2013) Efficient exploration and value function generalization in deterministic systems. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Red Hook, NY), 3021–3029.Google Scholar
- (2018) Managing wind-based electricity generation in the presence of storage and transmission capacity. Working Paper, Singapore Management University, Singapore.Google Scholar

