Optimal Online Learning for Nonlinear Belief Models Using Discrete Priors
Published Online: 29 May 2020
https://doi.org/10.1287/opre.2019.1921

