Optimal Online Learning for Nonlinear Belief Models Using Discrete Priors

Published Online:https://doi.org/10.1287/opre.2019.1921

References

  • Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Annual Conf. Learn. Theory (COLT) (PMLR, Edinburgh, UK), 39.1–39.26.Google Scholar
  • Audibert JY, Bubeck S (2010) Best arm identification in multi-armed bandits. Proc. 23rd Annual Conf. Learn. Theory (COLT), Haifa, Israel.Google Scholar
  • Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2):235–256.CrossrefGoogle Scholar
  • Bertsimas D, Nino-Mora J (2000) Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Oper. Res. 48(1):80–90.LinkGoogle Scholar
  • Bertsimas D, Perakis G (2006) Dynamic pricing: A learning approach. Lawphongpanich S, Hearn DW, eds. Mathematical and Computational Models for Congestion Charging (Springer, Boston), 45–79.CrossrefGoogle Scholar
  • Besbes O, Zeevi A (2009) Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Oper. Res. 57(6):1407–1420.LinkGoogle Scholar
  • Broder J, Rusmevichientong P (2012) Dynamic pricing under a general parametric choice model. Oper. Res. 60(4):965–980.LinkGoogle Scholar
  • Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Google Scholar
  • Bubeck S, Stoltz G, Szepesvári C, Munos R (2009) Online optimization in x-armed bandits. Koller D, Schuurmans D, Bengio Y, Bottou L, eds. Advances in Neural Information Processing Systems, vol. 21 (Curran Associates, Red Hook, NY), 201–208.Google Scholar
  • Burnetas AN, Katehakis MN (1997) Optimal adaptive policies for Markov decision processes. Math. Oper. Res. 22(1):222–255.LinkGoogle Scholar
  • Chen S, Reyes KG, Gupta M, McAlpine MC, Powell WB (2015) Optimal learning in experimental design using the knowledge gradient policy with application to characterizing the nanoemulsion stability. SIAM/ASA J. Uncertainty Quantification 3:320–345.CrossrefGoogle Scholar
  • Chick SE, Gans N (2009) Economic analysis of simulation selection problems. Management Sci. 55(3):421–437.LinkGoogle Scholar
  • Chick SE, Gans N, Yapar O (2018) Bayesian sequential learning for clinical trials of multiple correlated medical interventions. INSEAD Working Paper 2018/20/TOM/ACGRE, INSEAD, Fontainebleau, France.Google Scholar
  • DeGroot M (1970) Optimal Statistical Decisions (McGraw-Hill, New York).Google Scholar
  • Deisenroth MP, Neumann G, Peters J (2013) A survey on policy search for robotics. Foundations Trends Robotics 2(12):1–142.Google Scholar
  • Duff MO (1995) Q-learning for bandit problems. Prieditis A, Russell S, eds. Proc. 12th Internat. Conf. Machine Learn. (Morgan Kaufmann, San Francisco), 209–217.Google Scholar
  • Edwards J, Fearnhead P, Glazebrook K (2017) On the identification and mitigation of weaknesses in the knowledge gradient policy for multi-armed bandits. Probab. Engrg. Inform. Sci. 31(2):239–263.CrossrefGoogle Scholar
  • Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A, eds. Advances in Neural Information Processing Systems, vol. 23 (Curran Associates, Red Hook, NY), 586–594.Google Scholar
  • Frazier P, Powell W, Dayanik S (2009) The knowledge-gradient policy for correlated normal beliefs. INFORMS J. Comput. 21(4):599–613.LinkGoogle Scholar
  • Frazier PI, Powell WB (2010) Paradoxes in learning and the marginal value of information. Decision Anal. 7(4):378–403.LinkGoogle Scholar
  • Frazier PI, Powell WB, Dayanik S (2008) A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim. 47(5):2410–2439.CrossrefGoogle Scholar
  • Gittins J, Jones D (1974) A dynamic allocation index for the sequential design of experiments. Gani J, ed. Progress in Statistics (North-Holland, Amsterdam), 241–266.Google Scholar
  • Gittins J, Glazebrook K, Weber R (2011) Multi-Armed Bandit Allocation Indices, 2nd ed. (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • Holford NHG, Sheiner LB (1981) Understanding the dose-effect relationship. Clinical Pharmacokinetics 6(6):429–453.CrossrefGoogle Scholar
  • Hu J, Fu MC, Ramezani VR, Marcus SI (2007) An evolutionary random policy search algorithm for solving Markov decision processes. INFORMS J. Comput. 19(2):161–174.LinkGoogle Scholar
  • Huang D, Allen TT, Notz WI, Zeng N (2006) Global optimization of stochastic black-box systems via sequential kriging meta-models. J. Global Optim. 34(3):441–466.CrossrefGoogle Scholar
  • Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive black-box functions. J. Global Optim. 13(4):455–492.CrossrefGoogle Scholar
  • Kaelbling LP (1993) Learning in Embedded Systems (MIT Press, Cambridge, MA).CrossrefGoogle Scholar
  • Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: A survey. J. Artificial Intelligence Res. 4(1):237–285.CrossrefGoogle Scholar
  • Katehakis MN, Veinott AF (1987) The multi-armed bandit problem: Decomposition and computation. Math. Oper. Res. 12(2):262–268.LinkGoogle Scholar
  • Keskin NB, Zeevi A (2014) Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Oper. Res. 62(5):1142–1167.LinkGoogle Scholar
  • Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.CrossrefGoogle Scholar
  • Lai TL (1987) Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. 15(3):1091–1114.CrossrefGoogle Scholar
  • Lovejoy WS (1991) A survey of algorithmic methods for partially observed Markov decision processes. Ann. Oper. Res. 28(1):47–66.CrossrefGoogle Scholar
  • Mannor S, Rubinstein R, Gat Y (2003) The cross entropy method for fast policy search. Fawcett T, Mishra N, eds. Proc. 20th Internat. Conf. Machine Learn. (AAAI Press, Washington, DC), 512–519.Google Scholar
  • Ng AY, Jordan M (2000) Pegasus: A policy search method for large MDPS and POMDPS. Boutilier C, Goldszmidt M, eds. Proc. 16th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers Inc., San Francisco), 406–415.Google Scholar
  • Peshkin L, Kim KE, Meuleau N, Kaelbling LP (2000) Learning to cooperate via policy search. Boutilier C, Goldszmidt M, eds. Proc. 16th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers Inc., San Francisco), 489–496.Google Scholar
  • Powell WB (2011) Approximate Dynamic Programming: Solving the Curses of Dimensionality (John Wiley & Sons, Hoboken, NJ).CrossrefGoogle Scholar
  • Powell WB (2016) A unified framework for optimization under uncertainty. Gupta A, Capponi A, eds. Optimization Challenges in Complex, Networked and Risky Systems, TuTORials in Operations Research (INFORMS, Catonsville, MD), 45–83LinkGoogle Scholar
  • Powell WB, Meisel S (2016) Tutorial on stochastic optimization in energy—Part I: Modeling and policies. IEEE Trans. Power Systems 31(2):1459–1467.CrossrefGoogle Scholar
  • Powell WB, Ryzhov IO (2012) Optimal Learning (John Wiley & Sons, Hoboken, NJ).CrossrefGoogle Scholar
  • Ross S, Pineau J, Paquet S, Chaib-draa B (2008) Online planning algorithms for POMDPS. J. Artificial Intelligence Res. 32(1):663–704.CrossrefGoogle Scholar
  • Russo D, Roy BV (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.LinkGoogle Scholar
  • Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180–195.LinkGoogle Scholar
  • Silver D, Veness J (2010) Monte-Carlo planning in large POMDPS. Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A, eds. Advances in Neural Information Processing, vol. 23 (Curran Associates, Red Hook, NY), 2164–2172.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.