Optimal Online Learning for Nonlinear Belief Models Using Discrete Priors

Weidong Han
Corresponding Author
Weidong Han
[email protected]
https://orcid.org/0000-0003-2217-7125
Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08540
Search for more papers by this author
,
Warren B. Powell
Corresponding Author
Warren B. Powell
[email protected]
Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08540
Search for more papers by this author

Weidong Han

Corresponding Author

Weidong Han

[email protected]

https://orcid.org/0000-0003-2217-7125

Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08540

Search for more papers by this author

Warren B. Powell

Corresponding Author

Warren B. Powell

[email protected]

Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08540

Search for more papers by this author

Published Online:29 May 2020https://doi.org/10.1287/opre.2019.1921

References

Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Annual Conf. Learn. Theory (COLT) (PMLR, Edinburgh, UK), 39.1–39.26.Google Scholar
Audibert JY, Bubeck S (2010) Best arm identification in multi-armed bandits. Proc. 23rd Annual Conf. Learn. Theory (COLT), Haifa, Israel.Google Scholar
Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2):235–256.Crossref, Google Scholar
Bertsimas D, Nino-Mora J (2000) Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Oper. Res. 48(1):80–90.Link, Google Scholar
Bertsimas D, Perakis G (2006) Dynamic pricing: A learning approach. Lawphongpanich S, Hearn DW, eds. Mathematical and Computational Models for Congestion Charging (Springer, Boston), 45–79.Crossref, Google Scholar
Besbes O, Zeevi A (2009) Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Oper. Res. 57(6):1407–1420.Link, Google Scholar
Broder J, Rusmevichientong P (2012) Dynamic pricing under a general parametric choice model. Oper. Res. 60(4):965–980.Link, Google Scholar
Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Google Scholar
Bubeck S, Stoltz G, Szepesvári C, Munos R (2009) Online optimization in x-armed bandits. Koller D, Schuurmans D, Bengio Y, Bottou L, eds. Advances in Neural Information Processing Systems, vol. 21 (Curran Associates, Red Hook, NY), 201–208.Google Scholar
Burnetas AN, Katehakis MN (1997) Optimal adaptive policies for Markov decision processes. Math. Oper. Res. 22(1):222–255.Link, Google Scholar
Chen S, Reyes KG, Gupta M, McAlpine MC, Powell WB (2015) Optimal learning in experimental design using the knowledge gradient policy with application to characterizing the nanoemulsion stability. SIAM/ASA J. Uncertainty Quantification 3:320–345.Crossref, Google Scholar
Chick SE, Gans N (2009) Economic analysis of simulation selection problems. Management Sci. 55(3):421–437.Link, Google Scholar
Chick SE, Gans N, Yapar O (2018) Bayesian sequential learning for clinical trials of multiple correlated medical interventions. INSEAD Working Paper 2018/20/TOM/ACGRE, INSEAD, Fontainebleau, France.Google Scholar
DeGroot M (1970) Optimal Statistical Decisions (McGraw-Hill, New York).Google Scholar
Deisenroth MP, Neumann G, Peters J (2013) A survey on policy search for robotics. Foundations Trends Robotics 2(12):1–142.Google Scholar
Duff MO (1995) Q-learning for bandit problems. Prieditis A, Russell S, eds. Proc. 12th Internat. Conf. Machine Learn. (Morgan Kaufmann, San Francisco), 209–217.Google Scholar
Edwards J, Fearnhead P, Glazebrook K (2017) On the identification and mitigation of weaknesses in the knowledge gradient policy for multi-armed bandits. Probab. Engrg. Inform. Sci. 31(2):239–263.Crossref, Google Scholar
Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A, eds. Advances in Neural Information Processing Systems, vol. 23 (Curran Associates, Red Hook, NY), 586–594.Google Scholar
Frazier P, Powell W, Dayanik S (2009) The knowledge-gradient policy for correlated normal beliefs. INFORMS J. Comput. 21(4):599–613.Link, Google Scholar
Frazier PI, Powell WB (2010) Paradoxes in learning and the marginal value of information. Decision Anal. 7(4):378–403.Link, Google Scholar
Frazier PI, Powell WB, Dayanik S (2008) A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim. 47(5):2410–2439.Crossref, Google Scholar
Gittins J, Jones D (1974) A dynamic allocation index for the sequential design of experiments. Gani J, ed. Progress in Statistics (North-Holland, Amsterdam), 241–266.Google Scholar
Gittins J, Glazebrook K, Weber R (2011) Multi-Armed Bandit Allocation Indices, 2nd ed. (John Wiley & Sons, Hoboken, NJ).Google Scholar
Holford NHG, Sheiner LB (1981) Understanding the dose-effect relationship. Clinical Pharmacokinetics 6(6):429–453.Crossref, Google Scholar
Hu J, Fu MC, Ramezani VR, Marcus SI (2007) An evolutionary random policy search algorithm for solving Markov decision processes. INFORMS J. Comput. 19(2):161–174.Link, Google Scholar
Huang D, Allen TT, Notz WI, Zeng N (2006) Global optimization of stochastic black-box systems via sequential kriging meta-models. J. Global Optim. 34(3):441–466.Crossref, Google Scholar
Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive black-box functions. J. Global Optim. 13(4):455–492.Crossref, Google Scholar
Kaelbling LP (1993) Learning in Embedded Systems (MIT Press, Cambridge, MA).Crossref, Google Scholar
Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: A survey. J. Artificial Intelligence Res. 4(1):237–285.Crossref, Google Scholar
Katehakis MN, Veinott AF (1987) The multi-armed bandit problem: Decomposition and computation. Math. Oper. Res. 12(2):262–268.Link, Google Scholar
Keskin NB, Zeevi A (2014) Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Oper. Res. 62(5):1142–1167.Link, Google Scholar
Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
Lai TL (1987) Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. 15(3):1091–1114.Crossref, Google Scholar
Lovejoy WS (1991) A survey of algorithmic methods for partially observed Markov decision processes. Ann. Oper. Res. 28(1):47–66.Crossref, Google Scholar
Mannor S, Rubinstein R, Gat Y (2003) The cross entropy method for fast policy search. Fawcett T, Mishra N, eds. Proc. 20th Internat. Conf. Machine Learn. (AAAI Press, Washington, DC), 512–519.Google Scholar
Ng AY, Jordan M (2000) Pegasus: A policy search method for large MDPS and POMDPS. Boutilier C, Goldszmidt M, eds. Proc. 16th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers Inc., San Francisco), 406–415.Google Scholar
Peshkin L, Kim KE, Meuleau N, Kaelbling LP (2000) Learning to cooperate via policy search. Boutilier C, Goldszmidt M, eds. Proc. 16th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers Inc., San Francisco), 489–496.Google Scholar
Powell WB (2011) Approximate Dynamic Programming: Solving the Curses of Dimensionality (John Wiley & Sons, Hoboken, NJ).Crossref, Google Scholar
Powell WB (2016) A unified framework for optimization under uncertainty. Gupta A, Capponi A, eds. Optimization Challenges in Complex, Networked and Risky Systems, TuTORials in Operations Research (INFORMS, Catonsville, MD), 45–83Link, Google Scholar
Powell WB, Meisel S (2016) Tutorial on stochastic optimization in energy—Part I: Modeling and policies. IEEE Trans. Power Systems 31(2):1459–1467.Crossref, Google Scholar
Powell WB, Ryzhov IO (2012) Optimal Learning (John Wiley & Sons, Hoboken, NJ).Crossref, Google Scholar
Ross S, Pineau J, Paquet S, Chaib-draa B (2008) Online planning algorithms for POMDPS. J. Artificial Intelligence Res. 32(1):663–704.Crossref, Google Scholar
Russo D, Roy BV (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180–195.Link, Google Scholar
Silver D, Veness J (2010) Monte-Carlo planning in large POMDPS. Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A, eds. Advances in Neural Information Processing, vol. 23 (Curran Associates, Red Hook, NY), 2164–2172.Google Scholar

Volume 68, Issue 5

September-October 2020

Pages iii-vi, 1285-1624, C2-C3

Article Information

Supplemental Material

Metrics

Information

Received:March 04, 2017
Accepted:July 22, 2019
Published Online:May 29, 2020

Cite as

Weidong Han, Warren B. Powell (2020) Optimal Online Learning for Nonlinear Belief Models Using Discrete Priors. Operations Research 68(5):1538-1556.

https://doi.org/10.1287/opre.2019.1921

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Optimal Online Learning for Nonlinear Belief Models Using Discrete Priors

References

Volume 68, Issue 5

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News