Mostly Exploration-Free Algorithms for Contextual Bandits

Published Online:https://doi.org/10.1287/mnsc.2020.3605

References

  • Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems 24 (Curran Associates, Inc., New York), 2312–2320.Google Scholar
  • Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. 30th Internat. Conf. Machine Learn. (ICML 2013) (JMLR.org, Atlanta), 127–135.Google Scholar
  • Agrawal S, Avadhanula V, Goyal V, Zeevi A (2019) MNL-bandit: A dynamic learning approach to assortment selection. Oper. Res. 67(5):1453–1485.Google Scholar
  • Auer P (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3:397–422.Google Scholar
  • Ban G-Y, Keskin NB (2020) Personalized dynamic pricing with machine learning: High dimensional features and heterogeneous elasticity. Management Sci. Forthcoming.Google Scholar
  • Bastani H, Bayati M (2019) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.Google Scholar
  • Bastani H, Simchi-Levi D, Zhu R (2019) Meta dynamic pricing: Learning across experiments. Preprint, submitted February 14, http://dx.doi.org/10.2139/ssrn.3334629.Google Scholar
  • Bastani H, Harsha P, Perakis G, Singhvi D (2018) Learning personalized product recommendations with customer disengagement. Preprint, submitted September 13, http://dx.doi.org/10.2139/ssrn.3240970.Google Scholar
  • Bietti A, Agarwal A, Langford J (2018) A contextual bandit bake-off. Preprint, submitted August 29, https://arxiv.org/abs/1802.04064.Google Scholar
  • Bird S, Barocas S, Crawford K, Diaz F, Wallach H (2016) Exploring or exploiting? Social and ethical implications of autonomous experimentation in AI. Preprint, submitted October 4, https://ssrn.com/abstract=2846909.Google Scholar
  • Broder J, Rusmevichientong P (2012) Dynamic pricing under a general parametric choice model. Oper. Res. 60(4):965–980.LinkGoogle Scholar
  • Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Google Scholar
  • Chen K, Hu I, Ying Z (1999) Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Ann. Statist. 27(4):1155–1163.CrossrefGoogle Scholar
  • Chick SE, Gans N, Yapar O (2018) Bayesian sequential learning for clinical trials of multiple correlated medical interventions. Working paper, INSEAD, Cedex, France.Google Scholar
  • Chu W, Li L, Reyzin L, Schapire RE (2011) Contextual bandits with linear payoff functions. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (AISTATS) 2011 (Fort Lauderdale, FL), 208–214.Google Scholar
  • Cohen MC, Lobel I, Leme RP (2016) Feature-based dynamic pricing. Preprint, submitted February 23, https://dx.doi.org/10.2139/ssrn.2737045.Google Scholar
  • Dani V, Hayes TP, Kakade SM (2008) Stochastic linear optimization under bandit feedback. 21st Annual Conf. Learn. Theory (Helsinki, Finland), 355–366.Google Scholar
  • den Boer AV, Zwart B (2013) Simultaneously learning and optimizing using controlled variance pricing. Management Sci. 60(3):770–783.LinkGoogle Scholar
  • Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A, eds. Adv. Neural Inform. Processing Systems 23 (Curran Associates, Inc., New York), 586–594.Google Scholar
  • Gittins JC (1979) Bandit processes and dynamic allocation indices. J. Royal Statist. Soc. B 41(2):148–164.CrossrefGoogle Scholar
  • Goldenshluger A, Zeevi A (2009) Woodroofe’s one-armed bandit problem revisited. Ann. Appl. Probab. 19(4):1603–1633.CrossrefGoogle Scholar
  • Goldenshluger A, Zeevi A (2013) A linear response bandit problem. Stochastic Systems 3(1):230–261.LinkGoogle Scholar
  • Gutin E, Farias V (2016) Optimistic Gittins indices. Lee DD, Sugiyama M, Luxburg UV, Guyon I, and Garnett R, eds. Adv. Neural Inform. Processing Systems 29 (Curran Associates, Inc., New York),  3153–3161.Google Scholar
  • International Warfarin Pharmacogenetics Consortium (2009) Estimation of the warfarin dose with clinical and pharmacogenetic data. New England J. Medicine 360(8):753.Google Scholar
  • Javanmard A, Nazerzadeh H (2019) Dynamic pricing in high-dimensions. J. Machine Learn. Res. 20(1):315–363.Google Scholar
  • Kallus N, Udell M (2016) Dynamic assortment personalization in high dimensions. Preprint, submitted October 18, https://arxiv.org/abs/1610.05604.Google Scholar
  • Kallus N, Zhou A (2018) Policy evaluation and optimization with continuous treatments. Preprint, submitted February 16, https://arxiv.org/abs/1802.06037.Google Scholar
  • Kannan S, Morgenstern J, Roth A, Waggoner B, Wu ZS (2018) A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. Adv. Neural Inform. Processing Systems (NeurIPS) 2018(December):2227–2236.Google Scholar
  • Kazerouni A, Ghavamzadeh M, Abbasi-Yadkori Y, Van Roy B (2017) Conservative contextual linear bandits. Adv. Neural Inform. Processing Systems (NeurIPS) 30(2017):3910–3919.Google Scholar
  • Keskin KB, Zeevi A (2014) Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Oper. Res. 62(5):1142–1167.LinkGoogle Scholar
  • Keskin KB, Zeevi A (2018) On incomplete learning and certainty-equivalence control. Oper. Res. 66(4):1136–1167.Google Scholar
  • Kim ES, Herbst RS, Wistuba II, Lee JJ, Blumenschein GR, Tsao A, Stewart DJ, et al.. (2011) The BATTLE trial: Personalizing therapy for lung cancer. Cancer Discovery 1(1):44–53.CrossrefGoogle Scholar
  • Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.CrossrefGoogle Scholar
  • Langford J, Zhang T (2008) The epoch-greedy algorithm for multi-armed bandits with side information. Platt JC, Koller D, Singer Y, Roweis ST, eds. Proc. 20th Internat. Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates Inc., New York), 817–824.Google Scholar
  • Lattimore T, Munos R (2014) Bounded regret for finite-armed structured bandits. Adv. Neural Inform. Processing Systems 27:550–558.Google Scholar
  • Lehmann EL, Casella G (1998) Theory of Point Estimation (Springer Verlag, New York).Google Scholar
  • Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. Proc. 34th Internat. Conf. Machine Learn. (Curran Associates Inc., New York), 2071–2080.Google Scholar
  • Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web (ACM, New York), 661–670.Google Scholar
  • McCullagh P, Nelder JA (1989) Generalized Linear Models, 2nd ed. (Chapman & Hall, London).CrossrefGoogle Scholar
  • Mersereau AJ, Rusmevichientong R, Tsitsiklis JN (2009) A structured multiarmed bandit problem and the greedy policy. IEEE Trans. Automatated Control 54(12):2787–2802.CrossrefGoogle Scholar
  • Mintz Y, Aswani A, Kaminsky P, Flowers E, Fukuoka Y (2017) Non-stationary bandits with habituation and recovery dynamics. Preprint, submitted July 26, https://arxiv.org/abs/1707.08423.Google Scholar
  • Narendra KS, Annaswamy AM (1987) Persistent excitation in adaptive systems. Internat. J. Control 45(1):127–160.CrossrefGoogle Scholar
  • Nguyen NT (2018) Model-Reference Adaptive Control (Springer, New York).CrossrefGoogle Scholar
  • Qiang S, Bayati M (2016) Dynamic pricing with demand covariates. Preprint, submitted April 14, http://dx.doi.org/10.2139/ssrn.2765257.Google Scholar
  • Russo D (2019) A note on the equivalence of upper confidence bounds and gittins indices for patient agents. Preprint, submitted April 9, https://arxiv.org/abs/1904.04732.Google Scholar
  • Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.LinkGoogle Scholar
  • Russo D, Van Roy B (2018) Learning to optimize via information-directed sampling. Oper. Res. 66(1):230–252.Google Scholar
  • Sarkar J (1991) One-armed bandit problems with covariates. Ann. Statist. 19(4):1978–2002.CrossrefGoogle Scholar
  • Tewari A, Murphy SA (2017) From ads to interventions: Contextual bandits in mobile health. Rehg J, Murphy S, Kumar S, eds. Mobile Health (Springer, New York), 495–517.Google Scholar
  • Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.CrossrefGoogle Scholar
  • Tropp JA (2011) User-friendly tail bounds for matrix martingales. Technical Report TR-2011-01, California Institute of Technology, Pasadena.Google Scholar
  • Tsybakov AB (2004) Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32(1):135–166.Google Scholar
  • Wainwright M (2016) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).Google Scholar
  • Wang C-C, Kulkarni SR, Poor HV (2005a) Arbitrary side observations in bandit problems. Adv. Appl. Math. 34(4):903–938.CrossrefGoogle Scholar
  • Wang C-C, Kulkarni SR, Poor HV (2005b) Bandit problems with side observations. IEEE Trans. Automated Control 50(3):338–355.CrossrefGoogle Scholar
  • Woodroofe M (1979) A one-armed bandit problem with a concomitant variable. J. Amer. Statist. Assoc. 74(368):799–806.CrossrefGoogle Scholar
  • Wu Y, Shariff R, Lattimore T, Szepesvari C (2016) Conservative bandits. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., vol. 48 (JMLR.org, New York), 1254–1262.Google Scholar
  • Zhou Z, Wang Y, Mamani H, Coffey DG (2019) How do tumor cytogenetics inform cancer treatments? Dynamic risk stratification and precision medicine using multi-armed bandits. Preprint, submitted June 17, https://dx.doi.org/10.2139/ssrn.3405082.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.