Mostly Exploration-Free Algorithms for Contextual Bandits

Hamsa Bastani
Hamsa Bastani
[email protected]
https://orcid.org/0000-0002-8793-4732
Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;
Search for more papers by this author
,
Mohsen Bayati
Corresponding Author
Mohsen Bayati
[email protected]
https://orcid.org/0000-0002-7280-912X
Stanford Graduate School of Business, Stanford University, Stanford, California 94305;
Search for more papers by this author
,
Khashayar Khosravi
Khashayar Khosravi
[email protected]
https://orcid.org/0000-0003-3997-2766
Stanford University Electrical Engineering, Stanford University, Stanford, California 94305
Search for more papers by this author

Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;

Search for more papers by this author

Mohsen Bayati

Corresponding Author

Mohsen Bayati

[email protected]

https://orcid.org/0000-0002-7280-912X

Stanford Graduate School of Business, Stanford University, Stanford, California 94305;

Search for more papers by this author

Khashayar Khosravi

[email protected]

https://orcid.org/0000-0003-3997-2766

Stanford University Electrical Engineering, Stanford University, Stanford, California 94305

Search for more papers by this author

Published Online:27 Jul 2020https://doi.org/10.1287/mnsc.2020.3605

References

Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems 24 (Curran Associates, Inc., New York), 2312–2320.Google Scholar
Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. 30th Internat. Conf. Machine Learn. (ICML 2013) (JMLR.org, Atlanta), 127–135.Google Scholar
Agrawal S, Avadhanula V, Goyal V, Zeevi A (2019) MNL-bandit: A dynamic learning approach to assortment selection. Oper. Res. 67(5):1453–1485.Google Scholar
Auer P (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3:397–422.Google Scholar
Ban G-Y, Keskin NB (2020) Personalized dynamic pricing with machine learning: High dimensional features and heterogeneous elasticity. Management Sci. Forthcoming.Google Scholar
Bastani H, Bayati M (2019) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.Google Scholar
Bastani H, Simchi-Levi D, Zhu R (2019) Meta dynamic pricing: Learning across experiments. Preprint, submitted February 14, http://dx.doi.org/10.2139/ssrn.3334629.Google Scholar
Bastani H, Harsha P, Perakis G, Singhvi D (2018) Learning personalized product recommendations with customer disengagement. Preprint, submitted September 13, http://dx.doi.org/10.2139/ssrn.3240970.Google Scholar
Bietti A, Agarwal A, Langford J (2018) A contextual bandit bake-off. Preprint, submitted August 29, https://arxiv.org/abs/1802.04064.Google Scholar
Bird S, Barocas S, Crawford K, Diaz F, Wallach H (2016) Exploring or exploiting? Social and ethical implications of autonomous experimentation in AI. Preprint, submitted October 4, https://ssrn.com/abstract=2846909.Google Scholar
Broder J, Rusmevichientong P (2012) Dynamic pricing under a general parametric choice model. Oper. Res. 60(4):965–980.Link, Google Scholar
Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Google Scholar
Chen K, Hu I, Ying Z (1999) Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Ann. Statist. 27(4):1155–1163.Crossref, Google Scholar
Chick SE, Gans N, Yapar O (2018) Bayesian sequential learning for clinical trials of multiple correlated medical interventions. Working paper, INSEAD, Cedex, France.Google Scholar
Chu W, Li L, Reyzin L, Schapire RE (2011) Contextual bandits with linear payoff functions. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (AISTATS) 2011 (Fort Lauderdale, FL), 208–214.Google Scholar
Cohen MC, Lobel I, Leme RP (2016) Feature-based dynamic pricing. Preprint, submitted February 23, https://dx.doi.org/10.2139/ssrn.2737045.Google Scholar
Dani V, Hayes TP, Kakade SM (2008) Stochastic linear optimization under bandit feedback. 21st Annual Conf. Learn. Theory (Helsinki, Finland), 355–366.Google Scholar
den Boer AV, Zwart B (2013) Simultaneously learning and optimizing using controlled variance pricing. Management Sci. 60(3):770–783.Link, Google Scholar
Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A, eds. Adv. Neural Inform. Processing Systems 23 (Curran Associates, Inc., New York), 586–594.Google Scholar
Gittins JC (1979) Bandit processes and dynamic allocation indices. J. Royal Statist. Soc. B 41(2):148–164.Crossref, Google Scholar
Goldenshluger A, Zeevi A (2009) Woodroofe’s one-armed bandit problem revisited. Ann. Appl. Probab. 19(4):1603–1633.Crossref, Google Scholar
Goldenshluger A, Zeevi A (2013) A linear response bandit problem. Stochastic Systems 3(1):230–261.Link, Google Scholar
Gutin E, Farias V (2016) Optimistic Gittins indices. Lee DD, Sugiyama M, Luxburg UV, Guyon I, and Garnett R, eds. Adv. Neural Inform. Processing Systems 29 (Curran Associates, Inc., New York), 3153–3161.Google Scholar
International Warfarin Pharmacogenetics Consortium (2009) Estimation of the warfarin dose with clinical and pharmacogenetic data. New England J. Medicine 360(8):753.Google Scholar
Javanmard A, Nazerzadeh H (2019) Dynamic pricing in high-dimensions. J. Machine Learn. Res. 20(1):315–363.Google Scholar
Kallus N, Udell M (2016) Dynamic assortment personalization in high dimensions. Preprint, submitted October 18, https://arxiv.org/abs/1610.05604.Google Scholar
Kallus N, Zhou A (2018) Policy evaluation and optimization with continuous treatments. Preprint, submitted February 16, https://arxiv.org/abs/1802.06037.Google Scholar
Kannan S, Morgenstern J, Roth A, Waggoner B, Wu ZS (2018) A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. Adv. Neural Inform. Processing Systems (NeurIPS) 2018(December):2227–2236.Google Scholar
Kazerouni A, Ghavamzadeh M, Abbasi-Yadkori Y, Van Roy B (2017) Conservative contextual linear bandits. Adv. Neural Inform. Processing Systems (NeurIPS) 30(2017):3910–3919.Google Scholar
Keskin KB, Zeevi A (2014) Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Oper. Res. 62(5):1142–1167.Link, Google Scholar
Keskin KB, Zeevi A (2018) On incomplete learning and certainty-equivalence control. Oper. Res. 66(4):1136–1167.Google Scholar
Kim ES, Herbst RS, Wistuba II, Lee JJ, Blumenschein GR, Tsao A, Stewart DJ, et al.. (2011) The BATTLE trial: Personalizing therapy for lung cancer. Cancer Discovery 1(1):44–53.Crossref, Google Scholar
Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
Langford J, Zhang T (2008) The epoch-greedy algorithm for multi-armed bandits with side information. Platt JC, Koller D, Singer Y, Roweis ST, eds. Proc. 20th Internat. Conf. Neural Inform. Processing Systems (NeurIPS) (Curran Associates Inc., New York), 817–824.Google Scholar
Lattimore T, Munos R (2014) Bounded regret for finite-armed structured bandits. Adv. Neural Inform. Processing Systems 27:550–558.Google Scholar
Lehmann EL, Casella G (1998) Theory of Point Estimation (Springer Verlag, New York).Google Scholar
Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. Proc. 34th Internat. Conf. Machine Learn. (Curran Associates Inc., New York), 2071–2080.Google Scholar
Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web (ACM, New York), 661–670.Google Scholar
McCullagh P, Nelder JA (1989) Generalized Linear Models, 2nd ed. (Chapman & Hall, London).Crossref, Google Scholar
Mersereau AJ, Rusmevichientong R, Tsitsiklis JN (2009) A structured multiarmed bandit problem and the greedy policy. IEEE Trans. Automatated Control 54(12):2787–2802.Crossref, Google Scholar
Mintz Y, Aswani A, Kaminsky P, Flowers E, Fukuoka Y (2017) Non-stationary bandits with habituation and recovery dynamics. Preprint, submitted July 26, https://arxiv.org/abs/1707.08423.Google Scholar
Narendra KS, Annaswamy AM (1987) Persistent excitation in adaptive systems. Internat. J. Control 45(1):127–160.Crossref, Google Scholar
Nguyen NT (2018) Model-Reference Adaptive Control (Springer, New York).Crossref, Google Scholar
Qiang S, Bayati M (2016) Dynamic pricing with demand covariates. Preprint, submitted April 14, http://dx.doi.org/10.2139/ssrn.2765257.Google Scholar
Russo D (2019) A note on the equivalence of upper confidence bounds and gittins indices for patient agents. Preprint, submitted April 9, https://arxiv.org/abs/1904.04732.Google Scholar
Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
Russo D, Van Roy B (2018) Learning to optimize via information-directed sampling. Oper. Res. 66(1):230–252.Google Scholar
Sarkar J (1991) One-armed bandit problems with covariates. Ann. Statist. 19(4):1978–2002.Crossref, Google Scholar
Tewari A, Murphy SA (2017) From ads to interventions: Contextual bandits in mobile health. Rehg J, Murphy S, Kumar S, eds. Mobile Health (Springer, New York), 495–517.Google Scholar
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
Tropp JA (2011) User-friendly tail bounds for matrix martingales. Technical Report TR-2011-01, California Institute of Technology, Pasadena.Google Scholar
Tsybakov AB (2004) Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32(1):135–166.Google Scholar
Wainwright M (2016) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).Google Scholar
Wang C-C, Kulkarni SR, Poor HV (2005a) Arbitrary side observations in bandit problems. Adv. Appl. Math. 34(4):903–938.Crossref, Google Scholar
Wang C-C, Kulkarni SR, Poor HV (2005b) Bandit problems with side observations. IEEE Trans. Automated Control 50(3):338–355.Crossref, Google Scholar
Woodroofe M (1979) A one-armed bandit problem with a concomitant variable. J. Amer. Statist. Assoc. 74(368):799–806.Crossref, Google Scholar
Wu Y, Shariff R, Lattimore T, Szepesvari C (2016) Conservative bandits. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., vol. 48 (JMLR.org, New York), 1254–1262.Google Scholar
Zhou Z, Wang Y, Mamani H, Coffey DG (2019) How do tumor cytogenetics inform cancer treatments? Dynamic risk stratification and precision medicine using multi-armed bandits. Preprint, submitted June 17, https://dx.doi.org/10.2139/ssrn.3405082.Google Scholar

Volume 67, Issue 3

March 2021

Pages 1329-1992, iii-iv

Article Information

Supplemental Material

Metrics

Information

Received:October 02, 2018
Accepted:December 24, 2019
Published Online:July 27, 2020

Cite as

Hamsa Bastani, Mohsen Bayati, Khashayar Khosravi (2020) Mostly Exploration-Free Algorithms for Contextual Bandits. Management Science 67(3):1329-1349.

https://doi.org/10.1287/mnsc.2020.3605

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Mostly Exploration-Free Algorithms for Contextual Bandits

References

Volume 67, Issue 3

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News