Distributionally Robust Batch Contextual Bandits

Nian Si
Nian Si
[email protected]
https://orcid.org/0000-0002-4730-543X
Department of Management Science & Engineering, Stanford University, Stanford, California 94305;
Search for more papers by this author
,
Fan Zhang
Fan Zhang
[email protected]
Department of Management Science & Engineering, Stanford University, Stanford, California 94305;
Search for more papers by this author
,
Zhengyuan Zhou
Corresponding Author
Zhengyuan Zhou
[email protected]
https://orcid.org/0000-0002-0005-9411
Stern School of Business, New York University, New York, New York 10012
Search for more papers by this author
,
Jose Blanchet
Jose Blanchet
[email protected]
https://orcid.org/0000-0001-5895-0912
Department of Management Science & Engineering, Stanford University, Stanford, California 94305;
Search for more papers by this author

Department of Management Science & Engineering, Stanford University, Stanford, California 94305;

Search for more papers by this author

Fan Zhang

[email protected]

Department of Management Science & Engineering, Stanford University, Stanford, California 94305;

Search for more papers by this author

Zhengyuan Zhou

Corresponding Author

Zhengyuan Zhou

[email protected]

https://orcid.org/0000-0002-0005-9411

Stern School of Business, New York University, New York, New York 10012

Search for more papers by this author

Jose Blanchet

[email protected]

https://orcid.org/0000-0001-5895-0912

Department of Management Science & Engineering, Stanford University, Stanford, California 94305;

Search for more papers by this author

Published Online:31 Mar 2023https://doi.org/10.1287/mnsc.2023.4678

References

Abadeh SS, Nguyen VA, Kuhn D, Esfahani PM (2018) Wasserstein distributionally robust Kalman filtering. Bengio S et al., eds. Adv. Neural Inform. Proc. Systems 31 (Curran Associates, Red Hook, NY), 8483–8492.Google Scholar
Abeille M, Lazaric A (2017) Linear Thompson sampling revisited. Electronic. J. Statist. 11(2):5165–5197.Crossref, Google Scholar
Agrawal S, Goyal N (2013a) Further optimal regret bounds for Thompson sampling. Carvalho CM, Ravikumar P, eds. Proc. Sixteenth Inter. Conf. on Artificial Intelligence and Statistics (PMLR), 99–107.Google Scholar
Agrawal S, Goyal N (2013b) Thompson sampling for contextual bandits with linear payoffs. Proc. Internat. Conf. on Machine Learn. (PMLR), 127–135.Google Scholar
Athey S, Wager S (2017) Efficient policy learning. Technical report.Google Scholar
Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.Link, Google Scholar
Bayraksan G, Love DK (2015) Data-driven stochastic programming using phi-divergences. The Operations Research Revolution (Institute for Operations Research and the Management Sciences, Catonsville, MD), 1–19.Link, Google Scholar
Ben-david S, Cesabianchi N, Haussler D, Long PM (1995) Characterizations of learnability for classes of (0,…,n)-valued functions. J. Comput. System Sci. 50(1):74–86.Crossref, Google Scholar
Bertsimas D, Dunn J (2017) Optimal classification trees. Machine Learn. 106(7):1039–1082.Crossref, Google Scholar
Bertsimas D, Mersereau AJ (2007) A learning approach for interactive marketing to a customer segment. Oper. Res. 55(6):1120–1135.Link, Google Scholar
Bertsimas D, Sim M (2004) The price of robustness. Oper. Res. 52(1):35–53.Link, Google Scholar
Blanchet J, Murthy K (2019) Quantifying distributional model risk via optimal transport. Math. Oper. Res. 44(2):565–600.Link, Google Scholar
Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Crossref, Google Scholar
Chapelle O (2014) Modeling delayed feedback in display advertising. Proc. 20th ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining (ACM, New York), 1097–1105.Google Scholar
Chen Z, Kuhn D, Wiesemann W (2022) Data-driven chance constrained programs over Wasserstein balls. Oper. Res., ePub ahead of print July 21, https://doi.org/10.1287/opre.2022.2330.Google Scholar
Chernozhukov V, Demirer M, Lewis G, Syrgkanis V (2019) Semi-parametric efficient policy learning with continuous actions. Adv. Neural Inform. Proc. Systems 32 (Curran Associates, Red Hook, NY), 14986–14996.Google Scholar
Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. Proc. 14th Internat. Conf. on Artificial Intelligence and Statist. (PMLR), 208–214.Google Scholar
Cressie N, Read TR (1984) Multinomial goodness-of-fit tests. J. Royal Statist. Soc. B 46(3):440–464.Google Scholar
Daniely A, Sabato S, Shwartz SS (2012) Multiclass learning approaches: A theoretical comparison with implications. Adv. Neural Inform. Proc. Systems 25 (Curran Associates, Red Hook, NY), 485–493.Google Scholar
Daniely A, Sabato S, Ben-David S, Shalev-Shwartz S (2011) Multiclass learnability and the ERM principle. Proc. 24th Annual Conf. on Learn. Theory (PMLR), 207–232.Google Scholar
Delage E, Ye Y (2010) Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 58(3):595–612.Link, Google Scholar
Dimakopoulou M, Ren Z, Zhou Z (2021) Online multi-armed bandits with adaptive inference. Adv. Neural Inform. Processing Systems 34:1939–1951.Google Scholar
Dimakopoulou M, Zhou Z, Athey S, Imbens G (2017) Estimation considerations in contextual bandits. Working paper, Stanford University, Stanford, CA.Google Scholar
Dimakopoulou M, Zhou Z, Athey S, Imbens G (2019) Balanced linear contextual bandits. Proc. Conf. AAAI Artificial Intelligence (AAAI Press, Palo Alto, CA) 33:3445–3453.Google Scholar
Duchi J, Namkoong H (2019) Variance-based regularization with convex objectives. J. Machine Learn. Res. 20(1):2450–2504.Google Scholar
Duchi J, Hashimoto T, Namkoong H (2022) Distributionally robust losses for latent covariate mixtures. Oper. Res. ePub ahead of print September 2, https://doi.org/10.1287/opre.2022.2363.Google Scholar
Duchi JC, Namkoong H (2021) Learning models with uniform performance via distributionally robust optimization. Ann. Statist. 49(3):1378–1406.Crossref, Google Scholar
Duchi JC, Glynn PW, Namkoong H (2021) Statistics of robust optimization: A generalized empirical likelihood approach. Math. Oper. Res. 46(3):946–969.Link, Google Scholar
Dudík M, Langford J, Li L (2011) Doubly robust policy evaluation and learning. Proc. 28th Internat. Conf. on Machine Learn. (Omnipress, Madison, WI), 1097–1104.Google Scholar
Dudley RM (1967) The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Functional Anal. 1(3):290–330.Crossref, Google Scholar
Faury L, Tanielian U, Dohmatob E, Smirnova E, Vasile F (2020) Distributionally robust counterfactual risk minimization. Proc. Conf. AAAI Artificial Intelligence (AAAI, Palo Alto, CA) 34:3850–3857.Crossref, Google Scholar
Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Adv. Neural Inform. Proc. Systems 23 (Curran Associates, Red Hook, NY), 586–594.Google Scholar
Friedman J, Hastie T, Tibshirani R (2001) The Elements of Statistical Learning, vol. 1 (Springer, New York).Google Scholar
Gao R, Kleywegt AJ (2022) Distributionally robust stochastic optimization with Wasserstein distance. Math. Oper. Res., ePub ahead of print August 5, https://doi.org/10.1287/moor.2022.1275.Google Scholar
Gao R, Xie L, Xie Y, Xu H (2018) Robust hypothesis testing using Wasserstein uncertainty sets. Adv. Neural Inform. Proc. Systems 31 (Curran Associates, Red Hook, NY), 7902–7912.Google Scholar
Gerber AS, Green DP, Larimer CW (2008) Social pressure and voter turnout: Evidence from a large-scale field experiment. Amer. Political Sci. Rev. 102(1):33–48.Crossref, Google Scholar
Ghosh S, Lam H (2019) Robust analysis in stochastic simulation: Computation and performance guarantees. Oper. Res. 67(1):232–249.Link, Google Scholar
Goldenshluger A, Zeevi A (2013) A linear response bandit problem. Stochastic Systems 3(1):230–261.Link, Google Scholar
Hamming RW (1950) Error detecting and error correcting codes. Bell Systems Tech. J. 29(2):147–160.Crossref, Google Scholar
Han Y, Zhou Z, Zhou Z, Blanchet J, Glynn PW, Ye Y (2020) Sequential batch learning in finite-action linear contextual bandits. Preprint, submitted April 14, https://arxiv.org/abs/2004.06321.Google Scholar
Haussler D (1995) Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. J. Combinational Theory Ser. A 69(2):217–232.Crossref, Google Scholar
Ho-Nguyen N, Kilinç-Karzan F, Küçükyavuz S, Lee D (2022) Distributionally robust chance-constrained programs with right-hand side uncertainty under Wasserstein ambiguity. Math. Programming 196:641–672.Crossref, Google Scholar
Hu Z, Hong LJ (2013) Kullback-Leibler Divergence Constrained Distributionally Robust Optimization (Optimization Online).Google Scholar
Imbens G, Rubin D (2015) Causal Inference in Statistics, Social, and Biomedical Sciences. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Imbens GW (2004) Nonparametric estimation of average treatment effects under exogeneity: A review. Rev. Econom. Statist. 86(1):4–29.Crossref, Google Scholar
Joachims T, Swaminathan A, Rijke MD (2018) Deep learning with logged bandit feedback. Proc. Internat. Conf. on Learn. Representations.Google Scholar
Jun KS, Bhargava A, Nowak R, Willett R (2017) Scalable generalized linear bandits: Online computation and hashing. Adv. Neural Inform. Proc. Systems 30 (Curran Associates, Red Hook, NY), 99–109.Google Scholar
Kallus N (2018) Balanced policy evaluation and learning. Adv. Neural Inform. Proc. Systems 31 (Curran Associates, Red Hook, NY), 8895–8906.Google Scholar
Kallus N, Zhou A (2018) Confounding-robust policy improvement. Adv. Neural Inform. Proc. Systems 31 (Curran Associates, Red Hook, NY), 31.Google Scholar
Kitagawa T, Tetenov A (2018) Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica 86(2):591–616.Crossref, Google Scholar
Lam H (2019) Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Oper. Res. 67(4):1090–1105.Abstract, Google Scholar
Lam H, Zhou E (2017) The empirical likelihood approach to quantifying uncertainty in sample average approximation. Oper. Res. Lett. 45(4):301–307.Crossref, Google Scholar
Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Lee J, Raginsky M (2018) Minimax statistical learning with Wasserstein distances. Proc. 32nd Internat. Conf. on Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 2692–2701.Google Scholar
Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. Proc. 34th Internat. Conf. on Machine Learn., vol. 70, 2071–2080.Google Scholar
Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. on World Wide Web (ACM, New York), 661–670.Google Scholar
Liu Z, Bai Q, Blanchet J, Dong P, Xu W, Zhou Z, Zhou Z (2022) Distributional robust Q-learning. Proc. Internat. Conf. on Machine Learn. (PMLR).Google Scholar
Luenberger DG, Ye Y (2008) Linear and Nonlinear Programming, International Series in Operations Research & Management Science, vol. 116 (Springer, New York), 546.Google Scholar
Mohajerin Esfahani P, Kuhn D (2018) Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Math. Programming 171(1):115–166.Crossref, Google Scholar
Namkoong H, Duchi JC (2016) Stochastic gradient methods for distributionally robust optimization with f-divergences. Proc. 30th Internat. Conf. on Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 2216–2224.Google Scholar
Natarajan BK (1989) On learning sets and functions. Machine Learn. 4(1):67–97.Crossref, Google Scholar
Nguyen VA, Kuhn D, Mohajerin Esfahani P (2022) Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator. Oper. Res. 70(1):490–515.Link, Google Scholar
Panaganti K, Kalathil D (2022) Sample complexity of robust reinforcement learning with a generative model. Proc. Internat. Conf. on Artificial Intelligence and Statist. (PMLR), 9582–9602.Google Scholar
Rakhlin A, Sridharan K (2016) BISTRO: An efficient relaxation-based method for contextual bandits. Proc. Internat. Conf. on Machine Learn. (JMLR), 1977–1985.Google Scholar
Ren Z, Zhou Z (2020) Dynamic batch learning in high-dimensional sparse linear contextual bandits. Preprint, submitted August 27, https://arxiv.org/abs/ 2008.11918.Google Scholar
Rigollet P, Zeevi A (2010) Nonparametric bandits with covariates. Preprint, submitted March 8, https://arxiv.org/abs/ 1003.1630.Google Scholar
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41–55.Crossref, Google Scholar
Ruder S (2016) An overview of gradient descent optimization algorithms. Preprint, submitted September 15, https://arxiv.org/abs/ 1609.04747.Google Scholar
Rusmevichientong P, Tsitsiklis JN (2010) Linearly parameterized bandits. Math. Oper. Res. 35(2):395–411.Link, Google Scholar
Schwartz EM, Bradlow ET, Fader PS (2017) Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Sci. 36(4):500–522.Link, Google Scholar
Shafieezadeh-Abadeh S, Esfahani P, Kuhn D (2015) Distributionally robust logistic regression. Adv. Neural Inform. Processing Systems 28:1576–1584.Google Scholar
Shapiro A (2017) Distributionally robust stochastic programming. SIAM J. Optim. 27(4):2258–2275.Crossref, Google Scholar
Si N, Zhang F, Zhou Z, Blanchet J (2020) Distributionally robust policy evaluation and learning in offline contextual bandits. Proc. Internat. Conf. on Machine Learn. (PMLR), 8884–8894.Google Scholar
Sinha A, Namkoong H, Duchi J (2018) Certifying some distributional robustness with principled adversarial training. Proc. Internat. Conf. on Learn. Representations (OpenReview.net).Google Scholar
Slivkins A (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learn. 12(1–2):1–286.Google Scholar
Smirnova E, Dohmatob E, Mary J (2019) Distributionally robust reinforcement learning. Preprint, submitted February 23, https://arxiv.org/abs/ 1902.08708.Google Scholar
Staib M, Jegelka S (2017) Distributionally robust deep learning as a generalization of adversarial training. 31st Conf. Neural Inform. Proc. Systems (Curran Associates, Red Hook, NY).Google Scholar
Swaminathan A, Joachims T (2015a) Batch learning from logged bandit feedback through counterfactual risk minimization. J. Machine Learn. Res. 16:1731–1755.Google Scholar
Swaminathan A, Joachims T (2015b) The self-normalized estimator for counterfactual learning. Advances in Neural Information Processing Systems (Citeseer), 3231–3239.Google Scholar
Vapnik V, Chervonenkis AY (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16(2):264.Crossref, Google Scholar
Volpi R, Namkoong H, Sener O, Duchi JC, Murino V, Savarese S (2018) Generalizing to unseen domains via adversarial data augmentation. Adv. Neural Inform. Proc. Systems 31 (Curran Associates, Red Hook, NY), 31.Google Scholar
Yang I (2020) Wasserstein distributionally robust stochastic control: A data-driven approach. IEEE Trans. Automated Control 66(8):3863–3870.Crossref, Google Scholar
Zhan R, Ren Z, Athey S, Zhou Z (2021) Policy learning with adaptively collected data. Preprint, submitted May 5, https://arxiv.org/abs/ 2105.02344.Google Scholar
Zhang B, Tsiatis AA, Davidian M, Zhang M, Laber E (2012) Estimating optimal treatment regimes from a classification perspective. Statistics 1(1):103–114.Crossref, Google Scholar
Zhao C, Guan Y (2018) Data-driven risk-averse stochastic optimization with Wasserstein metric. Oper. Res. Lett. 46(2):262–267.Crossref, Google Scholar
Zhao C, Jiang R (2017) Distributionally robust contingency-constrained unit commitment. IEEE Trans. Power Systems 33(1):94–102.Crossref, Google Scholar
Zhao Y, Zeng D, Rush AJ, Kosorok MR (2012) Estimating individualized treatment rules using outcome weighted learning. J. Amer. Statist. Assoc. 107(499):1106–1118.Crossref, Google Scholar
Zhao YQ, Zeng D, Laber EB, Song R, Yuan M, Kosorok MR (2014) Doubly robust learning for estimating individualized treatment with censored data. Biometrika 102(1):151–168.Crossref, Google Scholar
Zhou X, Mayer-Hamblett N, Khan U, Kosorok MR (2017) Residual weighted learning for estimating individualized treatment rules. J. Amer. Statist. Assoc. 112(517):169–187.Crossref, Google Scholar
Zhou Z, Athey S, Wager S (2018) Offline multi-action policy learning: Generalization and optimization. Preprint, submitted October 10, https://arxiv.org/abs/ 1810.04778.Google Scholar
Zhou Z, Xu R, Blanchet J (2019) Learning in generalized linear contextual bandits with stochastic delays. Adv. Neural Inform. Proc. Systems 32 (Curran Associates, Red Hook, NY), 32.Google Scholar
Zhou Z, Zhou Z, Bai Q, Qiu L, Blanchet J, Glynn P (2021) Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. Proc. Internat. Conf. on Artificial Intelligence and Statist. (PMLR), 3331–3339.Google Scholar

Volume 69, Issue 10

October 2023

Pages 5695-6415, iii-iv

Article Information

Supplemental Material

Metrics

Information

Received:June 03, 2020
Accepted:May 02, 2022
Published Online:March 31, 2023

Cite as

Nian Si, Fan Zhang, Zhengyuan Zhou, Jose Blanchet (2023) Distributionally Robust Batch Contextual Bandits. Management Science 69(10):5772-5793.

https://doi.org/10.1287/mnsc.2023.4678

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Distributionally Robust Batch Contextual Bandits

References

Volume 69, Issue 10

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News