Distributionally Robust Batch Contextual Bandits

Published Online:https://doi.org/10.1287/mnsc.2023.4678

References

  • Abadeh SS, Nguyen VA, Kuhn D, Esfahani PM (2018) Wasserstein distributionally robust Kalman filtering. Bengio S et al., eds. Adv. Neural Inform. Proc. Systems 31 (Curran Associates, Red Hook, NY), 8483–8492.Google Scholar
  • Abeille M, Lazaric A (2017) Linear Thompson sampling revisited. Electronic. J. Statist. 11(2):5165–5197.CrossrefGoogle Scholar
  • Agrawal S, Goyal N (2013a) Further optimal regret bounds for Thompson sampling. Carvalho CM, Ravikumar P, eds. Proc. Sixteenth Inter. Conf. on Artificial Intelligence and Statistics (PMLR), 99–107.Google Scholar
  • Agrawal S, Goyal N (2013b) Thompson sampling for contextual bandits with linear payoffs. Proc. Internat. Conf. on Machine Learn. (PMLR), 127–135.Google Scholar
  • Athey S, Wager S (2017) Efficient policy learning. Technical report.Google Scholar
  • Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.LinkGoogle Scholar
  • Bayraksan G, Love DK (2015) Data-driven stochastic programming using phi-divergences. The Operations Research Revolution (Institute for Operations Research and the Management Sciences, Catonsville, MD), 1–19.LinkGoogle Scholar
  • Ben-david S, Cesabianchi N, Haussler D, Long PM (1995) Characterizations of learnability for classes of (0,…,n)-valued functions. J. Comput. System Sci. 50(1):74–86.CrossrefGoogle Scholar
  • Bertsimas D, Dunn J (2017) Optimal classification trees. Machine Learn. 106(7):1039–1082.CrossrefGoogle Scholar
  • Bertsimas D, Mersereau AJ (2007) A learning approach for interactive marketing to a customer segment. Oper. Res. 55(6):1120–1135.LinkGoogle Scholar
  • Bertsimas D, Sim M (2004) The price of robustness. Oper. Res. 52(1):35–53.LinkGoogle Scholar
  • Blanchet J, Murthy K (2019) Quantifying distributional model risk via optimal transport. Math. Oper. Res. 44(2):565–600.LinkGoogle Scholar
  • Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.CrossrefGoogle Scholar
  • Chapelle O (2014) Modeling delayed feedback in display advertising. Proc. 20th ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining (ACM, New York), 1097–1105.Google Scholar
  • Chen Z, Kuhn D, Wiesemann W (2022) Data-driven chance constrained programs over Wasserstein balls. Oper. Res., ePub ahead of print July 21, https://doi.org/10.1287/opre.2022.2330.Google Scholar
  • Chernozhukov V, Demirer M, Lewis G, Syrgkanis V (2019) Semi-parametric efficient policy learning with continuous actions. Adv. Neural Inform. Proc. Systems 32 (Curran Associates, Red Hook, NY), 14986–14996.Google Scholar
  • Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. Proc. 14th Internat. Conf. on Artificial Intelligence and Statist. (PMLR), 208–214.Google Scholar
  • Cressie N, Read TR (1984) Multinomial goodness-of-fit tests. J. Royal Statist. Soc. B 46(3):440–464.Google Scholar
  • Daniely A, Sabato S, Shwartz SS (2012) Multiclass learning approaches: A theoretical comparison with implications. Adv. Neural Inform. Proc. Systems 25 (Curran Associates, Red Hook, NY), 485–493.Google Scholar
  • Daniely A, Sabato S, Ben-David S, Shalev-Shwartz S (2011) Multiclass learnability and the ERM principle. Proc. 24th Annual Conf. on Learn. Theory (PMLR), 207–232.Google Scholar
  • Delage E, Ye Y (2010) Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 58(3):595–612.LinkGoogle Scholar
  • Dimakopoulou M, Ren Z, Zhou Z (2021) Online multi-armed bandits with adaptive inference. Adv. Neural Inform. Processing Systems 34:1939–1951.Google Scholar
  • Dimakopoulou M, Zhou Z, Athey S, Imbens G (2017) Estimation considerations in contextual bandits. Working paper, Stanford University, Stanford, CA.Google Scholar
  • Dimakopoulou M, Zhou Z, Athey S, Imbens G (2019) Balanced linear contextual bandits. Proc. Conf. AAAI Artificial Intelligence (AAAI Press, Palo Alto, CA) 33:3445–3453.Google Scholar
  • Duchi J, Namkoong H (2019) Variance-based regularization with convex objectives. J. Machine Learn. Res. 20(1):2450–2504.Google Scholar
  • Duchi J, Hashimoto T, Namkoong H (2022) Distributionally robust losses for latent covariate mixtures. Oper. Res. ePub ahead of print September 2, https://doi.org/10.1287/opre.2022.2363.Google Scholar
  • Duchi JC, Namkoong H (2021) Learning models with uniform performance via distributionally robust optimization. Ann. Statist. 49(3):1378–1406.CrossrefGoogle Scholar
  • Duchi JC, Glynn PW, Namkoong H (2021) Statistics of robust optimization: A generalized empirical likelihood approach. Math. Oper. Res. 46(3):946–969.LinkGoogle Scholar
  • Dudík M, Langford J, Li L (2011) Doubly robust policy evaluation and learning. Proc. 28th Internat. Conf. on Machine Learn. (Omnipress, Madison, WI), 1097–1104.Google Scholar
  • Dudley RM (1967) The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Functional Anal. 1(3):290–330.CrossrefGoogle Scholar
  • Faury L, Tanielian U, Dohmatob E, Smirnova E, Vasile F (2020) Distributionally robust counterfactual risk minimization. Proc. Conf. AAAI Artificial Intelligence (AAAI, Palo Alto, CA) 34:3850–3857.CrossrefGoogle Scholar
  • Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Adv. Neural Inform. Proc. Systems 23 (Curran Associates, Red Hook, NY), 586–594.Google Scholar
  • Friedman J, Hastie T, Tibshirani R (2001) The Elements of Statistical Learning, vol. 1 (Springer, New York).Google Scholar
  • Gao R, Kleywegt AJ (2022) Distributionally robust stochastic optimization with Wasserstein distance. Math. Oper. Res., ePub ahead of print August 5, https://doi.org/10.1287/moor.2022.1275.Google Scholar
  • Gao R, Xie L, Xie Y, Xu H (2018) Robust hypothesis testing using Wasserstein uncertainty sets. Adv. Neural Inform. Proc. Systems 31 (Curran Associates, Red Hook, NY), 7902–7912.Google Scholar
  • Gerber AS, Green DP, Larimer CW (2008) Social pressure and voter turnout: Evidence from a large-scale field experiment. Amer. Political Sci. Rev. 102(1):33–48.CrossrefGoogle Scholar
  • Ghosh S, Lam H (2019) Robust analysis in stochastic simulation: Computation and performance guarantees. Oper. Res. 67(1):232–249.LinkGoogle Scholar
  • Goldenshluger A, Zeevi A (2013) A linear response bandit problem. Stochastic Systems 3(1):230–261.LinkGoogle Scholar
  • Hamming RW (1950) Error detecting and error correcting codes. Bell Systems Tech. J. 29(2):147–160.CrossrefGoogle Scholar
  • Han Y, Zhou Z, Zhou Z, Blanchet J, Glynn PW, Ye Y (2020) Sequential batch learning in finite-action linear contextual bandits. Preprint, submitted April 14, https://arxiv.org/abs/2004.06321.Google Scholar
  • Haussler D (1995) Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. J. Combinational Theory Ser. A 69(2):217–232.CrossrefGoogle Scholar
  • Ho-Nguyen N, Kilinç-Karzan F, Küçükyavuz S, Lee D (2022) Distributionally robust chance-constrained programs with right-hand side uncertainty under Wasserstein ambiguity. Math. Programming 196:641–672.CrossrefGoogle Scholar
  • Hu Z, Hong LJ (2013) Kullback-Leibler Divergence Constrained Distributionally Robust Optimization (Optimization Online).Google Scholar
  • Imbens G, Rubin D (2015) Causal Inference in Statistics, Social, and Biomedical Sciences. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Imbens GW (2004) Nonparametric estimation of average treatment effects under exogeneity: A review. Rev. Econom. Statist. 86(1):4–29.CrossrefGoogle Scholar
  • Joachims T, Swaminathan A, Rijke MD (2018) Deep learning with logged bandit feedback. Proc. Internat. Conf. on Learn. Representations.Google Scholar
  • Jun KS, Bhargava A, Nowak R, Willett R (2017) Scalable generalized linear bandits: Online computation and hashing. Adv. Neural Inform. Proc. Systems 30 (Curran Associates, Red Hook, NY), 99–109.Google Scholar
  • Kallus N (2018) Balanced policy evaluation and learning. Adv. Neural Inform. Proc. Systems 31 (Curran Associates, Red Hook, NY), 8895–8906.Google Scholar
  • Kallus N, Zhou A (2018) Confounding-robust policy improvement. Adv. Neural Inform. Proc. Systems 31 (Curran Associates, Red Hook, NY), 31.Google Scholar
  • Kitagawa T, Tetenov A (2018) Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica 86(2):591–616.CrossrefGoogle Scholar
  • Lam H (2019) Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Oper. Res. 67(4):1090–1105.AbstractGoogle Scholar
  • Lam H, Zhou E (2017) The empirical likelihood approach to quantifying uncertainty in sample average approximation. Oper. Res. Lett. 45(4):301–307.CrossrefGoogle Scholar
  • Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Lee J, Raginsky M (2018) Minimax statistical learning with Wasserstein distances. Proc. 32nd Internat. Conf. on Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 2692–2701.Google Scholar
  • Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. Proc. 34th Internat. Conf. on Machine Learn., vol. 70, 2071–2080.Google Scholar
  • Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. on World Wide Web (ACM, New York), 661–670.Google Scholar
  • Liu Z, Bai Q, Blanchet J, Dong P, Xu W, Zhou Z, Zhou Z (2022) Distributional robust Q-learning. Proc. Internat. Conf. on Machine Learn. (PMLR).Google Scholar
  • Luenberger DG, Ye Y (2008) Linear and Nonlinear Programming, International Series in Operations Research & Management Science, vol. 116 (Springer, New York), 546.Google Scholar
  • Mohajerin Esfahani P, Kuhn D (2018) Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Math. Programming 171(1):115–166.CrossrefGoogle Scholar
  • Namkoong H, Duchi JC (2016) Stochastic gradient methods for distributionally robust optimization with f-divergences. Proc. 30th Internat. Conf. on Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 2216–2224.Google Scholar
  • Natarajan BK (1989) On learning sets and functions. Machine Learn. 4(1):67–97.CrossrefGoogle Scholar
  • Nguyen VA, Kuhn D, Mohajerin Esfahani P (2022) Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator. Oper. Res. 70(1):490–515.LinkGoogle Scholar
  • Panaganti K, Kalathil D (2022) Sample complexity of robust reinforcement learning with a generative model. Proc. Internat. Conf. on Artificial Intelligence and Statist. (PMLR), 9582–9602.Google Scholar
  • Rakhlin A, Sridharan K (2016) BISTRO: An efficient relaxation-based method for contextual bandits. Proc. Internat. Conf. on Machine Learn. (JMLR), 1977–1985.Google Scholar
  • Ren Z, Zhou Z (2020) Dynamic batch learning in high-dimensional sparse linear contextual bandits. Preprint, submitted August 27, https://arxiv.org/abs/ 2008.11918.Google Scholar
  • Rigollet P, Zeevi A (2010) Nonparametric bandits with covariates. Preprint, submitted March 8, https://arxiv.org/abs/ 1003.1630.Google Scholar
  • Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41–55.CrossrefGoogle Scholar
  • Ruder S (2016) An overview of gradient descent optimization algorithms. Preprint, submitted September 15, https://arxiv.org/abs/ 1609.04747.Google Scholar
  • Rusmevichientong P, Tsitsiklis JN (2010) Linearly parameterized bandits. Math. Oper. Res. 35(2):395–411.LinkGoogle Scholar
  • Schwartz EM, Bradlow ET, Fader PS (2017) Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Sci. 36(4):500–522.LinkGoogle Scholar
  • Shafieezadeh-Abadeh S, Esfahani P, Kuhn D (2015) Distributionally robust logistic regression. Adv. Neural Inform. Processing Systems 28:1576–1584.Google Scholar
  • Shapiro A (2017) Distributionally robust stochastic programming. SIAM J. Optim. 27(4):2258–2275.CrossrefGoogle Scholar
  • Si N, Zhang F, Zhou Z, Blanchet J (2020) Distributionally robust policy evaluation and learning in offline contextual bandits. Proc. Internat. Conf. on Machine Learn. (PMLR), 8884–8894.Google Scholar
  • Sinha A, Namkoong H, Duchi J (2018) Certifying some distributional robustness with principled adversarial training. Proc. Internat. Conf. on Learn. Representations (OpenReview.net).Google Scholar
  • Slivkins A (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learn. 12(1–2):1–286.Google Scholar
  • Smirnova E, Dohmatob E, Mary J (2019) Distributionally robust reinforcement learning. Preprint, submitted February 23, https://arxiv.org/abs/ 1902.08708.Google Scholar
  • Staib M, Jegelka S (2017) Distributionally robust deep learning as a generalization of adversarial training. 31st Conf. Neural Inform. Proc. Systems (Curran Associates, Red Hook, NY).Google Scholar
  • Swaminathan A, Joachims T (2015a) Batch learning from logged bandit feedback through counterfactual risk minimization. J. Machine Learn. Res. 16:1731–1755.Google Scholar
  • Swaminathan A, Joachims T (2015b) The self-normalized estimator for counterfactual learning. Advances in Neural Information Processing Systems (Citeseer), 3231–3239.Google Scholar
  • Vapnik V, Chervonenkis AY (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16(2):264.CrossrefGoogle Scholar
  • Volpi R, Namkoong H, Sener O, Duchi JC, Murino V, Savarese S (2018) Generalizing to unseen domains via adversarial data augmentation. Adv. Neural Inform. Proc. Systems 31 (Curran Associates, Red Hook, NY), 31.Google Scholar
  • Yang I (2020) Wasserstein distributionally robust stochastic control: A data-driven approach. IEEE Trans. Automated Control 66(8):3863–3870.CrossrefGoogle Scholar
  • Zhan R, Ren Z, Athey S, Zhou Z (2021) Policy learning with adaptively collected data. Preprint, submitted May 5, https://arxiv.org/abs/ 2105.02344.Google Scholar
  • Zhang B, Tsiatis AA, Davidian M, Zhang M, Laber E (2012) Estimating optimal treatment regimes from a classification perspective. Statistics 1(1):103–114.CrossrefGoogle Scholar
  • Zhao C, Guan Y (2018) Data-driven risk-averse stochastic optimization with Wasserstein metric. Oper. Res. Lett. 46(2):262–267.CrossrefGoogle Scholar
  • Zhao C, Jiang R (2017) Distributionally robust contingency-constrained unit commitment. IEEE Trans. Power Systems 33(1):94–102.CrossrefGoogle Scholar
  • Zhao Y, Zeng D, Rush AJ, Kosorok MR (2012) Estimating individualized treatment rules using outcome weighted learning. J. Amer. Statist. Assoc. 107(499):1106–1118.CrossrefGoogle Scholar
  • Zhao YQ, Zeng D, Laber EB, Song R, Yuan M, Kosorok MR (2014) Doubly robust learning for estimating individualized treatment with censored data. Biometrika 102(1):151–168.CrossrefGoogle Scholar
  • Zhou X, Mayer-Hamblett N, Khan U, Kosorok MR (2017) Residual weighted learning for estimating individualized treatment rules. J. Amer. Statist. Assoc. 112(517):169–187.CrossrefGoogle Scholar
  • Zhou Z, Athey S, Wager S (2018) Offline multi-action policy learning: Generalization and optimization. Preprint, submitted October 10, https://arxiv.org/abs/ 1810.04778.Google Scholar
  • Zhou Z, Xu R, Blanchet J (2019) Learning in generalized linear contextual bandits with stochastic delays. Adv. Neural Inform. Proc. Systems 32 (Curran Associates, Red Hook, NY), 32.Google Scholar
  • Zhou Z, Zhou Z, Bai Q, Qiu L, Blanchet J, Glynn P (2021) Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. Proc. Internat. Conf. on Artificial Intelligence and Statist. (PMLR), 3331–3339.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.