Distributionally Robust Batch Contextual Bandits
Published Online: 31 Mar 2023 | https://doi.org/10.1287/mnsc.2023.4678

