A General Framework for Bandit Problems Beyond Cumulative Objectives

Asaf Cassel
Corresponding Author
Asaf Cassel
[email protected]
https://orcid.org/0000-0003-3566-6948
School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel;
Search for more papers by this author
,
Shie Mannor
Shie Mannor
[email protected]
Faculty of Electrical and Computer Engineering and Faculty of Industrial Engineering and Management, Technion, Israel Institute of Technology, Haifa 3200003, Israel;Nvidia Research, Tel Aviv 6777506, Israel;
Search for more papers by this author
,
Assaf Zeevi
Assaf Zeevi
[email protected]
https://orcid.org/0000-0003-1075-6664
Graduate School of Business, Columbia University, New York, New York 10027;Data Science Institute, Columbia University, New York, New York 10027
Search for more papers by this author

Asaf Cassel

Corresponding Author

Asaf Cassel

[email protected]

https://orcid.org/0000-0003-3566-6948

School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel;

Search for more papers by this author

Shie Mannor

[email protected]

Faculty of Electrical and Computer Engineering and Faculty of Industrial Engineering and Management, Technion, Israel Institute of Technology, Haifa 3200003, Israel;Nvidia Research, Tel Aviv 6777506, Israel;

Search for more papers by this author

Assaf Zeevi

[email protected]

https://orcid.org/0000-0003-1075-6664

Graduate School of Business, Columbia University, New York, New York 10027;Data Science Institute, Columbia University, New York, New York 10027

Search for more papers by this author

Published Online:6 Jan 2023https://doi.org/10.1287/moor.2022.1335

References

[1] Agrawal S, Devanur NR (2019) Bandits with global convex constraints and objective. Oper. Res. 67(5):1486–1502.Link, Google Scholar
[2] Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Annual Conf. Learn. Theory, vol. 23, Proceedings of Machine Learning Research Series (PMLR, Edinburgh, UK), 39.1–39.26.Google Scholar
[3] Artzner P, Delbaen F, Eber JM, Heath D (1999) Coherent measures of risk. Math. Finance 9(3):203–228.Crossref, Google Scholar
[4] Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.Crossref, Google Scholar
[5] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (1995) Gambling in a rigged casino: The adversarial multi-armed bandit problem. Proc. 36th Annual Sympos. Foundations Comput. Sci. (IEEE, Washington, DC), 322–331.Google Scholar
[6] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends® in Machine Learn. 5(1):1–122.Google Scholar
[7] Cassel A, Mannor S, Zeevi A (2018) A general approach to multi-armed bandits under risk criteria. Bubeck S, Perchet V, Rigollet P, eds. Proc. 31st Conf. Learn. Theory, vol. 75, Proceedings of Machine Learning Research Series (PMLR), 1295–1306.Google Scholar
[8] David Y, Szörényi B, Ghavamzadeh M, Mannor S, Shimkin N (2018) PAC bandits with risk constraints. Internat. Sympos. Artificial Intelligence Math. ISAIM 2018, Fort Lauderdale, FL. http://isaim2018.cs.virginia.edu/papers/ISAIM2018_ML_David_etal.pdf.Google Scholar
[9] Fisher E (1992) On the law of the iterated logarithm for martingales. Ann. Probab. 20(2):675–680.Crossref, Google Scholar
[10] Fournier N, Guillin A (2015) On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields 162(3):707–738.Crossref, Google Scholar
[11] Galichet N, Sebag M, Teytaud O (2013) Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. Ong CS, Ho TB, eds. Proc. 5th Asian Conf. Machine Learn., vol. 29, Proceedings of Machine Learning Research Series (PMLR, Australian National University, Canberra, Australia). 245–260.Google Scholar
[12] Holland MJ, Haress EM (2020) Learning with CVaR-based feedback under potentially heavy tails. Preprint, submitted June 3, https://arxiv.org/abs/2006.02001.Google Scholar
[13] Jiang DR, Powell WB (2018) Risk-averse approximate dynamic programming with quantile-based risk measures. Math. Oper. Res. 43(2):554–579.Link, Google Scholar
[14] Kagrecha A, Nair J, Jagannathan K (2019) Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., New York), 11272–11281.Google Scholar
[15] Klenke A (2014) Law of the Iterated Logarithm (Springer, London), 509–519.Google Scholar
[16] LA P, Bhat SP (2022) A Wasserstein distance approach for concentration of empirical risk estimates. J. Machine Learn. Res. 23(238):1–61.Google Scholar
[17] LA P, Jagannathan K, Kolla R (2020) Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. Proc. Machine Learn. Systems (PMLR), 3657–3666.Google Scholar
[18] Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
[19] Maillard OA (2013) Robust risk-averse stochastic multi-armed bandits. Jain S, Munos R, Stephan F, Zeugmann T, eds. Algorithmic Learning Theory (Springer, Cham, Switzerland), 218–233.Google Scholar
[20] Maillard OA, Munos R, Stoltz G (2011) A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. Kakade SM, von Luxburg U, eds. Proc. 24th Annual Conf. Learn. Theory, vol. 19, Proceedings of Machine Learning Research Series (PMLR, Budapest, Hungary), 497–514.Google Scholar
[21] Massart P (1990) The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Ann. Probab. 18(3):1269–1283.Crossref, Google Scholar
[22] Robbins H (1952) Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58(5):527–535.Crossref, Google Scholar
[23] Sani A, Lazaric A, Munos R (2012) Risk-aversion in multi-armed bandits. Proc. 25th Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates Inc., Red Hook, NY), 3275–3283.Google Scholar
[24] Simonnet M (1996) The Strong Law of Large Numbers (Springer, New York), 311–325.Google Scholar
[25] Tamkin A, Keramati R, Dann C, Brunskill E (2019) Distributionally-aware exploration for CVaR bandits. NeurIPS 2019 Workshop Safety Robustness Decision Making RLDM 2019.Google Scholar
[26] Torossian L, Garivier A, Picheny V (2019) X-armed bandits: Optimizing quantiles, CVaR and other risks. Lee WS, Suzuki T, eds. Proc. Machine Learn. Res., vol. 101, Proceedings of Machine Learning Research Series (PMLR, Nagoya, Japan), 252–267.Google Scholar
[27] Tran-Thanh L, Yu JY (2014) Functional bandits. Preprint, submitted May 10, https://arxiv.org/abs/1405.2432.Google Scholar
[28] Vakili S, Zhao Q (2015) Mean-variance and value at risk in multi-armed bandit problems. 2015 53rd Annual Allerton Conf. Comm. Control Comput. (Allerton) (IEEE), 1330–1335.Google Scholar
[29] Vakili S, Zhao Q (2016) Risk-averse multi-armed bandit problems under mean-variance measure. IEEE J. Selected Topics Signal Processing 10(6):1093–1111.Crossref, Google Scholar
[30] Van der Vaart AW (2000) Asymptotic Statistics, vol. 3 (Cambridge University Press, Cambridge, UK).Google Scholar
[31] Yu X, King I, Lyu MR (2017) Risk control of best arm identification in multi-armed bandits via successive rejects. 2017 IEEE Internat. Conf. Data Mining (ICDM) (IEEE, Los Alamitos, CA), 1147–1152.Google Scholar
[32] Zhu Q, Tan V (2020) Thompson sampling algorithms for mean-variance bandits. Proc. Machine Learn. Systems 2020:2645–2654.Google Scholar
[33] Zimin A, Ibsen-Jensen R, Chatterjee K (2014) Generalized risk-aversion in stochastic multi-armed bandits. Preprint, submitted May 5, https://arxiv.org/abs/1405.0833.Google Scholar

cover image Mathematics of Operations Research

Volume 48, Issue 4

November 2023

Pages 1811-2382, C2

Article Information

Metrics

Information

Received:November 01, 2020
Accepted:October 11, 2022
Published Online:January 06, 2023

Cite as

Asaf Cassel, Shie Mannor, Assaf Zeevi (2023) A General Framework for Bandit Problems Beyond Cumulative Objectives. Mathematics of Operations Research 48(4):2196-2232.

https://doi.org/10.1287/moor.2022.1335

Keywords

Acknowledgments

The authors thank Ron Amit, Guy Tennenholtz, Nir Baram, and Nadav Merlis for helpful discussions of this work. The authors also thank the anonymous reviewers for providing the examples in Section 6 and additional helpful comments that improved this work. A preliminary version of this work appeared at the Conference on Learning Theory 2018.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

A General Framework for Bandit Problems Beyond Cumulative Objectives

References

Volume 48, Issue 4

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News