A Simple and Optimal Policy Design with Safety Against Heavy-Tailed Risk for Stochastic Bandits

David Simchi-Levi
David Simchi-Levi
[email protected]
https://orcid.org/0000-0002-4650-1519
Institute for Data, Systems, and Society, Department of Civil and Environmental Engineering, Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Search for more papers by this author
,
Zeyu Zheng
Zeyu Zheng
[email protected]
https://orcid.org/0000-0001-5653-152X
Department of Industrial Engineering and Operations Research, University of California, Berkeley, California 94720
Search for more papers by this author
,
Feng Zhu
Corresponding Author
Feng Zhu
[email protected]
https://orcid.org/0000-0003-4979-4879
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Search for more papers by this author

Institute for Data, Systems, and Society, Department of Civil and Environmental Engineering, Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Search for more papers by this author

Zeyu Zheng

[email protected]

https://orcid.org/0000-0001-5653-152X

Department of Industrial Engineering and Operations Research, University of California, Berkeley, California 94720

Search for more papers by this author

Feng Zhu

Corresponding Author

Feng Zhu

[email protected]

https://orcid.org/0000-0003-4979-4879

Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Search for more papers by this author

Published Online:30 Oct 2024https://doi.org/10.1287/mnsc.2022.03512

References

Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 24 (NIPS 2011) (Curran Associates, Red Hook, NY), 2312–2320.Google Scholar
Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Proc. 25th Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 23 (PMLR, New York), 39.1–39.26.Google Scholar
Agrawal S, Goyal N (2017) Near-optimal regret bounds for Thompson sampling. J. ACM 64(5):1–24.Crossref, Google Scholar
Agrawal S, Juneja SK, Koolen WM (2021) Regret minimization in heavy-tailed bandits. 34th Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 134 (PMLR, New York), 26–62.Google Scholar
Araman VF, Caldentey R (2022) Diffusion approximations for a class of sequential testing problems. Management Sci. 68(8):5958–5979.Google Scholar
Ashutosh K, Nair J, Kagrecha A, Jagannathan K (2021) Bandit algorithms: Letting go of logarithmic regret for statistical robustness. Proc. 24th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 130 (PMLR, New York), 622–630.Google Scholar
Audibert JY, Munos R, Szepesvári C (2009) Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410(19):1876–1902.Crossref, Google Scholar
Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2):235–256.Crossref, Google Scholar
Baudry D, Gautron R, Kaufmann E, Maillard O (2021) Optimal Thompson sampling strategies for support-aware CVaR bandits. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139, 716–726.Google Scholar
Besbes O, Gur Y, Zeevi A (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 27 (NIPS 2014) (MIT Press, Cambridge, MA), 199–207.Google Scholar
Bhatt S, Fang G, Li P, Samorodnitsky G (2022) Nearly optimal Catoni’s M-estimator for infinite variance. Proc. 39th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 162 (PMLR, New York), 1925–1944.Google Scholar
Bouchaud JP, Georges A (1990) Anomalous diffusion in disordered media: Statistical mechanisms, models and physical applications. Phys. Rep. 195(4–5):127–293.Crossref, Google Scholar
Bouchaud JP, Potters M (2000) Theory of Financial Risks: From Statistical Physics to Risk Management, 1st ed. (Cambridge University Press, Cambridge, UK).Google Scholar
Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends® Machine Learn. 5(1):1–122.Google Scholar
Bubeck S, Cesa-Bianchi N, Lugosi G (2013) Bandits with heavy tail. IEEE Trans. Inform. Theory 59(11):7711–7717.Crossref, Google Scholar
Cassel A, Mannor S, Zeevi A (2018) A general approach to multi-armed bandits under risk criteria. Proc. 31st Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 75 (PMLR, New York), 1295–1306.Google Scholar
Chang JQ, Tan VY (2022) A unifying theory of Thompson sampling for continuous risk-averse bandits. Proc. 36th AAAI Conf. Artificial Intelligence (AAAI Press, Washington, DC).Google Scholar
Chen N, Yang S, Zhang H (2022) Bridging adversarial and nonstationary multi-armed bandit. Preprint, submitted January 5, https://arxiv.org/abs/2201.01628.Google Scholar
Chen Y, Lee CW, Luo H, Wei CY (2019) A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. Proc. 32nd Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 99 (PMLR, New York), 696–726.Google Scholar
Cheung WC, Simchi-Levi D, Zhu R (2019) Learning to optimize under non-stationarity. Proc. 22nd Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 89 (PMLR, New York), 1079–1087.Google Scholar
Chopra S, Sodhi M (2004) Supply chain breakdown. MIT Sloan Management Rev. 46(1):53–61.Google Scholar
Dani V, Hayes TP, Kakade SM (2008) Stochastic linear optimization under bandit feedback. Proc. 21st Annual Conf. Learn. Theory (PMLR, New York), 355–366.Google Scholar
Embrechts P, Klüppelberg C, Mikosch T (2013) Modelling Extremal Events for Insurance and Finance. Stochastic Modelling and Applied Probability, vol. 33 (Springer Science & Business Media, Berlin).Google Scholar
Even-Dar E, Mannor S, Mansour Y, Mahadevan S (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Machine Learn. Res. 7(6):1079–1105.Google Scholar
Fan L, Glynn PW (2021a) Diffusion approximations for Thompson sampling. Preprint, submitted May 19, https://arxiv.org/abs/2105.09232.Google Scholar
Fan L, Glynn PW (2021b) The fragility of optimized bandit algorithms. Preprint, submitted September 28, https://arxiv.org/abs/2109.13595.Google Scholar
Faury L, Russac Y, Abeille M, Calauzènes C (2021) Regret bounds for generalized linear bandits under parameter drift. Preprint, submitted March 9, https://arxiv.org/abs/2103.05750.Google Scholar
Galichet N, Sebag M, Teytaud O (2013) Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. Proc. 5th Asian Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 29 (PMLR, New York), 245–260.Google Scholar
Garivier A, Cappé O (2011) The KL-UCB algorithm for bounded stochastic bandits and beyond. Proc. 24th Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 19 (PMLR, New York), 359–376.Google Scholar
Garivier A, Moulines E (2008) On upper-confidence bound policies for non-stationary bandit problems. Preprint, submitted May 22, https://arxiv.org/abs/0805.3415.Google Scholar
Kalvit A, Zeevi A (2021) A closer look at the worst-case behavior of multi-armed bandit algorithms. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems 34 (NeurIPS 2021) (Curran Associates, Red Hook, NY), 8807–8819.Google Scholar
Khajonchotpanya N, Xue Y, Rujeerapaiboon N (2021) A revised approach for risk-averse multi-armed bandits under CVaR criterion. Oper. Res. Lett. 49(4):465–472.Crossref, Google Scholar
Kuang, X, Wager S (2023) Weak signal asymptotics for sequentially randomized experiments. Management Sci., ePub ahead of print December 6, https://doi.org/10.1287/mnsc.2023.4964.Google Scholar
Lattimore T (2017) A scale free algorithm for stochastic bandits with bounded kurtosis. von Luxburg U, Guyon I, Bengio S, Wallach H, Fergus R, eds. Adv. Neural Inform. Processing Systems 30 (NIPS 2017) (Curran Associates, Red Hook, NY), 1583–1592.Google Scholar
Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Lee K, Yang H, Lim S, Oh S (2020) Optimal algorithms for stochastic multi-armed bandits with heavy tailed rewards. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems 33 (NeurIPS 2020) (Curran Associates, Red Hook, NY), 8452–8462.Google Scholar
Liu S, Jiang J, Li X (2022) Non-stationary bandits with knapsacks. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Adv. Neural Inform. Processing Systems 35 (NeurIPS 2022) (Curran Associates, Red Hook, NY), 16522–16532.Google Scholar
Lugosi G, Mendelson S (2019) Mean estimation and regression under heavy-tailed distributions: A survey. Found. Comput. Math. 19(5):1145–1190.Crossref, Google Scholar
Luo H, Wei CY, Agarwal A, Langford J (2018) Efficient contextual bandits in non-stationary worlds. 31st Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 75 (PMLR, New York), 1739–1776.Google Scholar
Maillard OA (2013) Robust risk-averse stochastic multi-armed bandits. Jain S, Munos R, Stephan F, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory (ALT 2013), Lecture Notes in Computer Science, vol. 8139 (Springer, Berlin), 218–233.Google Scholar
Neu G (2015) Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Cortes C, Lee DD, Sugiyama M, Garnett R, eds. Adv. Neural Inform. Processing Systems 28 (NIPS 2015) (MIT Press, Cambridge, MA), 3168–3176.Google Scholar
Prashanth L, Jagannathan K, Kolla RK (2020) Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (PMLR, New York), 5577–5586.Google Scholar
Qin C, Russo D (2022) Adaptivity and confounding in multi-armed bandit experiments. Preprint, submitted February 18, https://arxiv.org/abs/2202.09036.Google Scholar
Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z (2018) A tutorial on Thompson sampling. Foundations Trends Machine Learn. 11(1):1–96.Crossref, Google Scholar
Salomon A, Audibert JY (2011) Deviations of stochastic bandit regret. Kivinen J, Szepesvári C, Ukkonen E, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory (ALT 2011), Lecture Notes in Computer Science, vol. 6925 (Springer, Berlin), 159–173.Google Scholar
Sani A, Lazaric A, Munos R (2012) Risk-aversion in multi-armed bandits. Pereira F, Burges CJ, Bottou L, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 25 (NIPS 2012) (Curran Associates, Red Hook, NY), 3275–3283.Google Scholar
Slivkins A (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learn. 12(1–2):1–286.Crossref, Google Scholar
Sodhi MS, Tang CS (2021) Supply chain management for extreme conditions: Research opportunities. J. Supply Chain Management 57(1):7–16.Crossref, Google Scholar
Tamkin A, Keramati R, Dann C, Brunskill E (2019) Distributionally-aware exploration for CVaR bandits. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.Google Scholar
Tao Y, Wu Y, Zhao P, Wang D (2022) Optimal rates of (locally) differentially private heavy-tailed multi-armed bandits. Proc. 25th Internat. Conf. Artificial Intelligence Statist. (AISTATS) 2022, Proceedings of Machine Learning Research, vol. 151 (PMLR, New York), 1546–1574.Google Scholar
Vakili S, Zhao Q (2016) Risk-averse multi-armed bandit problems under mean-variance measure. IEEE J. Selected Topics Signal Processing 10(6):1093–1111.Crossref, Google Scholar
Yu JY, Mannor S (2009) Piecewise-stationary bandit problems with side observations. Proc. 26th Internat. Conf. Machine Learn. (Association for Computing Machinery, New York), 1177–1184.Google Scholar
Yu X, Shao H, Lyu MR, King I (2018) Pure exploration of multi-armed bandits with heavy-tailed payoffs. Proc. 34th Conf. Uncertainty Artificial Intelligence, 937–946.Google Scholar
Zhou X, Xiong Y, Chen N, Gao X (2021) Regime switching bandits. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems 34 (NeurIPS 2021) (Curran Associates, Red Hook, NY), 4542–4554.Google Scholar
Zhu Q, Tan V (2020) Thompson sampling algorithms for mean-variance bandits. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (PMLR, New York), 11599–11608.Google Scholar
Zimin A, Ibsen-Jensen R, Chatterjee K (2014) Generalized risk-aversion in stochastic multi-armed bandits. Preprint, submitted May 5, https://arxiv.org/abs/1405.0833.Google Scholar

Volume 71, Issue 7

July 2025

Pages iv-vi, 5419-6318

Article Information

Supplemental Material

Metrics

Information

Received:November 14, 2022
Accepted:April 26, 2024
Published Online:October 30, 2024

Cite as

David Simchi-Levi, Zeyu Zheng, Feng Zhu (2024) A Simple and Optimal Policy Design with Safety Against Heavy-Tailed Risk for Stochastic Bandits. Management Science 71(7):6298-6318.

https://doi.org/10.1287/mnsc.2022.03512

Keywords

Acknowledgments

The authors gratefully acknowledge the editor and the reviewers for their time and comments that greatly helped improve the manuscript. The authors are listed in alphabetical order.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

A Simple and Optimal Policy Design with Safety Against Heavy-Tailed Risk for Stochastic Bandits

References

Volume 71, Issue 7

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News