A Simple and Optimal Policy Design with Safety Against Heavy-Tailed Risk for Stochastic Bandits

Published Online:https://doi.org/10.1287/mnsc.2022.03512

References

  • Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 24 (NIPS 2011) (Curran Associates, Red Hook, NY), 2312–2320.Google Scholar
  • Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Proc. 25th Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 23 (PMLR, New York), 39.1–39.26.Google Scholar
  • Agrawal S, Goyal N (2017) Near-optimal regret bounds for Thompson sampling. J. ACM 64(5):1–24.CrossrefGoogle Scholar
  • Agrawal S, Juneja SK, Koolen WM (2021) Regret minimization in heavy-tailed bandits. 34th Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 134 (PMLR, New York), 26–62.Google Scholar
  • Araman VF, Caldentey R (2022) Diffusion approximations for a class of sequential testing problems. Management Sci. 68(8):5958–5979.Google Scholar
  • Ashutosh K, Nair J, Kagrecha A, Jagannathan K (2021) Bandit algorithms: Letting go of logarithmic regret for statistical robustness. Proc. 24th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 130 (PMLR, New York), 622–630.Google Scholar
  • Audibert JY, Munos R, Szepesvári C (2009) Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410(19):1876–1902.CrossrefGoogle Scholar
  • Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2):235–256.CrossrefGoogle Scholar
  • Baudry D, Gautron R, Kaufmann E, Maillard O (2021) Optimal Thompson sampling strategies for support-aware CVaR bandits. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139, 716–726.Google Scholar
  • Besbes O, Gur Y, Zeevi A (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 27 (NIPS 2014) (MIT Press, Cambridge, MA), 199–207.Google Scholar
  • Bhatt S, Fang G, Li P, Samorodnitsky G (2022) Nearly optimal Catoni’s M-estimator for infinite variance. Proc. 39th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 162 (PMLR, New York), 1925–1944.Google Scholar
  • Bouchaud JP, Georges A (1990) Anomalous diffusion in disordered media: Statistical mechanisms, models and physical applications. Phys. Rep. 195(4–5):127–293.CrossrefGoogle Scholar
  • Bouchaud JP, Potters M (2000) Theory of Financial Risks: From Statistical Physics to Risk Management, 1st ed. (Cambridge University Press, Cambridge, UK).Google Scholar
  • Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends® Machine Learn. 5(1):1–122.Google Scholar
  • Bubeck S, Cesa-Bianchi N, Lugosi G (2013) Bandits with heavy tail. IEEE Trans. Inform. Theory 59(11):7711–7717.CrossrefGoogle Scholar
  • Cassel A, Mannor S, Zeevi A (2018) A general approach to multi-armed bandits under risk criteria. Proc. 31st Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 75 (PMLR, New York), 1295–1306.Google Scholar
  • Chang JQ, Tan VY (2022) A unifying theory of Thompson sampling for continuous risk-averse bandits. Proc. 36th AAAI Conf. Artificial Intelligence (AAAI Press, Washington, DC).Google Scholar
  • Chen N, Yang S, Zhang H (2022) Bridging adversarial and nonstationary multi-armed bandit. Preprint, submitted January 5, https://arxiv.org/abs/2201.01628.Google Scholar
  • Chen Y, Lee CW, Luo H, Wei CY (2019) A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. Proc. 32nd Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 99 (PMLR, New York), 696–726.Google Scholar
  • Cheung WC, Simchi-Levi D, Zhu R (2019) Learning to optimize under non-stationarity. Proc. 22nd Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 89 (PMLR, New York), 1079–1087.Google Scholar
  • Chopra S, Sodhi M (2004) Supply chain breakdown. MIT Sloan Management Rev. 46(1):53–61.Google Scholar
  • Dani V, Hayes TP, Kakade SM (2008) Stochastic linear optimization under bandit feedback. Proc. 21st Annual Conf. Learn. Theory (PMLR, New York), 355–366.Google Scholar
  • Embrechts P, Klüppelberg C, Mikosch T (2013) Modelling Extremal Events for Insurance and Finance. Stochastic Modelling and Applied Probability, vol. 33 (Springer Science & Business Media, Berlin).Google Scholar
  • Even-Dar E, Mannor S, Mansour Y, Mahadevan S (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Machine Learn. Res. 7(6):1079–1105.Google Scholar
  • Fan L, Glynn PW (2021a) Diffusion approximations for Thompson sampling. Preprint, submitted May 19, https://arxiv.org/abs/2105.09232.Google Scholar
  • Fan L, Glynn PW (2021b) The fragility of optimized bandit algorithms. Preprint, submitted September 28, https://arxiv.org/abs/2109.13595.Google Scholar
  • Faury L, Russac Y, Abeille M, Calauzènes C (2021) Regret bounds for generalized linear bandits under parameter drift. Preprint, submitted March 9, https://arxiv.org/abs/2103.05750.Google Scholar
  • Galichet N, Sebag M, Teytaud O (2013) Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. Proc. 5th Asian Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 29 (PMLR, New York), 245–260.Google Scholar
  • Garivier A, Cappé O (2011) The KL-UCB algorithm for bounded stochastic bandits and beyond. Proc. 24th Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 19 (PMLR, New York), 359–376.Google Scholar
  • Garivier A, Moulines E (2008) On upper-confidence bound policies for non-stationary bandit problems. Preprint, submitted May 22, https://arxiv.org/abs/0805.3415.Google Scholar
  • Kalvit A, Zeevi A (2021) A closer look at the worst-case behavior of multi-armed bandit algorithms. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems 34 (NeurIPS 2021) (Curran Associates, Red Hook, NY), 8807–8819.Google Scholar
  • Khajonchotpanya N, Xue Y, Rujeerapaiboon N (2021) A revised approach for risk-averse multi-armed bandits under CVaR criterion. Oper. Res. Lett. 49(4):465–472.CrossrefGoogle Scholar
  • Kuang, X, Wager S (2023) Weak signal asymptotics for sequentially randomized experiments. Management Sci., ePub ahead of print December 6, https://doi.org/10.1287/mnsc.2023.4964.Google Scholar
  • Lattimore T (2017) A scale free algorithm for stochastic bandits with bounded kurtosis. von Luxburg U, Guyon I, Bengio S, Wallach H, Fergus R, eds. Adv. Neural Inform. Processing Systems 30 (NIPS 2017) (Curran Associates, Red Hook, NY), 1583–1592.Google Scholar
  • Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Lee K, Yang H, Lim S, Oh S (2020) Optimal algorithms for stochastic multi-armed bandits with heavy tailed rewards. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems 33 (NeurIPS 2020) (Curran Associates, Red Hook, NY), 8452–8462.Google Scholar
  • Liu S, Jiang J, Li X (2022) Non-stationary bandits with knapsacks. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Adv. Neural Inform. Processing Systems 35 (NeurIPS 2022) (Curran Associates, Red Hook, NY), 16522–16532.Google Scholar
  • Lugosi G, Mendelson S (2019) Mean estimation and regression under heavy-tailed distributions: A survey. Found. Comput. Math. 19(5):1145–1190.CrossrefGoogle Scholar
  • Luo H, Wei CY, Agarwal A, Langford J (2018) Efficient contextual bandits in non-stationary worlds. 31st Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 75 (PMLR, New York), 1739–1776.Google Scholar
  • Maillard OA (2013) Robust risk-averse stochastic multi-armed bandits. Jain S, Munos R, Stephan F, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory (ALT 2013), Lecture Notes in Computer Science, vol. 8139 (Springer, Berlin), 218–233.Google Scholar
  • Neu G (2015) Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Cortes C, Lee DD, Sugiyama M, Garnett R, eds. Adv. Neural Inform. Processing Systems 28 (NIPS 2015) (MIT Press, Cambridge, MA), 3168–3176.Google Scholar
  • Prashanth L, Jagannathan K, Kolla RK (2020) Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (PMLR, New York), 5577–5586.Google Scholar
  • Qin C, Russo D (2022) Adaptivity and confounding in multi-armed bandit experiments. Preprint, submitted February 18, https://arxiv.org/abs/2202.09036.Google Scholar
  • Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z (2018) A tutorial on Thompson sampling. Foundations Trends Machine Learn. 11(1):1–96.CrossrefGoogle Scholar
  • Salomon A, Audibert JY (2011) Deviations of stochastic bandit regret. Kivinen J, Szepesvári C, Ukkonen E, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory (ALT 2011), Lecture Notes in Computer Science, vol. 6925 (Springer, Berlin), 159–173.Google Scholar
  • Sani A, Lazaric A, Munos R (2012) Risk-aversion in multi-armed bandits. Pereira F, Burges CJ, Bottou L, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 25 (NIPS 2012) (Curran Associates, Red Hook, NY), 3275–3283.Google Scholar
  • Slivkins A (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learn. 12(1–2):1–286.CrossrefGoogle Scholar
  • Sodhi MS, Tang CS (2021) Supply chain management for extreme conditions: Research opportunities. J. Supply Chain Management 57(1):7–16.CrossrefGoogle Scholar
  • Tamkin A, Keramati R, Dann C, Brunskill E (2019) Distributionally-aware exploration for CVaR bandits. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.Google Scholar
  • Tao Y, Wu Y, Zhao P, Wang D (2022) Optimal rates of (locally) differentially private heavy-tailed multi-armed bandits. Proc. 25th Internat. Conf. Artificial Intelligence Statist. (AISTATS) 2022, Proceedings of Machine Learning Research, vol. 151 (PMLR, New York), 1546–1574.Google Scholar
  • Vakili S, Zhao Q (2016) Risk-averse multi-armed bandit problems under mean-variance measure. IEEE J. Selected Topics Signal Processing 10(6):1093–1111.CrossrefGoogle Scholar
  • Yu JY, Mannor S (2009) Piecewise-stationary bandit problems with side observations. Proc. 26th Internat. Conf. Machine Learn. (Association for Computing Machinery, New York), 1177–1184.Google Scholar
  • Yu X, Shao H, Lyu MR, King I (2018) Pure exploration of multi-armed bandits with heavy-tailed payoffs. Proc. 34th Conf. Uncertainty Artificial Intelligence, 937–946.Google Scholar
  • Zhou X, Xiong Y, Chen N, Gao X (2021) Regime switching bandits. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems 34 (NeurIPS 2021) (Curran Associates, Red Hook, NY), 4542–4554.Google Scholar
  • Zhu Q, Tan V (2020) Thompson sampling algorithms for mean-variance bandits. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (PMLR, New York), 11599–11608.Google Scholar
  • Zimin A, Ibsen-Jensen R, Chatterjee K (2014) Generalized risk-aversion in stochastic multi-armed bandits. Preprint, submitted May 5, https://arxiv.org/abs/1405.0833.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.