A General Framework for Bandit Problems Beyond Cumulative Objectives

Published Online:https://doi.org/10.1287/moor.2022.1335

References

  • [1] Agrawal S, Devanur NR (2019) Bandits with global convex constraints and objective. Oper. Res. 67(5):1486–1502.LinkGoogle Scholar
  • [2] Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Annual Conf. Learn. Theory, vol. 23, Proceedings of Machine Learning Research Series (PMLR, Edinburgh, UK), 39.1–39.26.Google Scholar
  • [3] Artzner P, Delbaen F, Eber JM, Heath D (1999) Coherent measures of risk. Math. Finance 9(3):203–228.CrossrefGoogle Scholar
  • [4] Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.CrossrefGoogle Scholar
  • [5] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (1995) Gambling in a rigged casino: The adversarial multi-armed bandit problem. Proc. 36th Annual Sympos. Foundations Comput. Sci. (IEEE, Washington, DC), 322–331.Google Scholar
  • [6] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends® in Machine Learn. 5(1):1–122.Google Scholar
  • [7] Cassel A, Mannor S, Zeevi A (2018) A general approach to multi-armed bandits under risk criteria. Bubeck S, Perchet V, Rigollet P, eds. Proc. 31st Conf. Learn. Theory, vol. 75, Proceedings of Machine Learning Research Series (PMLR), 1295–1306.Google Scholar
  • [8] David Y, Szörényi B, Ghavamzadeh M, Mannor S, Shimkin N (2018) PAC bandits with risk constraints. Internat. Sympos. Artificial Intelligence Math. ISAIM 2018, Fort Lauderdale, FL. http://isaim2018.cs.virginia.edu/papers/ISAIM2018_ML_David_etal.pdf.Google Scholar
  • [9] Fisher E (1992) On the law of the iterated logarithm for martingales. Ann. Probab. 20(2):675–680.CrossrefGoogle Scholar
  • [10] Fournier N, Guillin A (2015) On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields 162(3):707–738.CrossrefGoogle Scholar
  • [11] Galichet N, Sebag M, Teytaud O (2013) Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. Ong CS, Ho TB, eds. Proc. 5th Asian Conf. Machine Learn., vol. 29, Proceedings of Machine Learning Research Series (PMLR, Australian National University, Canberra, Australia). 245–260.Google Scholar
  • [12] Holland MJ, Haress EM (2020) Learning with CVaR-based feedback under potentially heavy tails. Preprint, submitted June 3, https://arxiv.org/abs/2006.02001.Google Scholar
  • [13] Jiang DR, Powell WB (2018) Risk-averse approximate dynamic programming with quantile-based risk measures. Math. Oper. Res. 43(2):554–579.LinkGoogle Scholar
  • [14] Kagrecha A, Nair J, Jagannathan K (2019) Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., New York), 11272–11281.Google Scholar
  • [15] Klenke A (2014) Law of the Iterated Logarithm (Springer, London), 509–519.Google Scholar
  • [16] LA P, Bhat SP (2022) A Wasserstein distance approach for concentration of empirical risk estimates. J. Machine Learn. Res. 23(238):1–61.Google Scholar
  • [17] LA P, Jagannathan K, Kolla R (2020) Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. Proc. Machine Learn. Systems (PMLR), 3657–3666.Google Scholar
  • [18] Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.CrossrefGoogle Scholar
  • [19] Maillard OA (2013) Robust risk-averse stochastic multi-armed bandits. Jain S, Munos R, Stephan F, Zeugmann T, eds. Algorithmic Learning Theory (Springer, Cham, Switzerland), 218–233.Google Scholar
  • [20] Maillard OA, Munos R, Stoltz G (2011) A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. Kakade SM, von Luxburg U, eds. Proc. 24th Annual Conf. Learn. Theory, vol. 19, Proceedings of Machine Learning Research Series (PMLR, Budapest, Hungary), 497–514.Google Scholar
  • [21] Massart P (1990) The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Ann. Probab. 18(3):1269–1283.CrossrefGoogle Scholar
  • [22] Robbins H (1952) Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58(5):527–535.CrossrefGoogle Scholar
  • [23] Sani A, Lazaric A, Munos R (2012) Risk-aversion in multi-armed bandits. Proc. 25th Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates Inc., Red Hook, NY), 3275–3283.Google Scholar
  • [24] Simonnet M (1996) The Strong Law of Large Numbers (Springer, New York), 311–325.Google Scholar
  • [25] Tamkin A, Keramati R, Dann C, Brunskill E (2019) Distributionally-aware exploration for CVaR bandits. NeurIPS 2019 Workshop Safety Robustness Decision Making RLDM 2019.Google Scholar
  • [26] Torossian L, Garivier A, Picheny V (2019) X-armed bandits: Optimizing quantiles, CVaR and other risks. Lee WS, Suzuki T, eds. Proc. Machine Learn. Res., vol. 101, Proceedings of Machine Learning Research Series (PMLR, Nagoya, Japan), 252–267.Google Scholar
  • [27] Tran-Thanh L, Yu JY (2014) Functional bandits. Preprint, submitted May 10, https://arxiv.org/abs/1405.2432.Google Scholar
  • [28] Vakili S, Zhao Q (2015) Mean-variance and value at risk in multi-armed bandit problems. 2015 53rd Annual Allerton Conf. Comm. Control Comput. (Allerton) (IEEE), 1330–1335.Google Scholar
  • [29] Vakili S, Zhao Q (2016) Risk-averse multi-armed bandit problems under mean-variance measure. IEEE J. Selected Topics Signal Processing 10(6):1093–1111.CrossrefGoogle Scholar
  • [30] Van der Vaart AW (2000) Asymptotic Statistics, vol. 3 (Cambridge University Press, Cambridge, UK).Google Scholar
  • [31] Yu X, King I, Lyu MR (2017) Risk control of best arm identification in multi-armed bandits via successive rejects. 2017 IEEE Internat. Conf. Data Mining (ICDM) (IEEE, Los Alamitos, CA), 1147–1152.Google Scholar
  • [32] Zhu Q, Tan V (2020) Thompson sampling algorithms for mean-variance bandits. Proc. Machine Learn. Systems 2020:2645–2654.Google Scholar
  • [33] Zimin A, Ibsen-Jensen R, Chatterjee K (2014) Generalized risk-aversion in stochastic multi-armed bandits. Preprint, submitted May 5, https://arxiv.org/abs/1405.0833.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.