Maximal Objectives in the Multiarmed Bandit with Applications

Eren Ozbay
Eren Ozbay
[email protected]
https://orcid.org/0000-0001-5634-4003
Department of Information and Decision Sciences, The University of Illinois Chicago, Chicago, Illinois 60607
Search for more papers by this author
,
Vijay Kamble
Corresponding Author
Vijay Kamble
[email protected]
https://orcid.org/0000-0002-9261-1612
Department of Information and Decision Sciences, The University of Illinois Chicago, Chicago, Illinois 60607
Search for more papers by this author

Department of Information and Decision Sciences, The University of Illinois Chicago, Chicago, Illinois 60607

Search for more papers by this author

Vijay Kamble

Corresponding Author

Vijay Kamble

[email protected]

https://orcid.org/0000-0002-9261-1612

Department of Information and Decision Sciences, The University of Illinois Chicago, Chicago, Illinois 60607

Search for more papers by this author

Published Online:15 Mar 2024https://doi.org/10.1287/mnsc.2022.00801

References

Agrawal R (1995) Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 27(4):1054–1078.Crossref, Google Scholar
Audibert J-Y, Bubeck S (2009) Minimax policies for adversarial and stochastic bandits. Proc. 22nd Annual Conf. Learn. Theory, Quebec, Canada, 217–226.Google Scholar
Audibert J-Y, Bubeck S, Munos R (2010) Best arm identification in multi-armed bandits. Proc. 23rd Annual Conf. Learn. Theory (Omnipress, Madison, WI), 41–53.Google Scholar
Auer P, Ortner R (2010) UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61(1–2):55–65.Crossref, Google Scholar
Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47:235–256.Crossref, Google Scholar
Baudry D, Russac Y, Kaufmann E (2022) Efficient algorithms for extreme bandits. Preprint, submitted March 21, https://arxiv.org/abs/2203.10883.Google Scholar
Bhatt S, Li P, Samorodnitsky G (2022) Extreme bandits using robust statistics. IEEE Trans. Inform. Theory 69(3):1761–1776.Crossref, Google Scholar
Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Crossref, Google Scholar
Bubeck S, Munos R, Stoltz G (2009) Pure exploration in multi-armed bandits problems. Proc. 20th Internat. Conf. Algorithmic Learn. Theory (Springer, Berlin, Heidelberg), 23–37.Google Scholar
Bubeck S, Munos R, Stoltz G (2011) Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Comput. Sci. 412(19):1832–1852.Crossref, Google Scholar
Bubeck S, Wang T, Viswanathan N (2013) Multiple identifications in multi-armed bandits. Proc. 26th Annual Conf. Learn. Theory (JMLR.org), 258–265.Google Scholar
Carpentier A, Locatelli A (2016) Tight (lower) bounds for the fixed budget best arm identification bandit problem. Proc. 29th Annual Conf. Learn. Theory (JMLR.org), 590–604.Google Scholar
Carpentier A, Valko M (2014) Extreme bandits. Adv. Neural Inform. Processing Systems 27: 28th Annual Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 1089–1097.Google Scholar
Carpentier A, Valko M (2015) Simple regret for infinitely many armed bandits. Proc. 32nd Internat. Conf. Machine Learn. (PMLR, New York), 1133–1141.Google Scholar
Cesa-Bianchi N, Dekel O, Shamir O (2013) Online learning with switching costs and other adaptive adversaries. Adv. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 1160–1168.Google Scholar
Cicirello VA, Smith SF (2005) The max k-armed bandit: A new model of exploration applied to search heuristic selection. Proc. 20th National Conf. Artificial Intelligence, vol. 3 (AAAI Press, Palo Alto, CA), 1355–1361.Google Scholar
Dekel O, Ding J, Koren T, Peres Y (2014) Bandits with switching costs: T2/3 regret. Proc. 46th Annual ACM Sympos. Theory Comput. (ACM, New York), 459–467.Google Scholar
Esfandiari H, Karbasi A, Mehrabian A, Mirrokni V (2021) Regret bounds for batched bandits. Proc. Conf. AAAI Artificial Intelligence, vol. 35 (AAAI Press, Palo Alto, CA), 7340–7348.Google Scholar
Even-Dar E, Mannor S, Mansour Y (2002) PAC bounds for multi-armed bandit and Markov decision processes. Proc. 15th Internat. Conf. Comput. Learn. Theory (Springer, Berlin, Heidelberg), 255–270.Google Scholar
Even-Dar E, Mannor S, Mansour Y (2010) Learning with global cost in stochastic environments. Kalai AT, Mohri M, eds. Proc. 23rd Annual Conf. Learn. Theory, vol. 1 (Omnipress, Madison, WI), 80–92.Google Scholar
Even-Dar E, Kleinberg R, Mannor S, Mansour Y (2009) Online learning for global cost functions. Proc. 22nd Annual Conf. Learn. Theory, Montreal, Canada, 1527–1537.Google Scholar
Even-Dar E, Mannor S, Mansour Y, Mahadevan S (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Machine Learn. Res. 7(6):1079–1105.Google Scholar
Gao Z, Han Y, Ren Z, Zhou Z (2019) Batched multi-armed bandits problem. Adv. Neural Inform. Processing Systems, vol. 32 (Curran Associates Inc., Red Hook, NY), 503–513.Google Scholar
Garivier A, Cappé O (2011) The KL-UCB algorithm for bounded stochastic bandits and beyond. Proc. 24th Annual Conf. Learn. Theory (JMLR.org), 359–376.Google Scholar
Garivier A, Lattimore T, Kaufmann E (2016) On explore-then-commit strategies. Adv. Neural Inform. Processing Systems, vol. 29.Google Scholar
Gittins J, Glazebrook K, Weber R (2011) Multi-Armed Bandit Allocation Indices (John Wiley & Sons, Hoboken, NJ).Crossref, Google Scholar
Gourville JT, Soman D (2005) Overchoice and assortment type: When and why variety backfires. Marketing Sci. 24(3):382–395.Link, Google Scholar
György A, Kocsis L (2011) Efficient multi-start strategies for local search algorithms. J. Artificial Intelligence Res. 41:407–444.Crossref, Google Scholar
Hsu W-K, Xu J, Lin X, Bell MR (2022) Integrated online learning and adaptive control in queueing systems with uncertain payoffs. Oper. Res. 70(2):1166–1181.Link, Google Scholar
Jamieson K, Malloy M, Nowak R, Bubeck S (2014) lil’UCB: An optimal exploration algorithm for multi-armed bandits. Proc. 27th Annual Conf. Learn. Theory (JMLR.org), 423–439.Google Scholar
Jin T, Xu P, Xiao X, Gu Q (2021) Double explore-then-commit: Asymptotic optimality and beyond. Proc. 34th Annual Conf. Learn. Theory (PMLR, New York), 2584–2633.Google Scholar
Johari R, Kamble V, Kanoria Y (2021) Matching while learning. Oper. Res. 69(2):655–681.Link, Google Scholar
Jun K-S, Jamieson K, Nowak R, Zhu X (2016) Top arm identification in multi-armed bandits with batch arm pulls. Artificial Intelligence Statist., 139–148.Google Scholar
Kalyanakrishnan S, Tewari A, Auer P, Stone P (2012) PAC subset selection in stochastic multi-armed bandits. Proc. 29th Internat. Conf. Machine Learn., 227–234.Google Scholar
Karnin Z, Koren T, Somekh O (2013) Almost optimal exploration in multi-armed bandits. Proc. 30th Internat. Conf. Machine Learn. (JMLR.org), 1238–1246.Google Scholar
Kaufmann E, Kalyanakrishnan S (2013) Information complexity in bandit subset selection. Proc. 26th Annual Conf. Learn. Theory (JMLR.org), 228–251.Google Scholar
Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best-arm identification in multi-armed bandit models. J. Machine Learn. Res. 17(1):1–42.Google Scholar
Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
Lattimore T (2018) Refining the confidence level for optimistic bandit strategies. J. Machine Learn. Res. 19(1):765–796.Google Scholar
Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge).Crossref, Google Scholar
Mannor S, Tsitsiklis JN (2004) The sample complexity of exploration in the multi-armed bandit problem. J. Machine Learn. Res. 5:623–648.Google Scholar
Martí R, Lozano JA, Mendiburu A, Hernando L (2018) Multi-start methods. Handbook of Heuristics (Springer, Berlin, Heidelberg), 155–175.Crossref, Google Scholar
Martí R, Aceves R, León MT, Moreno-Vega JM, Duarte A (2019) Intelligent multi-start methods. Handbook of Metaheuristics, 221–243.Google Scholar
Massoulié L, Xu K (2018) On the capacity of information processing systems. Oper. Res. 66(2):568–586.Link, Google Scholar
Ozbay E, Kamble V (2022) Incentives for exploration at market equilibrium. Preprint, submitted March 28, https://dx.doi.org/10.2139/ssrn.4041075.Google Scholar
Perchet V, Rigollet P (2013) The multi-armed bandit problem with covariates. Ann. Statist. 41(2):693–721.Crossref, Google Scholar
Perchet V, Rigollet P, Chassang S, Snowberg E (2016) Batched bandit problems. Ann. Statist. 44(2):660–681.Crossref, Google Scholar
Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z (2018) A tutorial on thompson sampling. Foundations Trends Machine Learn., 11(1):1–96.Crossref, Google Scholar
Settle RB, Golden LL (1974) Consumer perceptions: Overchoice in the market place. Adv. Consumer Res. 1:29–37.Google Scholar
Shah V, Gulikers L, Massoulié L, Vojnović M (2020) Adaptive matching for expert systems with uncertain task types. Oper. Res. 68(5):1403–1424.Link, Google Scholar
Slivkins A (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learn. 12(1–2):1–286.Crossref, Google Scholar
Smith-Miles KA (2009) Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput. Surveys 41(6):1–25.Crossref, Google Scholar
Streeter MJ, Smith SF (2006a) An asymptotically optimal algorithm for the max k-armed bandit problem. Proc. 21st National Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 135–142.Google Scholar
Streeter MJ, Smith SF (2006b) A simple distribution-free approach to the max k-armed bandit problem. Proc. 12th Internat. Conf. Principles Practice Constraint Programming (Springer, Berlin, Heidelberg), 560–574.Google Scholar
Sun X, Zhao J (2022) Congestion-aware matching and learning for service platforms.Google Scholar
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
Vaidhiyan NK, Sundaresan R (2017) Learning to detect an oddball target. IEEE Trans. Inform. Theory 64(2):831–852.Crossref, Google Scholar
Yu H, Zhang D, Rauchwerger L (2004) An adaptive algorithm selection framework. Proc. 13th Internat. Conf. Parallel Architecture Compilation Techniques (IEEE, Piscataway, NJ), 278–289.Google Scholar
Zhou Y, Chen X, Li J (2014) Optimal PAC multiple arm identification with applications to crowdsourcing. Proc. 31st Internat. Conf. Machine Learn. (JMLR.org), 217–225.Google Scholar

Volume 70, Issue 12

December 2024

Pages 8217-9119, iv-vi

Article Information

Supplemental Material

Metrics

Information

Received:March 14, 2022
Accepted:June 09, 2023
Published Online:March 15, 2024

Cite as

Eren Ozbay, Vijay Kamble (2024) Maximal Objectives in the Multiarmed Bandit with Applications. Management Science 70(12):8853-8874.

https://doi.org/10.1287/mnsc.2022.00801

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Maximal Objectives in the Multiarmed Bandit with Applications

References

Volume 70, Issue 12

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News