Optimal Learning for Structured Bandits

Bart Van Parys
Corresponding Author
Bart Van Parys
[email protected]
https://orcid.org/0000-0003-4177-4849
Massachusetts Institute of Technology, Sloan School of Management, Cambridge, Massachusetts 02142
Search for more papers by this author
,
Negin Golrezaei
Negin Golrezaei
[email protected]
https://orcid.org/0000-0002-5782-1534
Massachusetts Institute of Technology, Sloan School of Management, Cambridge, Massachusetts 02142
Search for more papers by this author

Bart Van Parys

Corresponding Author

Bart Van Parys

[email protected]

https://orcid.org/0000-0003-4177-4849

Massachusetts Institute of Technology, Sloan School of Management, Cambridge, Massachusetts 02142

Search for more papers by this author

Negin Golrezaei

[email protected]

https://orcid.org/0000-0002-5782-1534

Massachusetts Institute of Technology, Sloan School of Management, Cambridge, Massachusetts 02142

Search for more papers by this author

Published Online:16 Aug 2023https://doi.org/10.1287/mnsc.2020.02108

References

Agrawal R, Teneketzis D, Anantharam V (1989) Asymptotically efficient adaptive allocation schemes for controlled iid processes: Finite parameter space. IEEE Trans. Automat. Control 34(3):258–267.Crossref, Google Scholar
Aubin JP, Frankowska H (2009) Set-Valued Analysis (Springer Science & Business Media, New York).Crossref, Google Scholar
Balseiro S, Golrezaei N, Mahdian M, Mirrokni V, Schneider J (2019) Contextual bandits with cross-learning. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Adv. Neural Inf. Process. Syst. (Curran Associates, Inc., Red Hook, NY), 9676–9685.Google Scholar
Bank B, Guddat J, Klatte D, Kummer B, Tammer K (1982) Non-Linear Parametric Optimization (Springer, Wiesbaden, Germany).Crossref, Google Scholar
Barvinok A (2002) A Course in Convexity, vol. 54 (American Mathematical Society, Providence, RI).Crossref, Google Scholar
Berge C (1997) Topological Spaces: Including a Treatment of Multi-Valued Functions, Vector Spaces, and Convexity (Dover Publications, Mineola, NY).Google Scholar
Bertsekas D (2009) Convex Optimization Theory (Athena Scientific, Belmont, MA).Google Scholar
Besbes O, Zeevi A (2009) Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Oper. Res. 57(6):1407–1420.Link, Google Scholar
Boyd S, Vandenberghe L (2004) Convex Optimization (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends. Machine Learning 5(1):1–122.Crossref, Google Scholar
Bubeck S, Devanur N, Huang Z, Niazadeh R (2019) Multi-scale online learning: Theory and applications to online auctions and pricing. J. Mach. Learn. Res. 20:1–37. Google Scholar
Cappé O, Garivier A, Maillard OA, Munos R, Stoltz G (2013) Kullback-leibler upper confidence bounds for optimal sequential allocation. Ann. Statist. 41(3):1516–1541.Crossref, Google Scholar
Chandrasekaran V, Shah P (2017) Relative entropy optimization and its applications. Math. Programming 161(1–2):1–32.Crossref, Google Scholar
Combes R, Magureanu S, Proutiere A (2017) Minimal exploration in structured stochastic bandits. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Adv. Neural Inf. Process. Syst. (Curran Associates, Inc., Red Hook, NY), 1763–1771.Google Scholar
Combes R, Proutiere A (2014) Unimodal bandits: Regret lower bounds and optimal algorithms. Internat. Conf. Machine Learn., 521–529.Google Scholar
Combettes P (2018) Perspective functions: Properties, constructions, and examples. Set-Valued Var. Anal. 26(2):247–264.Crossref, Google Scholar
Cover T, Thomas J (2012) Elements of Information Theory (John Wiley & Sons, New York).Google Scholar
Dani V, Hayes T, Kakade S (2008) Stochastic linear optimization under bandit feedback. Internat. Conf. Algorithmic Learn. Theory, 355–366.Google Scholar
Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. NIPS 23:586–594.Google Scholar
Garivier A, Cappé O (2011) The KL-UCB algorithm for bounded stochastic bandits and beyond. Proc. 24th Annual Conf. Learn. Theory, 359–376.Google Scholar
Golrezaei N, Javanmard A, Mirrokni V (2019) Dynamic incentive-aware learning: Robust pricing in contextual auctions. Adv. Neural Inf. Process. Syst. (NIPS, San Diego), 9756–9766.Google Scholar
Graves T, Lai T (1997) Asymptotically efficient adaptive choice of control laws incontrolled Markov chains. SIAM J. Control Optim. 35(3):715–743.Crossref, Google Scholar
Gupta S, Chaudhari S, Joshi G, Yağan O (2021) Multi-armed bandits with correlated arms. IEEE Trans. Inform. Theory 67(10):6711–6732.Crossref, Google Scholar
Gupta S, Chaudhari S, Mukherjee S, Joshi G, Yağan O (2020) A unified approach to translate classical bandit algorithms to the structured bandit setting. IEEE J. Selected Areas Inform. Theory 1(3):840–853.Crossref, Google Scholar
Hoeffding W (1994) Probability inequalities for sums of bounded random variables. The Collected Works of Wassily Hoeffding (Springer, New York), 409–426.Crossref, Google Scholar
Jun KS, Zhang C (2020) Crush optimism with pessimism: Structured bandits beyond asymptotic optimality. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inf. Process. Syst. (Curran Associates, Inc., Red Hook, NY), 6366–6376.Google Scholar
Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An asymptotically optimal finite-time analysis. International Conference on Algorithmic Learning Theory (Springer, New York), 199–213.Google Scholar
Keskin N, Zeevi A (2014) Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Oper. Res. 62(5):1142–1167.Link, Google Scholar
Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
Lattimore T, Munos R (2014) Bounded regret for finite-armed structured bandits. Adv. Neural Inf. Process. Syst. 27:550–558.Google Scholar
Lattimore T, Szepesvari C (2017) The end of optimism? An asymptotic analysis of finite-armed linear bandits. Proceedings of Machine Learning Research, vol. 54 (PMLR, Fort Lauderdale, FL), 728–737Google Scholar
Magureanu S, Combes R, Proutiere A (2014) Lipschitz bandits: Regret lower bound and optimal algorithms. Conf. Learn. Theory, Proceedings of Machine Learning Research Series, vol. 35 (PMLR, New York), 975–999.Google Scholar
Mao J, Leme R, Schneider J (2018) Contextual pricing for Lipschitz buyers. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inf. Process. Syst. (Curran Associates, Inc., Red Hook, NY), 5643–5651.Google Scholar
Mersereau A, Rusmevichientong P, Tsitsiklis J (2009) A structured multiarmed bandit problem and the greedy policy. IEEE Trans. Automat. Control. 54(12):2787–2802.Crossref, Google Scholar
Nesterov Y (2004) Introductory Lectures on Convex Optimization: A Basic Course (Kluwer Academic Publishers, Amsterdam).Crossref, Google Scholar
Niazadeh R, Golrezaei N, Wang J, Susan F, Badanidiyuru A (2020) Online learning via offline greedy: Applications in market design and optimization. Preprint, submitted May 29, http://dx.doi.org/10.2139/ssrn.3613756.Google Scholar
Rusmevichientong P, Tsitsiklis JN (2010) Linearly parameterized bandits. Math. Oper. Res. 35(2):395–411.Link, Google Scholar
Russo D, Van Roy B (2018) Learning to optimize via information-directed sampling. Oper. Res. 66(1):230–252.Link, Google Scholar
Slivkins A (2011) Contextual bandits with similarity information. Proc. 24th Annual Conf. Learn. Theory, 679–702.Google Scholar
Streeter M, Golovin D (2008) An online algorithm for maximizing submodular functions. Koller D, Schuurmans D, Bengio Y, Bottou L, eds. Adv. Neural Inf. Process. Syst. (Curran Associates, Inc., Red Hook, NY), 1577–1584.Google Scholar
Thompson W (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
Zhang M, Shen Z, Mokhtari A, Hassani H, Karbasi A (2020) One sample stochastic frank-wolfe. Internat. Conf. Artificial Intelligence Statist., 4012–4023.Google Scholar

Volume 70, Issue 6

June 2024

Pages 3381-4165, iii-v

Article Information

Metrics

Information

Received:July 14, 2020
Accepted:May 23, 2023
Published Online:August 16, 2023

Cite as

Bart Van Parys, Negin Golrezaei (2023) Optimal Learning for Structured Bandits. Management Science 70(6):3951-3998.

https://doi.org/10.1287/mnsc.2020.02108

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Optimal Learning for Structured Bandits

References

Volume 70, Issue 6

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News