Optimal Learning for Structured Bandits

Published Online:https://doi.org/10.1287/mnsc.2020.02108

References

  • Agrawal R, Teneketzis D, Anantharam V (1989) Asymptotically efficient adaptive allocation schemes for controlled iid processes: Finite parameter space. IEEE Trans. Automat. Control 34(3):258–267.CrossrefGoogle Scholar
  • Aubin JP, Frankowska H (2009) Set-Valued Analysis (Springer Science & Business Media, New York).CrossrefGoogle Scholar
  • Balseiro S, Golrezaei N, Mahdian M, Mirrokni V, Schneider J (2019) Contextual bandits with cross-learning. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Adv. Neural Inf. Process. Syst. (Curran Associates, Inc., Red Hook, NY), 9676–9685.Google Scholar
  • Bank B, Guddat J, Klatte D, Kummer B, Tammer K (1982) Non-Linear Parametric Optimization (Springer, Wiesbaden, Germany).CrossrefGoogle Scholar
  • Barvinok A (2002) A Course in Convexity, vol. 54 (American Mathematical Society, Providence, RI).CrossrefGoogle Scholar
  • Berge C (1997) Topological Spaces: Including a Treatment of Multi-Valued Functions, Vector Spaces, and Convexity (Dover Publications, Mineola, NY).Google Scholar
  • Bertsekas D (2009) Convex Optimization Theory (Athena Scientific, Belmont, MA).Google Scholar
  • Besbes O, Zeevi A (2009) Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Oper. Res. 57(6):1407–1420.LinkGoogle Scholar
  • Boyd S, Vandenberghe L (2004) Convex Optimization (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends. Machine Learning 5(1):1–122.CrossrefGoogle Scholar
  • Bubeck S, Devanur N, Huang Z, Niazadeh R (2019) Multi-scale online learning: Theory and applications to online auctions and pricing. J. Mach. Learn. Res. 20:1–37. Google Scholar
  • Cappé O, Garivier A, Maillard OA, Munos R, Stoltz G (2013) Kullback-leibler upper confidence bounds for optimal sequential allocation. Ann. Statist. 41(3):1516–1541.CrossrefGoogle Scholar
  • Chandrasekaran V, Shah P (2017) Relative entropy optimization and its applications. Math. Programming 161(1–2):1–32.CrossrefGoogle Scholar
  • Combes R, Magureanu S, Proutiere A (2017) Minimal exploration in structured stochastic bandits. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Adv. Neural Inf. Process. Syst. (Curran Associates, Inc., Red Hook, NY), 1763–1771.Google Scholar
  • Combes R, Proutiere A (2014) Unimodal bandits: Regret lower bounds and optimal algorithms. Internat. Conf. Machine Learn., 521–529.Google Scholar
  • Combettes P (2018) Perspective functions: Properties, constructions, and examples. Set-Valued Var. Anal. 26(2):247–264.CrossrefGoogle Scholar
  • Cover T, Thomas J (2012) Elements of Information Theory (John Wiley & Sons, New York).Google Scholar
  • Dani V, Hayes T, Kakade S (2008) Stochastic linear optimization under bandit feedback. Internat. Conf. Algorithmic Learn. Theory, 355–366.Google Scholar
  • Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. NIPS 23:586–594.Google Scholar
  • Garivier A, Cappé O (2011) The KL-UCB algorithm for bounded stochastic bandits and beyond. Proc. 24th Annual Conf. Learn. Theory, 359–376.Google Scholar
  • Golrezaei N, Javanmard A, Mirrokni V (2019) Dynamic incentive-aware learning: Robust pricing in contextual auctions. Adv. Neural Inf. Process. Syst. (NIPS, San Diego), 9756–9766.Google Scholar
  • Graves T, Lai T (1997) Asymptotically efficient adaptive choice of control laws incontrolled Markov chains. SIAM J. Control Optim. 35(3):715–743.CrossrefGoogle Scholar
  • Gupta S, Chaudhari S, Joshi G, Yağan O (2021) Multi-armed bandits with correlated arms. IEEE Trans. Inform. Theory 67(10):6711–6732.CrossrefGoogle Scholar
  • Gupta S, Chaudhari S, Mukherjee S, Joshi G, Yağan O (2020) A unified approach to translate classical bandit algorithms to the structured bandit setting. IEEE J. Selected Areas Inform. Theory 1(3):840–853.CrossrefGoogle Scholar
  • Hoeffding W (1994) Probability inequalities for sums of bounded random variables. The Collected Works of Wassily Hoeffding (Springer, New York), 409–426.CrossrefGoogle Scholar
  • Jun KS, Zhang C (2020) Crush optimism with pessimism: Structured bandits beyond asymptotic optimality. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inf. Process. Syst. (Curran Associates, Inc., Red Hook, NY), 6366–6376.Google Scholar
  • Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An asymptotically optimal finite-time analysis. International Conference on Algorithmic Learning Theory (Springer, New York), 199–213.Google Scholar
  • Keskin N, Zeevi A (2014) Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Oper. Res. 62(5):1142–1167.LinkGoogle Scholar
  • Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.CrossrefGoogle Scholar
  • Lattimore T, Munos R (2014) Bounded regret for finite-armed structured bandits. Adv. Neural Inf. Process. Syst. 27:550–558.Google Scholar
  • Lattimore T, Szepesvari C (2017) The end of optimism? An asymptotic analysis of finite-armed linear bandits. Proceedings of Machine Learning Research, vol. 54 (PMLR, Fort Lauderdale, FL), 728–737Google Scholar
  • Magureanu S, Combes R, Proutiere A (2014) Lipschitz bandits: Regret lower bound and optimal algorithms. Conf. Learn. Theory, Proceedings of Machine Learning Research Series, vol. 35 (PMLR, New York), 975–999.Google Scholar
  • Mao J, Leme R, Schneider J (2018) Contextual pricing for Lipschitz buyers. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inf. Process. Syst. (Curran Associates, Inc., Red Hook, NY), 5643–5651.Google Scholar
  • Mersereau A, Rusmevichientong P, Tsitsiklis J (2009) A structured multiarmed bandit problem and the greedy policy. IEEE Trans. Automat. Control. 54(12):2787–2802.CrossrefGoogle Scholar
  • Nesterov Y (2004) Introductory Lectures on Convex Optimization: A Basic Course (Kluwer Academic Publishers, Amsterdam).CrossrefGoogle Scholar
  • Niazadeh R, Golrezaei N, Wang J, Susan F, Badanidiyuru A (2020) Online learning via offline greedy: Applications in market design and optimization. Preprint, submitted May 29, http://dx.doi.org/10.2139/ssrn.3613756.Google Scholar
  • Rusmevichientong P, Tsitsiklis JN (2010) Linearly parameterized bandits. Math. Oper. Res. 35(2):395–411.LinkGoogle Scholar
  • Russo D, Van Roy B (2018) Learning to optimize via information-directed sampling. Oper. Res. 66(1):230–252.LinkGoogle Scholar
  • Slivkins A (2011) Contextual bandits with similarity information. Proc. 24th Annual Conf. Learn. Theory, 679–702.Google Scholar
  • Streeter M, Golovin D (2008) An online algorithm for maximizing submodular functions. Koller D, Schuurmans D, Bengio Y, Bottou L, eds. Adv. Neural Inf. Process. Syst. (Curran Associates, Inc., Red Hook, NY), 1577–1584.Google Scholar
  • Thompson W (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.CrossrefGoogle Scholar
  • Zhang M, Shen Z, Mokhtari A, Hassani H, Karbasi A (2020) One sample stochastic frank-wolfe. Internat. Conf. Artificial Intelligence Statist., 4012–4023.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.