Technical Note—The Elliptical Potential Lemma for General Distributions with an Application to Linear Thompson Sampling

Published Online:https://doi.org/10.1287/opre.2022.2274

References

  • Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Proc. 24th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Red Hook NY), 2312–2320.Google Scholar
  • Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. PMLR 28(3):127–135.Google Scholar
  • Auer P (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3(Nov):397–422.Google Scholar
  • Boyd SP, Vandenberghe L (2004) Convex Optimization (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Bubeck S, Cesa-Bianchi N, Lugosi G (2013) Bandits with heavy tail. IEEE Trans. Inform. Theory 59(11):7711–7717.CrossrefGoogle Scholar
  • Carpentier A, Vernade C, Abbasi-Yadkori Y (2020) The elliptical potential lemma revisited. Preprint, submitted October 20, https://arxiv.org/abs/2010.10182.Google Scholar
  • Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. J. Machine Learn. Res. 15:208–214.Google Scholar
  • Dani V, Hayes TP, Kakade SM (2008) Stochastic linear optimization under bandit feedback. 21st Annual Conf. Learn. Theory (Omnipress, Madison, WI), 355–366.Google Scholar
  • Dong S, Van Roy B (2018) An information-theoretic analysis for Thompson sampling with many actions. Bengio S, Wallach HM, eds. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 4161–4169.Google Scholar
  • Hamidi N, Bayati M (2020) On worst-case regret of linear Thompson sampling. Preprint, submitted June 11, https://arxiv.org/abs/2006.06790.Google Scholar
  • Kalkanli C, Özgür A (2020) An improved regret bound for Thompson sampling in the gaussian linear bandit setting. 2020 IEEE Internat. Sympos. Inform. Theory (ISIT) (IEEE, Piscataway, NJ), 2783–2788.Google Scholar
  • Lai TL, Ching ZW (1982) Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. Ann. Statist. 10(1):154–166.CrossrefGoogle Scholar
  • Li Y, Wang Y, Zhou Y (2019) Nearly minimax-optimal regret for linearly parameterized bandits. Preprint, submitted March 30, https://arxiv.org/abs/1904.00242Google Scholar
  • Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.LinkGoogle Scholar
  • Russo D, Van Roy B (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(1):2442–2471.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.