Small-Loss Bounds for Online Learning with Partial Information

Published Online:https://doi.org/10.1287/moor.2021.1204

References

  • [1] Agarwal A, Krishnamurthy A, Langford J, Luo H, Schapire RE (2017) Open problem: First-order regret bounds for contextual bandits. Proc. 2017 Conf. Learn. Theory, vol. 65, 4–7.Google Scholar
  • [2] Allenberg C, Auer P, Györfi L, Ottucsák G (2006) Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. Proc. 17th Internat. Conf. Algorithmic Learn. Theory, 229–243.Google Scholar
  • [3] Allen-Zhu Z, Bubeck S, Li Y (2018) Make the minority great again: First-order regret bound for contextual bandits. Proc. 35th Internat. Conf. Machine Learn., 186–194.Google Scholar
  • [4] Alon N, Cesa-Bianchi N, Dekel O, Koren T (2015) Online learning with feedback graphs: Beyond bandits. Proc. 28th Conf. Learn. Theory, 23–35.Google Scholar
  • [5] Alon N, Cesa-Bianchi N, Gentile C, Mansour Y (2013) From bandits to experts: A tale of domination and independence. Proc. 26th Internat. Conf. Neural Inform. Processing Systems, 1610–1618.Google Scholar
  • [6] Alon N, Cesa-Bianchi N, Gentile C, Mannor S, Mansour Y, Shamir O (2017) Nonstochastic multi-armed bandits with graph-structured feedback. SIAM J. Comput. 46(6):1785–1826.CrossrefGoogle Scholar
  • [7] Audibert J, Bubeck S (2010) Regret bounds and minimax policies under partial monitoring. J. Machine Learn. Res. 11(94):2785–2836.Google Scholar
  • [8] Audibert JY, Bubeck S, Lugosi G (2014) Regret in online combinatorial optimization. Math. Oper. Res. 39(1):31–45.LinkGoogle Scholar
  • [9] Auer P, Cesa-Bianchi N, Gentile C (2002) Adaptive and self-confident on-line learning algorithms. J. Comput. System Sci. 64(1):48–75.CrossrefGoogle Scholar
  • [10] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2003) The nonstochastic multi-armed bandit problem. SIAM J. Comput. 32(1):48–77.CrossrefGoogle Scholar
  • [11] Awerbuch B, Kleinberg RD (2004) Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. Proc. 36th Annual ACM Sympos. Theory Comput., 45–53.Google Scholar
  • [12] Beygelzimer A, Langford J, Li L, Reyzin L, Schapire R (2011) Contextual bandit algorithms with supervised learning guarantees. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (PMLR), 19–26.Google Scholar
  • [13] Blum A, Hartline JD (2005) Near-optimal online auctions. Proc. 16th Annual ACM-SIAM Sympos. Discrete Algorithms, 1156–1163.Google Scholar
  • [14] Blum A, Even-Dar E, Ligett K (2010) Routing without regret: On convergence to Nash equilibria of regret-minimizing algorithms in routing games. Theory Comput. 6(1):179–199.CrossrefGoogle Scholar
  • [15] Blum A, Hajiaghayi M, Ligett K, Roth A (2008) Regret minimization and the price of total anarchy. Proc. 40th Annual ACM Sympos. Theory Comput., 373–382.Google Scholar
  • [16] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learning 5(1):1–122. https://www.nowpublishers.com/article/Details/MAL-024.Google Scholar
  • [17] Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press).CrossrefGoogle Scholar
  • [18] Cesa-Bianchi N, Gentile C, Mansour Y (2013) Regret minimization for reserve prices in second-price auctions. Proc. 24th Annual ACM-SIAM Sympos. Discrete Algorithms, 1190–1204.Google Scholar
  • [19] Cesa-Bianchi N, Lugosi G, Stoltz G (2005) Minimizing regret with label efficient prediction. IEEE Trans. Inform. Theory 51(6):2152–2162.CrossrefGoogle Scholar
  • [20] Cohen A, Hazan T, Koren T (2016) Online learning with feedback graphs without the graphs. Proc. 33rd Internat. Conf. Machine Learn., 811–819.Google Scholar
  • [21] Cover TM (1991) Universal portfolios. Math. Finance 1(1):1–29.CrossrefGoogle Scholar
  • [22] Daniely A, Gonen A, Shalev-Shwartz S (2015) Strongly adaptive online learning. Proc. 32nd Internat. Conf. Machine Learn., 1405–1411.Google Scholar
  • [23] Foster DJ, Li Z, Lykouris T, Sridharan K, Tardos É (2016) Learning in games: Robustness of fast convergence. Annual Conf. Neural Inform. Processing Systems 2016, 4727–4735.Google Scholar
  • [24] Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55(1):119–139.CrossrefGoogle Scholar
  • [25] Hannan J (1957) Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, vol. 3 (Princeton University Press), 97–139.Google Scholar
  • [26] Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Machine Learn. 69(2–3):169–192.CrossrefGoogle Scholar
  • [27] Herbster M, Warmuth MK (1998) Tracking the best expert. Machine Learn. 32(2):151–178.CrossrefGoogle Scholar
  • [28] Kalai A, Vempala S (2005) Efficient algorithms for online decision problems. J. Comput. System Sci. 71(3):291–307.CrossrefGoogle Scholar
  • [29] Kocák T, Neu G, Valko M (2016) Online learning with noisy side observations. Proc. 19th Internat. Conf. Artificial Intelligence Statist., 1186–1194.Google Scholar
  • [30] Kocák T, Neu G, Valko M, Munos R (2014) Efficient learning by implicit exploration in bandit problems with side observations. Adv. Neural Inform. Processing Systems, 613–621.Google Scholar
  • [31] Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.CrossrefGoogle Scholar
  • [32] Langford J, Zhang T (2007) The epoch-greedy algorithm for contextual multi-armed bandits. Proc. 20th Internat. Conf. Neural Inform. Processing Systems, 817–824.Google Scholar
  • [33] Littlestone N, Warmuth MK (1994) The weighted majority algorithm. Inform. Comput. 108(2):212–261.CrossrefGoogle Scholar
  • [34] Liu YP, Sellke M (2018) Personal communication via email.Google Scholar
  • [35] Luo H, Schapire RE (2015) Achieving all with no parameters: Adanormalhedge. Proc. 28th Conf. Learn. Theory, 1286–1304.Google Scholar
  • [36] Lykouris T, Syrgkanis V, Tardos E (2016) Learning and efficiency in games with dynamic population. Proc. 27th Annual ACM-SIAM Sympos. Discrete Algorithms, 120–129.Google Scholar
  • [37] Mannor S, Shamir O (2011) From bandits to experts: On the value of side-observations. Proc. 24th Internat. Conf. Neural Inform. Processing Systems, 684–692.Google Scholar
  • [38] Neu G (2015) Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Annual Conf. Neural Inform. Processing Systems, 3168–3176.Google Scholar
  • [39] Neu G (2015) First-order regret bounds for combinatorial semi-bandits. Proc. 28th Conf. Learn. Theory, 1360–1375.Google Scholar
  • [40] Neu G, Bartók G (2016) Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. J. Machine Learn. Res. 17(1):5355–5375.Google Scholar
  • [41] Rakhlin A, Sridharan K (2013) Online learning with predictable sequences. Proc. 26th Annual Conf. Learn. Theory, 993–1019.Google Scholar
  • [42] Rakhlin A, Sridharan K (2014) Online non-parametric regression. Proc. 27th Conf. Learn. Theory, 1232–1264.Google Scholar
  • [43] Rakhlin A, Sridharan K (2016) Bistro: An efficient relaxation-based method for contextual bandits. Proc. 33rd Internat. Conf. Machine Learn., vol. 48, 1977–1985.Google Scholar
  • [44] Rakhlin A, Sridharan K (2017) On equivalence of martingale tail bounds and deterministic regret inequalities. Proc. 30th Conf. Learn. Theory, 1704–1722.Google Scholar
  • [45] Rakhlin A, Sridharan K, Tewari A (2010) Online learning: Random averages, combinatorial parameters, and learnability. Preprint, submitted June 6, https://arxiv.org/abs/1006.1138.Google Scholar
  • [46] Roughgarden T (2015) Intrinsic robustness of the price of anarchy. J. ACM 62(5)1–42.CrossrefGoogle Scholar
  • [47] Roughgarden T, Wang JR (2016) Minimizing regret with multiple reserves. Proc. 2016 ACM Conf. Econom. Comput., 601–616.Google Scholar
  • [48] Syrgkanis V, Krishnamurthy A, Schapire RE (2016a) Efficient algorithms for adversarial contextual learning. Proc. 33rd Internat. Conf. Machine Learn., 2159–2168.Google Scholar
  • [49] Syrgkanis V, Luo H, Krishnamurthy A, Schapire RE (2016b) Improved regret bounds for oracle-based adversarial contextual bandits. Proc. 30th Internat. Conf. Neural Inform. Processing Systems, 3143–3151.Google Scholar
  • [50] Tossou A, Dimitrakakis C, Dubhashi D (2017) Thompson sampling for stochastic bandits with graph feedback. Proc. Conf. AAAI Artificial Intelligence, 31(1).Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.