Small-Loss Bounds for Online Learning with Partial Information

Thodoris Lykouris
Thodoris Lykouris
[email protected]
https://orcid.org/0000-0002-3375-5579
Massachusetts Institute of Technology, Cambridge, Massachusetts 02139;
Search for more papers by this author
,
Karthik Sridharan
Karthik Sridharan
[email protected]
Cornell University, Ithaca, New York 14850
Search for more papers by this author
,
Éva Tardos
Éva Tardos
[email protected]
Cornell University, Ithaca, New York 14850
Search for more papers by this author

Massachusetts Institute of Technology, Cambridge, Massachusetts 02139;

Search for more papers by this author

Karthik Sridharan

[email protected]

Cornell University, Ithaca, New York 14850

Search for more papers by this author

Éva Tardos

[email protected]

Cornell University, Ithaca, New York 14850

Search for more papers by this author

Published Online:25 Jan 2022https://doi.org/10.1287/moor.2021.1204

References

[1] Agarwal A, Krishnamurthy A, Langford J, Luo H, Schapire RE (2017) Open problem: First-order regret bounds for contextual bandits. Proc. 2017 Conf. Learn. Theory, vol. 65, 4–7.Google Scholar
[2] Allenberg C, Auer P, Györfi L, Ottucsák G (2006) Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. Proc. 17th Internat. Conf. Algorithmic Learn. Theory, 229–243.Google Scholar
[3] Allen-Zhu Z, Bubeck S, Li Y (2018) Make the minority great again: First-order regret bound for contextual bandits. Proc. 35th Internat. Conf. Machine Learn., 186–194.Google Scholar
[4] Alon N, Cesa-Bianchi N, Dekel O, Koren T (2015) Online learning with feedback graphs: Beyond bandits. Proc. 28th Conf. Learn. Theory, 23–35.Google Scholar
[5] Alon N, Cesa-Bianchi N, Gentile C, Mansour Y (2013) From bandits to experts: A tale of domination and independence. Proc. 26th Internat. Conf. Neural Inform. Processing Systems, 1610–1618.Google Scholar
[6] Alon N, Cesa-Bianchi N, Gentile C, Mannor S, Mansour Y, Shamir O (2017) Nonstochastic multi-armed bandits with graph-structured feedback. SIAM J. Comput. 46(6):1785–1826.Crossref, Google Scholar
[7] Audibert J, Bubeck S (2010) Regret bounds and minimax policies under partial monitoring. J. Machine Learn. Res. 11(94):2785–2836.Google Scholar
[8] Audibert JY, Bubeck S, Lugosi G (2014) Regret in online combinatorial optimization. Math. Oper. Res. 39(1):31–45.Link, Google Scholar
[9] Auer P, Cesa-Bianchi N, Gentile C (2002) Adaptive and self-confident on-line learning algorithms. J. Comput. System Sci. 64(1):48–75.Crossref, Google Scholar
[10] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2003) The nonstochastic multi-armed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
[11] Awerbuch B, Kleinberg RD (2004) Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. Proc. 36th Annual ACM Sympos. Theory Comput., 45–53.Google Scholar
[12] Beygelzimer A, Langford J, Li L, Reyzin L, Schapire R (2011) Contextual bandit algorithms with supervised learning guarantees. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (PMLR), 19–26.Google Scholar
[13] Blum A, Hartline JD (2005) Near-optimal online auctions. Proc. 16th Annual ACM-SIAM Sympos. Discrete Algorithms, 1156–1163.Google Scholar
[14] Blum A, Even-Dar E, Ligett K (2010) Routing without regret: On convergence to Nash equilibria of regret-minimizing algorithms in routing games. Theory Comput. 6(1):179–199.Crossref, Google Scholar
[15] Blum A, Hajiaghayi M, Ligett K, Roth A (2008) Regret minimization and the price of total anarchy. Proc. 40th Annual ACM Sympos. Theory Comput., 373–382.Google Scholar
[16] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learning 5(1):1–122. https://www.nowpublishers.com/article/Details/MAL-024.Google Scholar
[17] Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press).Crossref, Google Scholar
[18] Cesa-Bianchi N, Gentile C, Mansour Y (2013) Regret minimization for reserve prices in second-price auctions. Proc. 24th Annual ACM-SIAM Sympos. Discrete Algorithms, 1190–1204.Google Scholar
[19] Cesa-Bianchi N, Lugosi G, Stoltz G (2005) Minimizing regret with label efficient prediction. IEEE Trans. Inform. Theory 51(6):2152–2162.Crossref, Google Scholar
[20] Cohen A, Hazan T, Koren T (2016) Online learning with feedback graphs without the graphs. Proc. 33rd Internat. Conf. Machine Learn., 811–819.Google Scholar
[21] Cover TM (1991) Universal portfolios. Math. Finance 1(1):1–29.Crossref, Google Scholar
[22] Daniely A, Gonen A, Shalev-Shwartz S (2015) Strongly adaptive online learning. Proc. 32nd Internat. Conf. Machine Learn., 1405–1411.Google Scholar
[23] Foster DJ, Li Z, Lykouris T, Sridharan K, Tardos É (2016) Learning in games: Robustness of fast convergence. Annual Conf. Neural Inform. Processing Systems 2016, 4727–4735.Google Scholar
[24] Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55(1):119–139.Crossref, Google Scholar
[25] Hannan J (1957) Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, vol. 3 (Princeton University Press), 97–139.Google Scholar
[26] Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Machine Learn. 69(2–3):169–192.Crossref, Google Scholar
[27] Herbster M, Warmuth MK (1998) Tracking the best expert. Machine Learn. 32(2):151–178.Crossref, Google Scholar
[28] Kalai A, Vempala S (2005) Efficient algorithms for online decision problems. J. Comput. System Sci. 71(3):291–307.Crossref, Google Scholar
[29] Kocák T, Neu G, Valko M (2016) Online learning with noisy side observations. Proc. 19th Internat. Conf. Artificial Intelligence Statist., 1186–1194.Google Scholar
[30] Kocák T, Neu G, Valko M, Munos R (2014) Efficient learning by implicit exploration in bandit problems with side observations. Adv. Neural Inform. Processing Systems, 613–621.Google Scholar
[31] Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
[32] Langford J, Zhang T (2007) The epoch-greedy algorithm for contextual multi-armed bandits. Proc. 20th Internat. Conf. Neural Inform. Processing Systems, 817–824.Google Scholar
[33] Littlestone N, Warmuth MK (1994) The weighted majority algorithm. Inform. Comput. 108(2):212–261.Crossref, Google Scholar
[34] Liu YP, Sellke M (2018) Personal communication via email.Google Scholar
[35] Luo H, Schapire RE (2015) Achieving all with no parameters: Adanormalhedge. Proc. 28th Conf. Learn. Theory, 1286–1304.Google Scholar
[36] Lykouris T, Syrgkanis V, Tardos E (2016) Learning and efficiency in games with dynamic population. Proc. 27th Annual ACM-SIAM Sympos. Discrete Algorithms, 120–129.Google Scholar
[37] Mannor S, Shamir O (2011) From bandits to experts: On the value of side-observations. Proc. 24th Internat. Conf. Neural Inform. Processing Systems, 684–692.Google Scholar
[38] Neu G (2015) Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Annual Conf. Neural Inform. Processing Systems, 3168–3176.Google Scholar
[39] Neu G (2015) First-order regret bounds for combinatorial semi-bandits. Proc. 28th Conf. Learn. Theory, 1360–1375.Google Scholar
[40] Neu G, Bartók G (2016) Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. J. Machine Learn. Res. 17(1):5355–5375.Google Scholar
[41] Rakhlin A, Sridharan K (2013) Online learning with predictable sequences. Proc. 26th Annual Conf. Learn. Theory, 993–1019.Google Scholar
[42] Rakhlin A, Sridharan K (2014) Online non-parametric regression. Proc. 27th Conf. Learn. Theory, 1232–1264.Google Scholar
[43] Rakhlin A, Sridharan K (2016) Bistro: An efficient relaxation-based method for contextual bandits. Proc. 33rd Internat. Conf. Machine Learn., vol. 48, 1977–1985.Google Scholar
[44] Rakhlin A, Sridharan K (2017) On equivalence of martingale tail bounds and deterministic regret inequalities. Proc. 30th Conf. Learn. Theory, 1704–1722.Google Scholar
[45] Rakhlin A, Sridharan K, Tewari A (2010) Online learning: Random averages, combinatorial parameters, and learnability. Preprint, submitted June 6, https://arxiv.org/abs/1006.1138.Google Scholar
[46] Roughgarden T (2015) Intrinsic robustness of the price of anarchy. J. ACM 62(5)1–42.Crossref, Google Scholar
[47] Roughgarden T, Wang JR (2016) Minimizing regret with multiple reserves. Proc. 2016 ACM Conf. Econom. Comput., 601–616.Google Scholar
[48] Syrgkanis V, Krishnamurthy A, Schapire RE (2016a) Efficient algorithms for adversarial contextual learning. Proc. 33rd Internat. Conf. Machine Learn., 2159–2168.Google Scholar
[49] Syrgkanis V, Luo H, Krishnamurthy A, Schapire RE (2016b) Improved regret bounds for oracle-based adversarial contextual bandits. Proc. 30th Internat. Conf. Neural Inform. Processing Systems, 3143–3151.Google Scholar
[50] Tossou A, Dimitrakakis C, Dubhashi D (2017) Thompson sampling for stochastic bandits with graph feedback. Proc. Conf. AAAI Artificial Intelligence, 31(1).Google Scholar

cover image Mathematics of Operations Research

Volume 47, Issue 3

August 2022

Pages 1707-2545, C2

Article Information

Metrics

Information

Received:June 13, 2018
Accepted:June 30, 2021
Published Online:January 25, 2022

Cite as

Thodoris Lykouris, Karthik Sridharan, Éva Tardos (2022) Small-Loss Bounds for Online Learning with Partial Information. Mathematics of Operations Research 47(3):2186-2218.

https://doi.org/10.1287/moor.2021.1204

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Small-Loss Bounds for Online Learning with Partial Information

References

Volume 47, Issue 3

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News