Satisficing in Time-Sensitive Bandit Learning

Daniel Russo
Daniel Russo
[email protected]
https://orcid.org/0000-0001-5926-8624
Columbia Business School, Columbia University, New York, New York 10027;
Search for more papers by this author
,
Benjamin Van Roy
Benjamin Van Roy
[email protected]
Department of Electrical Engineering and Department of Management Science and Engineering, Stanford University, Stanford, California 94305
Search for more papers by this author

Columbia Business School, Columbia University, New York, New York 10027;

Department of Electrical Engineering and Department of Management Science and Engineering, Stanford University, Stanford, California 94305

Search for more papers by this author

Published Online:14 Mar 2022https://doi.org/10.1287/moor.2021.1229

References

[1] Agrawal S, Goyal N (2013) Further optimal regret bounds for Thompson sampling. Proc. Sixteenth Internat. Conf. Artificial Intelligence Statist., 99–107.Google Scholar
[2] Berry DA, Chen RW, Zame A, Heath DC, Shepp LA (1997) Bandit problems with infinitely many arms. Ann. Statist. 25(5):2103–2116.Crossref, Google Scholar
[3] Bonald T, Proutiere A (2013) Two-target algorithms for infinite-armed bandits with Bernoulli rewards. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Inc., Red Hook, NY), 2184–2192.Google Scholar
[4] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Crossref, Google Scholar
[5] Bubeck S, Eldan R (2016) Multi-scale exploration of convex functions and bandit convex optimization. Conf. Learn. Theory (PMLR), 583–589.Google Scholar
[6] Bubeck S, Munos R, Stoltz G, Szepesvári C (2011) X-armed bandits. J. Machine Learn. Res. 12:1655–1695.Google Scholar
[7] Cappé O, Garivier A, Maillard OA, Munos R, Stoltz G (2013) Kullback-Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist. 41(3):1516–1541.Crossref, Google Scholar
[8] Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Shawe-Taylor J, Zemel RS, Bartlett PL, eds. Proc. 24th Internat. Conf. Neural Inform. Processing System (Curran Associates Inc., Red Hook, NY), 2249–2257.Google Scholar
[9] Cover T, Thomas J (2012) Elements of Information Theory (John Wiley & Sons, New York).Google Scholar
[10] Deshpande Y, Montanari A (2012) Linear bandits in high dimension and recommendation systems. 50th Annual Allerton Conf. Comm. Control Comput. (IEEE, Piscataway, NJ), 1750–1754.Google Scholar
[11] Francetich A, Kreps D (2020) Choosing a good toolkit, i: Prior-free heuristics. J. Econom. Dynam. Control 111:103813.Crossref, Google Scholar
[12] Francetich A, Kreps D (2020) Choosing a good toolkit, ii: Bayes-rule based heuristics. J. Econom. Dynam. Control 111:103814.Crossref, Google Scholar
[13] Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best-arm identification in multi-armed bandit models. J. Machine Learn. Res. 17(1):1–42.Google Scholar
[14] Kleinberg R, Slivkins A, Upfal E (2008) Multi-armed bandits in metric spaces. Proc. 40th ACM Sympos. Theory Comput. (ACM, New York).Crossref, Google Scholar
[15] Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Advances Appl. Math. 6(1):4–22.Crossref, Google Scholar
[16] Lattimore T, Szepesvári C (2019) An information-theoretic approach to minimax regret in partial monitoring. Beygelzimer A, Hsu D, eds. Proc. Thirty-Second Conf. Learn. Theory, vol. 99 (PMLR), 2111–2139.Google Scholar
[17] Liu F, Buccapatnam S, Shroff N (2018) Information directed sampling for stochastic bandits with graph feedback. Proc. AAAI Conf. Artificial Intelligence, vol. 32 (AAAI, Palo Alto CA).Google Scholar
[18] Lu X, Van Roy B (2019) Information-theoretic confidence bounds for reinforcement learning. Advances in Neural Information Processing Systems (Neurips) 32 (Curran Associates, Inc., Red Hook, NY), 2458–2466.Google Scholar
[19] Rusmevichientong P, Tsitsiklis J (2010) Linearly parameterized bandits. Math. Oper. Res. 35(2):395–411.Link, Google Scholar
[20] Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
[21] Russo D, Van Roy B (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(68):1–30.Google Scholar
[22] Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z (2018) A tutorial on Thompson sampling. Foundations Trends Machine Learn. 11(1):1–96.Crossref, Google Scholar
[23] Ryzhov I, Powell W, Frazier P (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180–195.Link, Google Scholar
[24] Scott S (2010) A modern Bayesian look at the multi-armed bandit. Appl. Stochastic Models Bus. Indust. 26(6):639–658.Crossref, Google Scholar
[25] Simon HA (1979) Rational decision making in business organizations. Amer. Econom. Rev. 69(4):493–513.Google Scholar
[26] Thompson W (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
[27] Tossou AC, Dimitrakakis C, Dubhashi DP (2017) Thompson sampling for stochastic bandits with graph feedback. Proc. 31st AAAI Conf. Artificial Intelligence (AAAI, Palo Alto, CA), 2660–2666.Crossref, Google Scholar
[28] Tsybakov AB (2008) Introduction to Nonparametric Estimation (Springer Science & Business Media, Berlin).Google Scholar
[29] Wang Y, Audibert JY, Munos R (2009) Algorithms for infinitely many-armed bandits. Advances in Neural Information Processing Systems (NIPS), (Curran Associates, Inc., Red Hook, NY), 1729–1736.Google Scholar

cover image Mathematics of Operations Research

Volume 47, Issue 4

November 2022

Pages 2547-3399, C2

Article Information

Supplemental Material

Metrics

Information

Received:March 05, 2018
Accepted:December 02, 2020
Published Online:March 14, 2022

Cite as

Daniel Russo, Benjamin Van Roy (2022) Satisficing in Time-Sensitive Bandit Learning. Mathematics of Operations Research 47(4):2815-2839.

https://doi.org/10.1287/moor.2021.1229

Keywords

Acknowledgments

A special thanks is owed to David Tse, who played an important role in the early stages of this work. It was David who first emphasized that bounds based on entropy can be vacuous and pointed us to references on rate-distortion theory. The authors also thank Tor Lattimore for thoughtful comments on an early draft of this work.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Satisficing in Time-Sensitive Bandit Learning

References

Volume 47, Issue 4

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News