Open Access

Optimal Exploration–Exploitation in a Multi-armed Bandit Problem with Non-stationary Rewards

Omar Besbes
Omar Besbes
Graduate School of Business, Columbia University, New York, New York 10027;
Search for more papers by this author
,
Yonatan Gur
Corresponding Author
Yonatan Gur
http://orcid.org/0000-0003-0764-3570
Graduate School of Business, Stanford University, Stanford, California 94305;
Search for more papers by this author
,
Assaf Zeevi
Assaf Zeevi
Graduate School of Business, Columbia University, New York, New York 10027
Search for more papers by this author

Omar Besbes

Graduate School of Business, Columbia University, New York, New York 10027;

Search for more papers by this author

Yonatan Gur

Corresponding Author

Yonatan Gur

http://orcid.org/0000-0003-0764-3570

Graduate School of Business, Stanford University, Stanford, California 94305;

Search for more papers by this author

Assaf Zeevi

Graduate School of Business, Columbia University, New York, New York 10027

Search for more papers by this author

Published Online:31 Oct 2019https://doi.org/10.1287/stsy.2019.0033

References

Audibert J, Bubeck S (2009) Minimax policies for adversarial and stochastic bandits. Proc. 22nd Annual Conf. Learn. Theory (COLT), Montreal.Google Scholar
Auer P, Cesa-Bianchi N, Fischer P (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–246.Google Scholar
Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002b) The non-stochastic multi-armed bandit problem. SIAM J. Comput. 32(1):48–77.Google Scholar
Azar MG, Lazaric A, Brunskill E (2014). Online stochastic optimization under correlated bandit feedback. Proc. 31st Internat. Conf. Machine Learn., Beijing.Google Scholar
Bergemann D, Valimaki J (1996) Learning and strategic pricing. Econometrica: J. Econometric Soc. 64(5):1125–1149.Google Scholar
Bergemann D, Hege U (2005) The financing of innovation: Learning and stopping. RAND J. Econom. 36(4):719–752.Google Scholar
Berry DA, Fristedt B (1985) Bandit Problems: Sequential Allocation of Experiments (Chapman and Hall, London).Google Scholar
Bertsimas D, Nino-Mora J (2000) Restless bandits, linear programming relaxations, and primal dual index heuristic. Oper. Res. 48(1):80–90.Link, Google Scholar
Besbes O, Gur Y, Zeevi A (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, eds. Proc. 27th Internat. Conf. Neural Inform. Processing Systems, vol. 1 (MIT Press, Cambridge, MA), 199–207.Google Scholar
Besbes O, Gur Y, Zeevi A (2015) Non-stationary stochastic optimization. Oper. Res. 63(5):1227–1244.Link, Google Scholar
Blackwell D (1956) An analog of the minimax theorem for vector payoffs. Pacific J. Math. 6(1):1–8.Google Scholar
Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Google Scholar
Cao Y, Zheng W, Kveton B, Xie Y (2019) Nearly optimal adaptive procedure for piecewise-stationary bandit: A change-point detection approach. 22nd Internat. Conf. Artificial Intelligence Statist., Okinawa, Japan.Google Scholar
Caro F, Gallien J (2007) Dynamic assortment with demand learning for seasonal consumer goods. Management Sci. 53(2):276–292.Google Scholar
Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press, Cambridge, UK).Google Scholar
Cheung WC, Simchi-Levi D, Zhu R (2019). Learning to optimize under non-stationarity. 22nd Internat. Conf. Artificial Intelligence Statist., Okinawa, Japan.Google Scholar
Foster DP, Vohra RV (1999) Regret in the on-line decision problem. Games Econom. Behav. 29(1–2):7–35.Google Scholar
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55(1):119–139.Google Scholar
Garivier A, Moulines E (2011) On upper-confidence bound policies for switching bandit problems. Kivinen J, Szepesvari C, Ukkonen E, Zeugmann T, eds. Algorithmic learning theory, Lecture Notes in Computer Science, vol. 6925 (Springer, Berlin, Heidelberg), 174–188.Google Scholar
Gittins JC (1979) Bandit processes and dynamic allocation indices (with discussion). J. Roy. Statist. Soc. B. 41(2):148–177.Google Scholar
Gittins JC (1989) Multi-Armed Bandit Allocation Indices (John Wiley & Sons, New York).Google Scholar
Gittins JC, Jones DM (1974) A dynamic allocation index for the sequential design of experiments. Gani J, Sarkadi K, Vincze I, eds. Progress in Statistics (North-Holland, Amsterdam), 241–266. Google Scholar
Guha S, Munagala K (2007) Approximation algorithms for partial-information based stochastic control with markovian rewards. 48th Annual IEEE Sympos. Foundations Comput. Sci. (IEEE, Piscataway, NJ), 483–493.Google Scholar
Hannan J (1957) Approximation to Bayes Risk in Repeated Play, Contributions to the Theory of Games, vol. 3. (Princeton University Press, Princeton, NJ).Google Scholar
Hazan E, Kale S (2011) Better algorithms for benign bandits. J. Machine Learn. Res. 12:1287–1311.Google Scholar
Jadbabaie A, Rakhlin A, Shahrampour S, Sridharan K (2015) Online optimization: Competing with dynamic comparators. Lebanon G, Vishwanathan SVN, eds. 18th Internat. Conf. Artificial Intelligence Statist., San Diego.Google Scholar
Karnin Z, Anava O (2016) Multi-armed bandits: Competing with optimal sequences. Proc. Adv. Neural Inform. Processing Systems, 199–207.Google Scholar
Kleinberg R, Leighton T (2003) The value of knowing a demand curve: Bounds on regret for online posted-price auctions. Proc. 44th Annual IEEE Sympos. Foundations Comput. Sci. (IEEE, Piscataway, NJ), 594–605.Google Scholar
Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Google Scholar
Levine N, Crammer K, Mannor S (2017) Rotting bandits. Proc. Adv. Neural Inform. Processing Systems, 3074–3083.Google Scholar
Luo H, Wei C, Agarwal A, Langford J (2018). Efficient contextual bandits in nonstationary worlds. 31st Conf Learn Theory.Google Scholar
Ortner R, Ryabko D, Auer P, Munos R (2014) Regret bounds for restless markov bandits. Theoretical Comp. Sci. 558(November):62–76.Google Scholar
Pandey S, Agarwal D, Chakrabarti D, Josifovski V (2007) Bandits for taxonomies: A model-based approach. Proc. 2007 SIAM Internat. Conf. Data Mining, Minneapolis.Google Scholar
Papadimitriou CH, Tsitsiklis JN (1994). The complexity of optimal queueing network control. Structure Complexity Theory Conf., 318–322.Google Scholar
Robbins H (1952) Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. (N.S.). 58(5):527–535.Google Scholar
Slivkins A (2014) Contextual bandits with similarity information. J. Machine Learn. Res. 15(1):2533–2568.Google Scholar
Slivkins A, Upfal E (2008). Adapting to a changing environment: The Brownian restless bandits. Proc. 21st Annual Conf. Learn. Theory, 343–354.Google Scholar
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika. 25(3):285–294.Google Scholar
Wei C, Hong Y, Lu C (2016) Tracking the best expert in non-stationary stochastic environments. Proc. Adv. Neural Inform. Processing Systems, 3972–3980.Google Scholar
Whittle P (1981) Arm acquiring bandits. Ann. Probab. 9(2):284–292.Google Scholar
Whittle P (1988) Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 25A:287–298.Google Scholar
Zelen M (1969) Play the winner rule and the controlled clinical trials. J. Amer. Statist. Assoc. 64(325):131–146.Google Scholar
Zhang L, Yang T, Zhou Z (2018). Dynamic regret of strongly adaptive methods. Proc. 35th Internat. Conf. Machine Learn., 5877–5886.Google Scholar

Volume 9, Issue 4

December 2019

Pages 319-416

Article Information

Metrics

Information

Received:April 26, 2018
Accepted:March 11, 2019
Published Online:October 31, 2019

Cite as

Omar Besbes, Yonatan Gur, Assaf Zeevi (2019) Optimal Exploration–Exploitation in a Multi-armed Bandit Problem with Non-stationary Rewards. Stochastic Systems 9(4):319-337.

https://doi.org/10.1287/stsy.2019.0033

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Optimal Exploration–Exploitation in a Multi-armed Bandit Problem with Non-stationary Rewards

References

Volume 9, Issue 4

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News