Optimal Exploration–Exploitation in a Multi-armed Bandit Problem with Non-stationary Rewards

Published Online:https://doi.org/10.1287/stsy.2019.0033

References

  • Audibert J, Bubeck S (2009) Minimax policies for adversarial and stochastic bandits. Proc. 22nd Annual Conf. Learn. Theory (COLT), Montreal.Google Scholar
  • Auer P, Cesa-Bianchi N, Fischer P (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–246.Google Scholar
  • Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002b) The non-stochastic multi-armed bandit problem. SIAM J. Comput. 32(1):48–77.Google Scholar
  • Azar MG, Lazaric A, Brunskill E (2014). Online stochastic optimization under correlated bandit feedback. Proc. 31st Internat. Conf. Machine Learn., Beijing.Google Scholar
  • Bergemann D, Valimaki J (1996) Learning and strategic pricing. Econometrica: J. Econometric Soc. 64(5):1125–1149.Google Scholar
  • Bergemann D, Hege U (2005) The financing of innovation: Learning and stopping. RAND J. Econom. 36(4):719–752.Google Scholar
  • Berry DA, Fristedt B (1985) Bandit Problems: Sequential Allocation of Experiments (Chapman and Hall, London).Google Scholar
  • Bertsimas D, Nino-Mora J (2000) Restless bandits, linear programming relaxations, and primal dual index heuristic. Oper. Res. 48(1):80–90.LinkGoogle Scholar
  • Besbes O, Gur Y, Zeevi A (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, eds. Proc. 27th Internat. Conf. Neural Inform. Processing Systems, vol. 1 (MIT Press, Cambridge, MA), 199–207.Google Scholar
  • Besbes O, Gur Y, Zeevi A (2015) Non-stationary stochastic optimization. Oper. Res. 63(5):1227–1244.LinkGoogle Scholar
  • Blackwell D (1956) An analog of the minimax theorem for vector payoffs. Pacific J. Math. 6(1):1–8.Google Scholar
  • Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Google Scholar
  • Cao Y, Zheng W, Kveton B, Xie Y (2019) Nearly optimal adaptive procedure for piecewise-stationary bandit: A change-point detection approach. 22nd Internat. Conf. Artificial Intelligence Statist., Okinawa, Japan.Google Scholar
  • Caro F, Gallien J (2007) Dynamic assortment with demand learning for seasonal consumer goods. Management Sci. 53(2):276–292.Google Scholar
  • Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press, Cambridge, UK).Google Scholar
  • Cheung WC, Simchi-Levi D, Zhu R (2019). Learning to optimize under non-stationarity. 22nd Internat. Conf. Artificial Intelligence Statist., Okinawa, Japan.Google Scholar
  • Foster DP, Vohra RV (1999) Regret in the on-line decision problem. Games Econom. Behav. 29(1–2):7–35.Google Scholar
  • Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55(1):119–139.Google Scholar
  • Garivier A, Moulines E (2011) On upper-confidence bound policies for switching bandit problems. Kivinen J, Szepesvari C, Ukkonen E, Zeugmann T, eds. Algorithmic learning theory, Lecture Notes in Computer Science, vol. 6925 (Springer, Berlin, Heidelberg), 174–188.Google Scholar
  • Gittins JC (1979) Bandit processes and dynamic allocation indices (with discussion). J. Roy. Statist. Soc. B. 41(2):148–177.Google Scholar
  • Gittins JC (1989) Multi-Armed Bandit Allocation Indices (John Wiley & Sons, New York).Google Scholar
  • Gittins JC, Jones DM (1974) A dynamic allocation index for the sequential design of experiments. Gani J, Sarkadi K, Vincze I, eds. Progress in Statistics (North-Holland, Amsterdam), 241–266. Google Scholar
  • Guha S, Munagala K (2007) Approximation algorithms for partial-information based stochastic control with markovian rewards. 48th Annual IEEE Sympos. Foundations Comput. Sci. (IEEE, Piscataway, NJ), 483–493.Google Scholar
  • Hannan J (1957) Approximation to Bayes Risk in Repeated Play, Contributions to the Theory of Games, vol. 3. (Princeton University Press, Princeton, NJ).Google Scholar
  • Hazan E, Kale S (2011) Better algorithms for benign bandits. J. Machine Learn. Res. 12:1287–1311.Google Scholar
  • Jadbabaie A, Rakhlin A, Shahrampour S, Sridharan K (2015) Online optimization: Competing with dynamic comparators. Lebanon G, Vishwanathan SVN, eds. 18th Internat. Conf. Artificial Intelligence Statist., San Diego.Google Scholar
  • Karnin Z, Anava O (2016) Multi-armed bandits: Competing with optimal sequences. Proc. Adv. Neural Inform. Processing Systems, 199–207.Google Scholar
  • Kleinberg R, Leighton T (2003) The value of knowing a demand curve: Bounds on regret for online posted-price auctions. Proc. 44th Annual IEEE Sympos. Foundations Comput. Sci. (IEEE, Piscataway, NJ), 594–605.Google Scholar
  • Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Google Scholar
  • Levine N, Crammer K, Mannor S (2017) Rotting bandits. Proc. Adv. Neural Inform. Processing Systems, 3074–3083.Google Scholar
  • Luo H, Wei C, Agarwal A, Langford J (2018). Efficient contextual bandits in nonstationary worlds. 31st Conf Learn Theory.Google Scholar
  • Ortner R, Ryabko D, Auer P, Munos R (2014) Regret bounds for restless markov bandits. Theoretical Comp. Sci. 558(November):62–76.Google Scholar
  • Pandey S, Agarwal D, Chakrabarti D, Josifovski V (2007) Bandits for taxonomies: A model-based approach. Proc. 2007 SIAM Internat. Conf. Data Mining, Minneapolis.Google Scholar
  • Papadimitriou CH, Tsitsiklis JN (1994). The complexity of optimal queueing network control. Structure Complexity Theory Conf., 318–322.Google Scholar
  • Robbins H (1952) Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. (N.S.). 58(5):527–535.Google Scholar
  • Slivkins A (2014) Contextual bandits with similarity information. J. Machine Learn. Res. 15(1):2533–2568.Google Scholar
  • Slivkins A, Upfal E (2008). Adapting to a changing environment: The Brownian restless bandits. Proc. 21st Annual Conf. Learn. Theory, 343–354.Google Scholar
  • Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika. 25(3):285–294.Google Scholar
  • Wei C, Hong Y, Lu C (2016) Tracking the best expert in non-stationary stochastic environments. Proc. Adv. Neural Inform. Processing Systems, 3972–3980.Google Scholar
  • Whittle P (1981) Arm acquiring bandits. Ann. Probab. 9(2):284–292.Google Scholar
  • Whittle P (1988) Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 25A:287–298.Google Scholar
  • Zelen M (1969) Play the winner rule and the controlled clinical trials. J. Amer. Statist. Assoc. 64(325):131–146.Google Scholar
  • Zhang L, Yang T, Zhou Z (2018). Dynamic regret of strongly adaptive methods. Proc. 35th Internat. Conf. Machine Learn., 5877–5886.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.