Optimal Exploration–Exploitation in a Multi-armed Bandit Problem with Non-stationary Rewards
Published Online:31 Oct 2019https://doi.org/10.1287/stsy.2019.0033
References
- (2009) Minimax policies for adversarial and stochastic bandits. Proc. 22nd Annual Conf. Learn. Theory (COLT), Montreal.Google Scholar
- (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–246.Google Scholar
- (2002b) The non-stochastic multi-armed bandit problem. SIAM J. Comput. 32(1):48–77.Google Scholar
- (2014). Online stochastic optimization under correlated bandit feedback. Proc. 31st Internat. Conf. Machine Learn., Beijing.Google Scholar
- (1996) Learning and strategic pricing. Econometrica: J. Econometric Soc. 64(5):1125–1149.Google Scholar
- (2005) The financing of innovation: Learning and stopping. RAND J. Econom. 36(4):719–752.Google Scholar
- (1985) Bandit Problems: Sequential Allocation of Experiments (Chapman and Hall, London).Google Scholar
- (2000) Restless bandits, linear programming relaxations, and primal dual index heuristic. Oper. Res. 48(1):80–90.Link, Google Scholar
- (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, eds. Proc. 27th Internat. Conf. Neural Inform. Processing Systems, vol. 1 (MIT Press, Cambridge, MA), 199–207.Google Scholar
- (2015) Non-stationary stochastic optimization. Oper. Res. 63(5):1227–1244.Link, Google Scholar
- (1956) An analog of the minimax theorem for vector payoffs. Pacific J. Math. 6(1):1–8.Google Scholar
- (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Google Scholar
- (2019) Nearly optimal adaptive procedure for piecewise-stationary bandit: A change-point detection approach. 22nd Internat. Conf. Artificial Intelligence Statist., Okinawa, Japan.Google Scholar
- (2007) Dynamic assortment with demand learning for seasonal consumer goods. Management Sci. 53(2):276–292.Google Scholar
- (2006) Prediction, Learning, and Games (Cambridge University Press, Cambridge, UK).Google Scholar
- (2019). Learning to optimize under non-stationarity. 22nd Internat. Conf. Artificial Intelligence Statist., Okinawa, Japan.Google Scholar
- (1999) Regret in the on-line decision problem. Games Econom. Behav. 29(1–2):7–35.Google Scholar
- (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55(1):119–139.Google Scholar
- (2011) On upper-confidence bound policies for switching bandit problems. Kivinen J, Szepesvari C, Ukkonen E, Zeugmann T, eds. Algorithmic learning theory, Lecture Notes in Computer Science, vol. 6925 (Springer, Berlin, Heidelberg), 174–188.Google Scholar
- (1979) Bandit processes and dynamic allocation indices (with discussion). J. Roy. Statist. Soc. B. 41(2):148–177.Google Scholar
- (1989) Multi-Armed Bandit Allocation Indices (John Wiley & Sons, New York).Google Scholar
- (1974) A dynamic allocation index for the sequential design of experiments. Gani J, Sarkadi K, Vincze I, eds. Progress in Statistics (North-Holland, Amsterdam), 241–266. Google Scholar
- (2007) Approximation algorithms for partial-information based stochastic control with markovian rewards. 48th Annual IEEE Sympos. Foundations Comput. Sci. (IEEE, Piscataway, NJ), 483–493.Google Scholar
- (1957) Approximation to Bayes Risk in Repeated Play, Contributions to the Theory of Games, vol. 3. (Princeton University Press, Princeton, NJ).Google Scholar
- (2011) Better algorithms for benign bandits. J. Machine Learn. Res. 12:1287–1311.Google Scholar
- (2015) Online optimization: Competing with dynamic comparators. Lebanon G, Vishwanathan SVN, eds. 18th Internat. Conf. Artificial Intelligence Statist., San Diego.Google Scholar
- (2016) Multi-armed bandits: Competing with optimal sequences. Proc. Adv. Neural Inform. Processing Systems, 199–207.Google Scholar
- (2003) The value of knowing a demand curve: Bounds on regret for online posted-price auctions. Proc. 44th Annual IEEE Sympos. Foundations Comput. Sci. (IEEE, Piscataway, NJ), 594–605.Google Scholar
- (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Google Scholar
- (2017) Rotting bandits. Proc. Adv. Neural Inform. Processing Systems, 3074–3083.Google Scholar
- (2018). Efficient contextual bandits in nonstationary worlds. 31st Conf Learn Theory.Google Scholar
- (2014) Regret bounds for restless markov bandits. Theoretical Comp. Sci. 558(November):62–76.Google Scholar
- (2007) Bandits for taxonomies: A model-based approach. Proc. 2007 SIAM Internat. Conf. Data Mining, Minneapolis.Google Scholar
- (1994). The complexity of optimal queueing network control. Structure Complexity Theory Conf., 318–322.Google Scholar
- (1952) Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. (N.S.). 58(5):527–535.Google Scholar
- (2014) Contextual bandits with similarity information. J. Machine Learn. Res. 15(1):2533–2568.Google Scholar
- (2008). Adapting to a changing environment: The Brownian restless bandits. Proc. 21st Annual Conf. Learn. Theory, 343–354.Google Scholar
- (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika. 25(3):285–294.Google Scholar
- (2016) Tracking the best expert in non-stationary stochastic environments. Proc. Adv. Neural Inform. Processing Systems, 3972–3980.Google Scholar
- (1981) Arm acquiring bandits. Ann. Probab. 9(2):284–292.Google Scholar
- (1988) Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 25A:287–298.Google Scholar
- (1969) Play the winner rule and the controlled clinical trials. J. Amer. Statist. Assoc. 64(325):131–146.Google Scholar
- (2018). Dynamic regret of strongly adaptive methods. Proc. 35th Internat. Conf. Machine Learn., 5877–5886.Google Scholar

