Technical Note—On Adaptivity in Nonstationary Stochastic Optimization with Bandit Feedback

Published Online:https://doi.org/10.1287/opre.2022.0576

References

  • Agarwal A, Dekel O, Xiao L (2010) Optimal algorithms for online convex optimization with multi-point bandit feedback. Tauman Kalai A, Mohri M, eds. Proc. 23rd Annual Conf. Learning Theory (COLT 2010) (OmniPress, Haifa, Israel), 28–40.Google Scholar
  • Agarwal A, Foster DP, Hsu D, Kakade SM, Rakhlin A (2013) Stochastic convex optimization with bandit feedback. SIAM J. Optim. 23(1):213–240.CrossrefGoogle Scholar
  • Agrawal S, Jia R (2022) Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. Oper. Res. 70(3):1646–1664.Google Scholar
  • Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.CrossrefGoogle Scholar
  • Azuma K (1967) Weighted sums of certain dependent random variables. Tohoku Math. J. (2) 19(3):357–367.CrossrefGoogle Scholar
  • Baby D, Wang Y-X (2022) Optimal dynamic regret in proper online learning with strongly convex losses and beyond. Camps-Valls G, Ruiz FJR, Valera I, eds. Proc. 25th Internat. Conf. Artificial Intelligence Statist. (AISTATS 2022) (Proceedings of Machine Learning Research, New York), 1805–1845.Google Scholar
  • Benveniste A, Métivier M, Priouret P (2012) Adaptive Algorithms and Stochastic Approximations, vol. 22 (Springer Science & Business Media, Berlin).Google Scholar
  • Besbes O, Gur Y, Zeevi A (2015) Non-stationary stochastic optimization. Oper. Res. 63(5):1227–1244.LinkGoogle Scholar
  • Beygelzimer A, Langford J, Li L, Reyzin L, Schapire R (2011) Contextual bandit algorithms with supervised learning guarantees. Gordon G, Dunson D, Dudik M, eds. Proc. 14th Internat. Conf. Artificial Intelligence and Statist. (AISTATS 2011) (Proceedings of Machine Learning Research, New York), 19–26.Google Scholar
  • Bubeck S, Cesa-Bianchi N, Kakade SM (2012) Toward minimax policies for online linear optimization with bandit feedback. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Annual Conf. Learning Theory (COLT 2012) (Proceedings of Machine Learning Research, New York), 41.1–41.14.Google Scholar
  • Bubeck S, Eldan R, Lee YT (2021) Kernel-based methods for bandit convex optimization. J. ACM 68(4):1–35.CrossrefGoogle Scholar
  • Cheung WC, Simchi-Levi D, Zhu R (2019) Non-stationary reinforcement learning: The blessing of (more) optimism. Preprint, submitted June 7, https://arxiv.org/abs/1906.02922.Google Scholar
  • Cheung WC, Simchi-Levi D, Zhu R (2022) Hedging the drift: Learning to optimize under nonstationarity. Management Sci. 68(3):1696–1713.LinkGoogle Scholar
  • Combes R, Talebi MS, Proutiere A, Lelarge M (2015) Combinatorial bandits revisited. Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R, eds. Proc. 29th Conf. Neural Inform. Processing Systems (Advances in Neural Information Processing Systems, San Diego, CA), 2116–2124.Google Scholar
  • Flaxman AD, Kalai AT, McMahan HB (2005) Online convex optimization in the bandit setting: Gradient descent without a gradient. Buchsbaum A, ed. Proc. 16th Annual ACM-SIAM Sympos. Discrete Algorithms (SODA 2005) (Association for Computing Machinery, Inc., New York), 385–394.Google Scholar
  • Hazan E, Seshadhri C (2009) Efficient learning algorithms for changing environments. Bottou L, Littman M, eds. Proc. 26th Internat. Conf. Machine Learn. (ICML 2009) (ICML, Madison, WI), 393–400.Google Scholar
  • Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58(301):13–30.Google Scholar
  • Immorlica N, Sankararaman K, Schapire R, Slivkins A (2022) Adversarial bandits with knapsacks. J. ACM 69(6):1–47.CrossrefGoogle Scholar
  • Lai TL (2003) Stochastic approximation. Ann. Statist. 31(2):391–406.CrossrefGoogle Scholar
  • Lattimore T (2020) Improved regret for zeroth-order adversarial bandit convex optimisation. Math. Statist. Learn. 2(3):311–334.CrossrefGoogle Scholar
  • Lattimore T, Gyorgy A (2021) Improved regret for zeroth-order stochastic convex bandits. Belkin M, Kpotufe S, eds. Proc. 34th Annual Conf. Learning Theory (COLT 2021) (Proceedings of Machine Learning Research, New York), 2938–2964.Google Scholar
  • Lu S, Zhou Y-H, Shi J-C, Zhu W, Yu Q, Chen Q-G, Da Q, Zhang L (2022) Non-stationary continuum-armed bandits for online hyperparameter optimization. Akoglu L, Dong XL, Tang J, eds. Proc. 15th ACM Internat. WSDM Conf. (WSDM 2022) (Association for Computing Machinery, New York), 618–627.Google Scholar
  • Mao W, Zhang K, Zhu R, Simchi-Levi D, Başar T (2020) Model-free non-stationary RL: Near-optimal regret and applications in multi-agent RL and inventory control. Preprint, submitted October 7, https://arxiv.org/abs/2010.03161.Google Scholar
  • Nemirovskij AS, Yudin DB (1983) Problem complexity and method efficiency in optimization.Google Scholar
  • Saha A, Koren T, Mansour Y (2021) Adversarial dueling bandits. Florina Balcan M, Zhu J, Arora R, eds. Proc. 38th Internat. Conf. Machine Learn. (ICML 2021) (Proceedings of Machine Learning Research, New York), 9235–9244.Google Scholar
  • Schmidt M, Roux N, Bach F (2011) Convergence rates of inexact proximal-gradient methods for convex optimization. Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Proc. 24th Conf. Neural Inform. Processing Systems (Advances in Neural Information Processing Systems, San Diego), 1458–1466.Google Scholar
  • Wang Y, Du S, Balakrishnan S, Singh A (2018) Stochastic zeroth-order optimization in high dimensions. Storkey A, Perez-Cruz F, eds. Proc. 21st Internat. Conf. Artificial Intelligence Statist. (AISTATS 2018) (Proceedings of Machine Learning Research, New York), 1356–1365.Google Scholar
  • Wei C-Y, Luo H (2021) Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Belkin M, Kpotufe S, eds. Proc. 34th Annual Conf. Learn. Theory (COLT 2021) (Proceedings of Machine Learning Research, New York), 4300–4354.Google Scholar
  • Zhang L, Lu S, Yang T (2020) Minimizing dynamic regret and adaptive regret simultaneously. Chiappa S, Calandra R, eds. Proc. 23rd Internat. Conf. Artificial Intelligence Statist. (AISTATS 2020) (Proceedings of Machine Learning Research, New York), 309–319.Google Scholar
  • Zhang L, Yang T, Zhou Z-H (2018) Dynamic regret of strongly adaptive methods. Bach F, Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. (ICML 2018) (Proceedings of Machine Learning Research, New York), 5882–5891.Google Scholar
  • Zhao P, Wang Y-X, Zhou Z-H (2022) Non-stationary online learning with memory and non-stochastic control. Camps-Valls G, Ruiz FJR, Valera I, eds. Proc. 25th Internat. Conf. Artificial Intelligence Statist. (AISTATS 2022) (Proceedings of Machine Learning Research, New York), 1805–1845.Google Scholar
  • Zhao P, Zhang Y-J, Zhang L, Zhou Z-H (2020) Dynamic regret of convex and smooth functions. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Proc. 34th Conf. Neural Inform. Processing Systems, vol. 33 (Neural Information Processing Systems, San Diego), 12510–12520.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.