Technical Note—On Adaptivity in Nonstationary Stochastic Optimization with Bandit Feedback

Yining Wang
Yining Wang
[email protected]
https://orcid.org/0000-0001-9410-0392
Naveen Jindal School of Management, University of Texas at Dallas, Richardson, Texas 75080
Search for more papers by this author

Naveen Jindal School of Management, University of Texas at Dallas, Richardson, Texas 75080

Search for more papers by this author

Published Online:31 Jul 2023https://doi.org/10.1287/opre.2022.0576

References

Agarwal A, Dekel O, Xiao L (2010) Optimal algorithms for online convex optimization with multi-point bandit feedback. Tauman Kalai A, Mohri M, eds. Proc. 23rd Annual Conf. Learning Theory (COLT 2010) (OmniPress, Haifa, Israel), 28–40.Google Scholar
Agarwal A, Foster DP, Hsu D, Kakade SM, Rakhlin A (2013) Stochastic convex optimization with bandit feedback. SIAM J. Optim. 23(1):213–240.Crossref, Google Scholar
Agrawal S, Jia R (2022) Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. Oper. Res. 70(3):1646–1664.Google Scholar
Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
Azuma K (1967) Weighted sums of certain dependent random variables. Tohoku Math. J. (2) 19(3):357–367.Crossref, Google Scholar
Baby D, Wang Y-X (2022) Optimal dynamic regret in proper online learning with strongly convex losses and beyond. Camps-Valls G, Ruiz FJR, Valera I, eds. Proc. 25th Internat. Conf. Artificial Intelligence Statist. (AISTATS 2022) (Proceedings of Machine Learning Research, New York), 1805–1845.Google Scholar
Benveniste A, Métivier M, Priouret P (2012) Adaptive Algorithms and Stochastic Approximations, vol. 22 (Springer Science & Business Media, Berlin).Google Scholar
Besbes O, Gur Y, Zeevi A (2015) Non-stationary stochastic optimization. Oper. Res. 63(5):1227–1244.Link, Google Scholar
Beygelzimer A, Langford J, Li L, Reyzin L, Schapire R (2011) Contextual bandit algorithms with supervised learning guarantees. Gordon G, Dunson D, Dudik M, eds. Proc. 14th Internat. Conf. Artificial Intelligence and Statist. (AISTATS 2011) (Proceedings of Machine Learning Research, New York), 19–26.Google Scholar
Bubeck S, Cesa-Bianchi N, Kakade SM (2012) Toward minimax policies for online linear optimization with bandit feedback. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Annual Conf. Learning Theory (COLT 2012) (Proceedings of Machine Learning Research, New York), 41.1–41.14.Google Scholar
Bubeck S, Eldan R, Lee YT (2021) Kernel-based methods for bandit convex optimization. J. ACM 68(4):1–35.Crossref, Google Scholar
Cheung WC, Simchi-Levi D, Zhu R (2019) Non-stationary reinforcement learning: The blessing of (more) optimism. Preprint, submitted June 7, https://arxiv.org/abs/1906.02922.Google Scholar
Cheung WC, Simchi-Levi D, Zhu R (2022) Hedging the drift: Learning to optimize under nonstationarity. Management Sci. 68(3):1696–1713.Link, Google Scholar
Combes R, Talebi MS, Proutiere A, Lelarge M (2015) Combinatorial bandits revisited. Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R, eds. Proc. 29th Conf. Neural Inform. Processing Systems (Advances in Neural Information Processing Systems, San Diego, CA), 2116–2124.Google Scholar
Flaxman AD, Kalai AT, McMahan HB (2005) Online convex optimization in the bandit setting: Gradient descent without a gradient. Buchsbaum A, ed. Proc. 16th Annual ACM-SIAM Sympos. Discrete Algorithms (SODA 2005) (Association for Computing Machinery, Inc., New York), 385–394.Google Scholar
Hazan E, Seshadhri C (2009) Efficient learning algorithms for changing environments. Bottou L, Littman M, eds. Proc. 26th Internat. Conf. Machine Learn. (ICML 2009) (ICML, Madison, WI), 393–400.Google Scholar
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58(301):13–30.Google Scholar
Immorlica N, Sankararaman K, Schapire R, Slivkins A (2022) Adversarial bandits with knapsacks. J. ACM 69(6):1–47.Crossref, Google Scholar
Lai TL (2003) Stochastic approximation. Ann. Statist. 31(2):391–406.Crossref, Google Scholar
Lattimore T (2020) Improved regret for zeroth-order adversarial bandit convex optimisation. Math. Statist. Learn. 2(3):311–334.Crossref, Google Scholar
Lattimore T, Gyorgy A (2021) Improved regret for zeroth-order stochastic convex bandits. Belkin M, Kpotufe S, eds. Proc. 34th Annual Conf. Learning Theory (COLT 2021) (Proceedings of Machine Learning Research, New York), 2938–2964.Google Scholar
Lu S, Zhou Y-H, Shi J-C, Zhu W, Yu Q, Chen Q-G, Da Q, Zhang L (2022) Non-stationary continuum-armed bandits for online hyperparameter optimization. Akoglu L, Dong XL, Tang J, eds. Proc. 15th ACM Internat. WSDM Conf. (WSDM 2022) (Association for Computing Machinery, New York), 618–627.Google Scholar
Mao W, Zhang K, Zhu R, Simchi-Levi D, Başar T (2020) Model-free non-stationary RL: Near-optimal regret and applications in multi-agent RL and inventory control. Preprint, submitted October 7, https://arxiv.org/abs/2010.03161.Google Scholar
Nemirovskij AS, Yudin DB (1983) Problem complexity and method efficiency in optimization.Google Scholar
Saha A, Koren T, Mansour Y (2021) Adversarial dueling bandits. Florina Balcan M, Zhu J, Arora R, eds. Proc. 38th Internat. Conf. Machine Learn. (ICML 2021) (Proceedings of Machine Learning Research, New York), 9235–9244.Google Scholar
Schmidt M, Roux N, Bach F (2011) Convergence rates of inexact proximal-gradient methods for convex optimization. Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Proc. 24th Conf. Neural Inform. Processing Systems (Advances in Neural Information Processing Systems, San Diego), 1458–1466.Google Scholar
Wang Y, Du S, Balakrishnan S, Singh A (2018) Stochastic zeroth-order optimization in high dimensions. Storkey A, Perez-Cruz F, eds. Proc. 21st Internat. Conf. Artificial Intelligence Statist. (AISTATS 2018) (Proceedings of Machine Learning Research, New York), 1356–1365.Google Scholar
Wei C-Y, Luo H (2021) Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Belkin M, Kpotufe S, eds. Proc. 34th Annual Conf. Learn. Theory (COLT 2021) (Proceedings of Machine Learning Research, New York), 4300–4354.Google Scholar
Zhang L, Lu S, Yang T (2020) Minimizing dynamic regret and adaptive regret simultaneously. Chiappa S, Calandra R, eds. Proc. 23rd Internat. Conf. Artificial Intelligence Statist. (AISTATS 2020) (Proceedings of Machine Learning Research, New York), 309–319.Google Scholar
Zhang L, Yang T, Zhou Z-H (2018) Dynamic regret of strongly adaptive methods. Bach F, Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. (ICML 2018) (Proceedings of Machine Learning Research, New York), 5882–5891.Google Scholar
Zhao P, Wang Y-X, Zhou Z-H (2022) Non-stationary online learning with memory and non-stochastic control. Camps-Valls G, Ruiz FJR, Valera I, eds. Proc. 25th Internat. Conf. Artificial Intelligence Statist. (AISTATS 2022) (Proceedings of Machine Learning Research, New York), 1805–1845.Google Scholar
Zhao P, Zhang Y-J, Zhang L, Zhou Z-H (2020) Dynamic regret of convex and smooth functions. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Proc. 34th Conf. Neural Inform. Processing Systems, vol. 33 (Neural Information Processing Systems, San Diego), 12510–12520.Google Scholar

Volume 73, Issue 2

March-April 2025

Pages iii-viii, 583-1150, C2-C3

Article Information

Metrics

Information

Received:October 28, 2022
Accepted:June 12, 2023
Published Online:July 31, 2023

Cite as

Yining Wang (2023) Technical Note—On Adaptivity in Nonstationary Stochastic Optimization with Bandit Feedback. Operations Research 73(2):819-828.

https://doi.org/10.1287/opre.2022.0576

Keywords

Acknowledgments

The author thanks the department editor, the associate editor, and two anonymous referees for offering valuable comments and suggestions that greatly improved the paper. The author also thanks Haipeng Luo and Chen-Yu Wei for providing helpful comments and suggestions that improved the positioning and exposition of this paper.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Technical Note—On Adaptivity in Nonstationary Stochastic Optimization with Bandit Feedback

References

Volume 73, Issue 2

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News