Offline Planning and Online Learning Under Recovering Rewards

David Simchi-Levi
https://orcid.org/0000-0002-4650-1519
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139; Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139; Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Zeyu Zheng
https://orcid.org/0000-0001-5653-152X
Department of Industrial Engineering and Operations Research, University of California, Berkeley, Berkeley, California 94709

Feng Zhu (Corresponding Author)
https://orcid.org/0000-0003-4979-4879
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Published Online: 3 Apr 2024
https://doi.org/10.1287/mnsc.2021.04202

Volume 71, Issue 1

January 2025

Pages iv-vi, 1-953, vii-vii

Article Information

Received: December 21, 2021
Accepted: August 26, 2023
Published Online: April 03, 2024

Cite as

David Simchi-Levi, Zeyu Zheng, Feng Zhu (2024) Offline Planning and Online Learning Under Recovering Rewards. Management Science 71(1):298-317.

https://doi.org/10.1287/mnsc.2021.04202

Acknowledgments

The authors gratefully acknowledge the department editor, the associate editor, and the reviewers for their time and comments that greatly helped improve the manuscript. The work of David Simchi-Levi and Feng Zhu is partially supported by the MIT Data Science Laboratory. The authors are listed in alphabetical order.
