Bandits atop Reinforcement Learning: Tackling Online Inventory Models with Cyclic Demands

Published Online: https://doi.org/10.1287/mnsc.2023.4947

References

  • Abbasi-Yadkori Y, Bartlett PL, Kanade V, Seldin Y, Szepesvári C (2013) Online learning in Markov decision processes with adversarially chosen transition probability distributions. Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Inc., Red Hook, NY).
  • Agrawal S, Jia R (2022) Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. Oper. Res. 70(3):1646–1664.
  • Aviv Y, Federgruen A (1997) Stochastic inventory models with limited production capacity and periodically varying parameters. Probability Engrg. Inform. Sci. 11(1):107–135.
  • Balseiro SR, Golrezaei N, Mahdian M, Mirrokni VS, Schneider J (2019) Contextual bandits with cross-learning. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Inc., Red Hook, NY).
  • Chatwin RE (1998) Multiperiod airline overbooking with a single fare class. Oper. Res. 46(6):805–819.
  • Chen B (2021) Production Oper. Management 30(5):1365–1385.
  • Chen B, Shi C (2019) Tailored base-surge policies in dual-sourcing inventory systems with demand learning. Preprint, submitted September 27, https://dx.doi.org/10.2139/ssrn.3456834.
  • Cheung WC, Simchi-Levi D, Zhu R (2020) Reinforcement learning for non-stationary Markov decision processes: The blessing of (more) optimism. Daumé III H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn. Proceedings of Machine Learning Research Series, vol. 119 (PMLR, New York), 1843–1854.
  • Dann C, Mansour Y, Mohri M, Sekhari A, Sridharan K (2020) Reinforcement learning with feedback graphs. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 16868–16878.
  • Davoodi M, Katehakis MN, Yang J (2022) Dynamic inventory control with fixed setup costs and unknown discrete demand distribution. Oper. Res. 70(3):1560–1576.
  • Dong S, Van Roy B, Zhou Z (2019) Provably efficient reinforcement learning with aggregated states. Preprint, submitted December 13, https://doi.org/10.48550/arXiv.1912.06366.
  • Ehrenthal J, Honhon D, Van Woensel T (2014) Demand seasonality in retail inventory management. Eur. J. Oper. Res. 238(2):527–539.
  • Huh WT, Rusmevichientong P (2009a) A nonparametric asymptotic analysis of inventory planning with censored demand. Math. Oper. Res. 34(1):103–123.
  • Huh WT, Rusmevichientong P (2009b) A nonparametric asymptotic analysis of inventory planning with censored demand. Math. Oper. Res. 34(1):103–123.
  • Huh WT, Rusmevichientong P (2014) Online sequential optimization with biased gradients: Theory and applications to censored demand. INFORMS J. Comput. 26(1):150–159.
  • Huh WT, Janakiraman G, Muckstadt JA, Rusmevichientong P (2009) An adaptive algorithm for finding the optimal base-stock policy in lost sales inventory systems with censored demand. Math. Oper. Res. 34(2):397–416.
  • Huh WT, Levi R, Rusmevichientong P, Orlin JB (2011) Adaptive data-driven inventory control with censored demand based on Kaplan-Meier estimator. Oper. Res. 59(4):929–941.
  • Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., Red Hook, NY).
  • Kaggle (2015) Rossmann store sales. Accessed August 15, 2020, https://www.kaggle.com/c/rossmann-store-sales/overview.
  • Karlin S (1960) Optimal policy for dynamic inventory process with stochastic demands subject to seasonal variations. J. Soc. Industrial Appl. Math. 8(4):611–629.
  • Lim V (2016) How poor inventory management ruined Target Canada. Accessed April 10, 2020, https://www.tradegecko.com/blog/inventory-management/how-poor-inventory-management-ruined-target-canada.
  • Markowitz H (1952) Portfolio selection. J. Finance 7(1):77–91.
  • Perakis G, Roels G (2008) Regret in the newsvendor model with partial information. Oper. Res. 56(1):188–203.
  • Porteus E (2002) Foundations of Stochastic Inventory Theory (Stanford University Press, Stanford, CA).
  • Sidford A, Wang M, Wu X, Yang LF, Ye Y (2018) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., Red Hook, NY), 5192–5202.
  • Sinclair S, Banerjee S, Yu C (2019) Adaptive discretization for episodic reinforcement learning in metric spaces. Proc. ACM on Measurement and Analysis of Comput. Systems (ACM, New York), 1–44.
  • Slivkins A (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learn. 12(1–2):1–286.
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction, 2nd ed. (MIT Press, Cambridge, MA).
  • Watkins C, Dayan P (1992) Technical note: Q-learning. Machine Learn. 8:279–292.
  • Yuan H, Luo Q, Shi C (2021) Marrying stochastic gradient descent with bandits: Learning algorithms for inventory systems with fixed costs. Management Sci. 67(10):6089–6115.
  • Zhang H, Chao X, Shi C (2020) Closing the gap: A learning algorithm for lost-sales inventory systems with lead times. Management Sci. 66(5):1962–1980.
  • Zhao H, Chen W (2019) Stochastic one-sided full-information bandit. Proc. Eur. Conf. on Machine Learn. and Principles and Practice of Knowledge Discovery in Databases (Springer, Cham), 150–166.
  • Zipkin P (1989) Critical number policies for inventory models with periodic data. Management Sci. 35(1):71–80.
  • Zipkin P (2000) Foundations of Inventory Management (McGraw-Hill, New York).