Bandits atop Reinforcement Learning: Tackling Online Inventory Models with Cyclic Demands
Published Online: 26 Oct 2023 https://doi.org/10.1287/mnsc.2023.4947
References
- (2013) Online learning in Markov decision processes with adversarially chosen transition probability distributions. Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Inc., Red Hook, NY).
- (2022) Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. Oper. Res. 70(3):1646–1664.
- (1997) Stochastic inventory models with limited production capacity and periodically varying parameters. Probability Engrg. Inform. Sci. 11(1):107–135.
- (2019) Contextual bandits with cross-learning. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Inc., Red Hook, NY).
- (1998) Multiperiod airline overbooking with a single fare class. Oper. Res. 46(6):805–819.
- (2021) Production Oper. Management 30(5):1365–1385.
- (2019) Tailored base-surge policies in dual-sourcing inventory systems with demand learning. Preprint, submitted September 27, https://dx.doi.org/10.2139/ssrn.3456834.
- (2020) Reinforcement learning for non-stationary Markov decision processes: The blessing of (more) optimism. Daumé III H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn. Proceedings of Machine Learning Research Series, vol. 119 (PMLR, New York), 1843–1854.
- (2020) Reinforcement learning with feedback graphs. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin Hm, eds. Advances in Neural Information Processing Systems (Curran Associates, Inc., Red Hook, NY), 16868–16878.
- (2022) Dynamic inventory control with fixed setup costs and unknown discrete demand distribution. Oper. Res. 70(3):1560–1576.
- (2019) Provably efficient reinforcement learning with aggregated states. Preprint, submitted December 13, https://doi.org/10.48550/arXiv.1912.06366.
- (2014) Demand seasonality in retail inventory management. Eur. J. Oper. Res. 238(2):527–539.
- (2009a) A nonparametric asymptotic analysis of inventory planning with censored demand. Math. Oper. Res. 34(1):103–123.
- (2009b) A nonparametric asymptotic analysis of inventory planning with censored demand. Math. Oper. Res. 34(1):103–123.
- (2014) Online sequential optimization with biased gradients: Theory and applications to censored demand. INFORMS J. Comput. 26(1):150–159.
- (2009) An adaptive algorithm for finding the optimal base-stock policy in lost sales inventory systems with censored demand. Math. Oper. Res. 34(2):397–416.
- (2011) Adaptive data-driven inventory control with censored demand based on Kaplan-Meier estimator. Oper. Res. 59(4):929–941.
- (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., Red Hook, NY).
- Kaggle (2015) Rossmann store sales. Accessed August 15, 2020, https://www.kaggle.com/c/rossmann-store-sales/overview.
- (1960) Optimal policy for dynamic inventory process with stochastic demands subject to seasonal variations. J. Soc. Industrial Appl. Math. 8(4):611–629.
- (2016) How poor inventory management ruined Target Canada. Accessed April 10, 2020, https://www.tradegecko.com/blog/inventory-management/how-poor-inventory-management-ruined-target-canada.
- (1952) Portfolio selection. J. Finance 7(1):77–91.
- (2008) Regret in the newsvendor model with partial information. Oper. Res. 56(1):188–203.
- (2002) Foundations of Stochastic Inventory Theory (Stanford University Press, Stanford, CA).
- (2018) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., Red Hook, NY), 5192–5202.
- (2019) Adaptive discretization for episodic reinforcement learning in metric spaces. Proc. ACM on Measurement and Analysis of Comput. Systems (ACM, New York), 1–44.
- (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learn. 12(1–2):1–286.
- (2018) Reinforcement Learning: An Introduction, 2nd ed. (MIT Press, Cambridge, MA).
- (1992) Technical note: Q-learning. Machine Learn. 8:279–292.
- (2021) Marrying stochastic gradient descent with bandits: Learning algorithms for inventory systems with fixed costs. Management Sci. 67(10):6089–6115.
- (2020) Closing the gap: A learning algorithm for lost-sales inventory systems with lead times. Management Sci. 66(5):1962–1980.
- (2019) Stochastic one-sided full-information bandit. Proc. Eur. Conf. on Machine Learn. and Principles and Practice of Knowledge Discovery in Databases (Springer, Cham), 150–166.
- (1989) Critical number policies for inventory models with periodic data. Management Sci. 35(1):71–80.
- (2000) Foundations of Inventory Management (McGraw-Hill, New York).

