Deep Policy Iteration with Integer Programming for Inventory Management
Published Online: 6 Jan 2025 https://doi.org/10.1287/msom.2022.0617

