Deep Policy Iteration with Integer Programming for Inventory Management

Published Online: https://doi.org/10.1287/msom.2022.0617

References

  • Achiam J (2018) Spinning up in deep reinforcement learning. https://github.com/openai/spinningup.
  • Agrawal S, Jia R (2019) Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. Proc. ACM Conf. Econom. Comput. (Association for Computing Machinery (ACM), New York), 743–744.
  • Agarwal R, Schwarzer M, Castro PS, Courville AC, Bellemare M (2021) Deep reinforcement learning at the edge of the statistical precipice. Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW, eds. Advances in Neural Information Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 29304–29320.
  • Allon G, Van Mieghem JA (2010) Global dual sourcing: Tailored base-surge allocation to near- and offshore production. Management Sci. 56(1):110–124.
  • Anderson R, Huchette J, Ma W, Tjandraatmadja C, Vielma JP (2020) Strong mixed-integer programming formulations for trained neural networks. Math. Programming 183(1):3–39.
  • Bansal S, Nagarajan M (2022) A Monge sequence-based approach to characterize the competitive newsvendor problem. Oper. Res. 70(2):805–814.
  • Bartlett PL, Harvey N, Liaw C, Mehrabian A (2019) Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Machine Learn. Res. 20(63):1–17.
  • Bertsekas D (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).
  • Bertsekas D (2017) Dynamic Programming and Optimal Control: Volume I and II (Athena Scientific, Belmont, MA).
  • Bolusani S, Besançon M, Bestuzheva K, Chmiela A, Dionísio J, Donkiewicz T, van Doornmalen J, et al. (2024) The SCIP Optimization Suite 9.0. Accessed October 7, 2024, https://optimization-online.org/2024/02/the-scip-optimization-suite-9-0/.
  • Boute RN, Gijsbrechts J, van Jaarsveld W, Vanvuchelen N (2022) Deep reinforcement learning for inventory control: A roadmap. Eur. J. Oper. Res. 298(2):401–412.
  • Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI Gym. Preprint, submitted June 5, https://arxiv.org/abs/1606.01540.
  • Caro F, Gallien J (2010) Inventory management of a fast-fashion retail network. Oper. Res. 58(2):257–273.
  • Clark AJ, Scarf H (1960) Optimal policies for a multi-echelon inventory problem. Management Sci. 6(4):475–490.
  • de Kok T, Grob C, Laumanns M, Minner S, Rambau J, Schade K (2018) A typology and literature review on stochastic multi-echelon inventory models. Eur. J. Oper. Res. 269(3):955–983.
  • De Moor BJ, Gijsbrechts J, Boute RN (2022) Reward shaping to improve the performance of deep reinforcement learning in perishable inventory management. Eur. J. Oper. Res. 301(2):535–545.
  • Delarue A, Anderson R, Tjandraatmadja C (2020) Reinforcement learning with combinatorial actions: An application to vehicle routing. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 609–620.
  • Duan Y, Chen X, Houthooft R, Schulman J, Abbeel P (2016) Benchmarking deep reinforcement learning for continuous control. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn. (Proceedings of Machine Learning Research (PMLR), Cambridge, MA), 1329–1338.
  • Farias VF, Van Roy B (2007) An approximate dynamic programming approach to network revenue management. Accessed October 7, 2024, https://web.mit.edu/vivekf/www/papers/ADP-rm-07-03.pdf.
  • Federgruen A, Zipkin P (1984) Approximations of dynamic, multilocation production and inventory problems. Management Sci. 30(1):69–84.
  • Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. (Proceedings of Machine Learning Research (PMLR), Cambridge, MA), 1587–1596.
  • Giannoccaro I, Pontrandolfo P (2002) Inventory management in supply chains: A reinforcement learning approach. Internat. J. Production Econom. 78(2):153–161.
  • Gijsbrechts J, Boute RN, Van Mieghem JA, Zhang D (2022) Can deep reinforcement learning improve inventory management? Performance on dual sourcing, lost sales and multi-echelon problems. Manufacturing Service Oper. Management 24(3):1349–1368.
  • Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. Gordon G, Dunson D, Dudík M, eds. Proc. 14th Internat. Conf. Artificial Intelligence Statistics (Proceedings of Machine Learning Research (PMLR), Cambridge, MA), 315–323.
  • Goldberg DA, Katz-Rogozhnikov DA, Lu Y, Sharma M, Squillante MS (2016) Asymptotic optimality of constant-order policies for lost sales inventory models with large lead times. Math. Oper. Res. 41(3):898–913.
  • Golowich N, Rakhlin A, Shamir O (2018) Size-independent sample complexity of neural networks. Bubeck S, Perchet V, Rigollet P, eds. Proc. 31st Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 75 (PMLR, New York), 297–299.
  • Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 75 (PMLR, New York), 1861–1870.
  • Hara K, Saito D, Shouno H (2015) Analysis of function of rectified linear unit used in deep learning. Proc. Internat. Joint Conf. Neural Networks (Institute of Electrical and Electronics Engineers (IEEE), Piscataway, NJ), 1–8.
  • Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D (2018) Deep reinforcement learning that matters. McIlraith SA, Weinberger KQ, eds. Proc. 32nd AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 3207–3214.
  • Hinton G, Srivastava N, Swersky K (2012) Lecture 6e-rmsprop: Divide the gradient by a running average of its recent magnitude. Neural Networks Machine Learn. 4(2):26–31.
  • Hubbs CD, Perez HD, Sarwar O, Sahinidis NV, Grossmann IE, Wassick JM (2020) OR-Gym: A reinforcement learning library for operations research problems. Preprint, submitted August 14, https://arxiv.org/abs/2008.06319.
  • Huh WT, Janakiraman G, Muckstadt JA, Rusmevichientong P (2009) Asymptotic optimality of order-up-to policies in lost sales inventory systems. Management Sci. 55(3):404–420.
  • Kim S, Pasupathy R, Henderson SG (2015) A guide to sample average approximation. Handbook of Simulation Optimization (Springer New York, New York), 207–243.
  • Kingma DP, Ba J (2015) Adam: A Method for Stochastic Optimization (ICLR, Ithaca, NY).
  • Kober J, Bagnell JA, Peters J (2013) Reinforcement learning in robotics: A survey. Internat. J. Robotics Res. 32(11):1238–1274.
  • Levi R, Janakiraman G, Nagarajan M (2008) A 2-approximation algorithm for stochastic inventory control models with lost sales. Math. Oper. Res. 33(2):351–374.
  • Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, et al. (2015) Continuous control with deep reinforcement learning. Preprint, submitted September 9, https://arxiv.org/abs/1509.02971.
  • Lougee-Heimer R (2003) The common optimization interface for operations research: Promoting open-source software in the operations research community. IBM J. Res. Development 47(1):57–66.
  • Maei H, Szepesvari C, Bhatnagar S, Precup D, Silver D, Sutton RS (2009) Convergent temporal-difference learning with arbitrary smooth function approximation. Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A, eds. Advances in Neural Information Processing Systems, vol. 22 (Curran Associates, Inc., Red Hook, NY), 1205–1212.
  • Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. Preprint, submitted December 19, https://arxiv.org/abs/1312.5602.
  • Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, et al. (2016) Asynchronous methods for deep reinforcement learning. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn. (Proceedings of Machine Learning Research (PMLR), Cambridge, MA), 1928–1937.
  • Munos R (2003) Error bounds for approximate policy iteration. ICML (AAAI Press, Washington, DC), 560–567.
  • Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(5).
  • Oroojlooyjadid A, Nazari M, Snyder LV, Takáč M (2022) A deep Q-network for the beer game: Deep reinforcement learning for inventory optimization. Manufacturing Service Oper. Management 24(1):285–304.
  • Özer Ö, Xiong H (2008) Stock positioning and performance estimation for distribution systems with service constraints. IIE Trans. 40(12):1141–1157.
  • Pirhooshyaran M, Snyder LV (2020) Simultaneous decision making for stochastic multi-echelon inventory optimization with deep neural networks as decision makers. Preprint, submitted June 10, https://arxiv.org/abs/2006.05608.
  • Powell WB (2007) Approximate Dynamic Programming: Solving the Curses of Dimensionality, vol. 703 (John Wiley & Sons, Hoboken, NJ).
  • Qi M, Shi Y, Qi Y, Ma C, Yuan R, Wu D, Shen ZJ (2023) A practical end-to-end inventory management model with deep learning. Management Sci. 69(2):759–773.
  • Raffin A, Hill A, Ernestus M, Gleave A, Kanervisto A, Dormann N (2019) Stable Baselines3. https://github.com/DLR-RM/stable-baselines3.
  • Rong Y, Atan Z, Snyder LV (2017) Heuristics for base-stock levels in multi-echelon distribution networks. Production Oper. Management 26(9):1760–1777.
  • Ryu M, Chow Y, Anderson R, Tjandraatmadja C, Boutilier C (2019) CAQL: Continuous action Q-learning. Proc. Internat. Conf. on Learn. Representations (Vancouver).
  • Scarf H (1960) The optimality of (S, s) policies in the dynamic inventory problem. Optimal Pricing, Inflation, and the Cost of Price Adjustment (MIT Press, Cambridge, MA), 49–56.
  • Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. Preprint, submitted July 20, https://arxiv.org/abs/1707.06347.
  • Shapiro A (2003) Monte Carlo sampling methods. Handbook Oper. Res. Management Sci. 10:353–425.
  • Shapiro A, Dentcheva D, Ruszczyński A (2014) Lectures on Stochastic Programming: Modeling and Theory (SIAM, Philadelphia).
  • Sheopuri A, Janakiraman G, Seshadri S (2010) New policies for the stochastic inventory control problem with two supply sources. Oper. Res. 58(3):734–745.
  • Stockheim T, Schwind M, Koenig W (2003) A reinforcement learning approach for supply chain management. Proc. 1st Eur. Workshop Multi-Agent Systems (Oxford, UK).
  • Sultana NN, Meisheri H, Baniwal V, Nath S, Ravindran B, Khadilkar H (2020) Reinforcement learning for multi-product multi-node inventory management in supply chains. Preprint, submitted June 7, https://arxiv.org/abs/2006.04037.
  • Sun J, Van Mieghem JA (2019) Robust dual sourcing inventory management: Optimality of capped dual index policies and smoothing. Manufacturing Service Oper. Management 21(4):912–931.
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).
  • Tjandraatmadja C, Anderson R, Huchette J, Ma W, Patel KK, Vielma JP (2020) The convex relaxation barrier, revisited: Tightened single-neuron relaxations for neural network verification. Adv. Neural Inform. Processing Systems 33:21675–21686.
  • Topaloglu H (2009) Using Lagrangian relaxation to compute capacity-dependent bid prices in network revenue management. Oper. Res. 57(3):637–649.
  • van Heeswijk W, La Poutré H (2019) Approximate dynamic programming with neural networks in linear discrete action spaces. Preprint, submitted February 26, https://arxiv.org/abs/1902.09855.
  • Van Roy B, Bertsekas DP, Lee Y, Tsitsiklis JN (1997) A neuro-dynamic programming approach to retailer inventory management. Proc. 36th IEEE Conf. Decision Control, vol. 4 (IEEE, New York), 4052–4057.
  • Veeraraghavan S, Scheller-Wolf A (2008) Now or later: A simple policy for effective dual sourcing in capacitated systems. Oper. Res. 56(4):850–864.
  • Xin L (2021) Understanding the performance of capped base-stock policies in lost-sales inventory models. Oper. Res. 69(1):61–70.
  • Xu S, Panwar SS, Kodialam M, Lakshman T (2020) Deep neural network approximated dynamic programming for combinatorial optimization. Proc. Conf. AAAI Artificial Intelligence 34:1684–1691.
  • Yarotsky D (2017) Error bounds for approximations with deep ReLU networks. Neural Networks 94:103–114.
  • Young L (2022) Companies face rising supply chain costs amid inventory challenges. Accessed October 7, 2024, https://www.wsj.com/articles/companies-face-rising-supply-chain-costs-amid-inventory-challenges-11655829235.
  • Yu C, Liu J, Nemati S (2019) Reinforcement learning in healthcare: A survey. Preprint, submitted August 22, https://arxiv.org/abs/1908.08796.
  • Zhang D, Adelman D (2009) An approximate dynamic programming approach to network revenue management with customer choice. Transportation Sci. 43(3):381–394.
  • Zipkin P (2008a) Old and new methods for lost-sales inventory systems. Oper. Res. 56(5):1256–1263.
  • Zipkin P (2008b) On the structure of lost-sales inventory models. Oper. Res. 56(4):937–944.