Deep Policy Iteration with Integer Programming for Inventory Management

Pavithra Harsha (Corresponding Author)
[email protected]
https://orcid.org/0000-0002-6049-7739
Thomas J. Watson Research Center, IBM Research, Yorktown Heights, New York 10598

Ashish Jagmohan
[email protected]
Merlin Mind, New York, New York 10018

Jayant Kalagnanam
[email protected]
Thomas J. Watson Research Center, IBM Research, Yorktown Heights, New York 10598

Brian Quanz
[email protected]
https://orcid.org/0000-0002-4136-5538
Thomas J. Watson Research Center, IBM Research, Yorktown Heights, New York 10598

Divya Singhvi (Corresponding Author)
[email protected]
https://orcid.org/0000-0001-8763-015X
Leonard N. Stern School of Business, New York University, New York, New York 10012

Published Online: 6 Jan 2025
https://doi.org/10.1287/msom.2022.0617

References

Achiam J (2018) Spinning up in deep reinforcement learning. https://github.com/openai/spinningup.
Agrawal S, Jia R (2019) Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management. Proc. ACM Conf. Econom. Comput. (Association for Computing Machinery (ACM), New York), 743–744.
Agarwal R, Schwarzer M, Castro PS, Courville AC, Bellemare M (2021) Deep reinforcement learning at the edge of the statistical precipice. Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW, eds. Advances in Neural Information Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 29304–29320.
Allon G, Van Mieghem JA (2010) Global dual sourcing: Tailored base-surge allocation to near- and offshore production. Management Sci. 56(1):110–124.
Anderson R, Huchette J, Ma W, Tjandraatmadja C, Vielma JP (2020) Strong mixed-integer programming formulations for trained neural networks. Math. Programming 183(1):3–39.
Bansal S, Nagarajan M (2022) A Monge sequence-based approach to characterize the competitive newsvendor problem. Oper. Res. 70(2):805–814.
Bartlett PL, Harvey N, Liaw C, Mehrabian A (2019) Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Machine Learn. Res. 20(63):1–17.
Bertsekas D (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).
Bertsekas D (2017) Dynamic Programming and Optimal Control: Volume I and II (Athena Scientific, Belmont, MA).
Bolusani S, Besançon M, Bestuzheva K, Chmiela A, Dionísio J, Donkiewicz T, van Doornmalen J, et al. (2024) The SCIP Optimization Suite 9.0. Accessed October 7, 2024, https://optimization-online.org/2024/02/the-scip-optimization-suite-9-0/.
Boute RN, Gijsbrechts J, van Jaarsveld W, Vanvuchelen N (2022) Deep reinforcement learning for inventory control: A roadmap. Eur. J. Oper. Res. 298(2):401–412.
Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI Gym. Preprint, submitted June 5, https://arxiv.org/abs/1606.01540.
Caro F, Gallien J (2010) Inventory management of a fast-fashion retail network. Oper. Res. 58(2):257–273.
Clark AJ, Scarf H (1960) Optimal policies for a multi-echelon inventory problem. Management Sci. 6(4):475–490.
de Kok T, Grob C, Laumanns M, Minner S, Rambau J, Schade K (2018) A typology and literature review on stochastic multi-echelon inventory models. Eur. J. Oper. Res. 269(3):955–983.
De Moor BJ, Gijsbrechts J, Boute RN (2022) Reward shaping to improve the performance of deep reinforcement learning in perishable inventory management. Eur. J. Oper. Res. 301(2):535–545.
Delarue A, Anderson R, Tjandraatmadja C (2020) Reinforcement learning with combinatorial actions: An application to vehicle routing. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 609–620.
Duan Y, Chen X, Houthooft R, Schulman J, Abbeel P (2016) Benchmarking deep reinforcement learning for continuous control. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn. (Proceedings of Machine Learning Research (PMLR), Cambridge, MA), 1329–1338.
Farias VF, Van Roy B (2007) An approximate dynamic programming approach to network revenue management. Accessed October 7, 2024, https://web.mit.edu/vivekf/www/papers/ADP-rm-07-03.pdf.
Federgruen A, Zipkin P (1984) Approximations of dynamic, multilocation production and inventory problems. Management Sci. 30(1):69–84.
Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. (Proceedings of Machine Learning Research (PMLR), Cambridge, MA), 1587–1596.
Giannoccaro I, Pontrandolfo P (2002) Inventory management in supply chains: A reinforcement learning approach. Internat. J. Production Econom. 78(2):153–161.
Gijsbrechts J, Boute RN, Van Mieghem JA, Zhang D (2022) Can deep reinforcement learning improve inventory management? Performance on dual sourcing, lost sales and multi-echelon problems. Manufacturing Service Oper. Management 24(3):1349–1368.
Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. Gordon G, Dunson D, Dudík M, eds. Proc. 14th Internat. Conf. Artificial Intelligence Statistics (Proceedings of Machine Learning Research (PMLR), Cambridge, MA), 315–323.
Goldberg DA, Katz-Rogozhnikov DA, Lu Y, Sharma M, Squillante MS (2016) Asymptotic optimality of constant-order policies for lost sales inventory models with large lead times. Math. Oper. Res. 41(3):898–913.
Golowich N, Rakhlin A, Shamir O (2018) Size-independent sample complexity of neural networks. Bubeck S, Perchet V, Rigollet P, eds. Proc. 31st Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 75 (PMLR, New York), 297–299.
Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 75 (PMLR, New York), 1861–1870.
Hara K, Saito D, Shouno H (2015) Analysis of function of rectified linear unit used in deep learning. Proc. Internat. Joint Conf. Neural Networks (Institute of Electrical and Electronics Engineers (IEEE), Piscataway, NJ), 1–8.
Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D (2018) Deep reinforcement learning that matters. McIlraith SA, Weinberger KQ, eds. Proc. 32nd AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 3207–3214.
Hinton G, Srivastava N, Swersky K (2012) Lecture 6e-rmsprop: Divide the gradient by a running average of its recent magnitude. Neural Networks Machine Learn. 4(2):26–31.
Hubbs CD, Perez HD, Sarwar O, Sahinidis NV, Grossmann IE, Wassick JM (2020) OR-Gym: A reinforcement learning library for operations research problems. Preprint, submitted August 14, https://arxiv.org/abs/2008.06319.
Huh WT, Janakiraman G, Muckstadt JA, Rusmevichientong P (2009) Asymptotic optimality of order-up-to policies in lost sales inventory systems. Management Sci. 55(3):404–420.
Kim S, Pasupathy R, Henderson SG (2015) A guide to sample average approximation. Handbook of Simulation Optimization (Springer New York, New York), 207–243.
Kingma DP, Ba J (2015) Adam: A Method for Stochastic Optimization (ICLR, Ithaca, NY).
Kober J, Bagnell JA, Peters J (2013) Reinforcement learning in robotics: A survey. Internat. J. Robotics Res. 32(11):1238–1274.
Levi R, Janakiraman G, Nagarajan M (2008) A 2-approximation algorithm for stochastic inventory control models with lost sales. Math. Oper. Res. 33(2):351–374.
Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, et al. (2015) Continuous control with deep reinforcement learning. Preprint, submitted September 9, https://arxiv.org/abs/1509.02971.
Lougee-Heimer R (2003) The common optimization interface for operations research: Promoting open-source software in the operations research community. IBM J. Res. Development 47(1):57–66.
Maei H, Szepesvari C, Bhatnagar S, Precup D, Silver D, Sutton RS (2009) Convergent temporal-difference learning with arbitrary smooth function approximation. Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A, eds. Advances in Neural Information Processing Systems, vol. 22 (Curran Associates, Inc., Red Hook, NY), 1205–1212.
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. Preprint, submitted December 19, https://arxiv.org/abs/1312.5602.
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, et al. (2016) Asynchronous methods for deep reinforcement learning. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn. (Proceedings of Machine Learning Research (PMLR), Cambridge, MA), 1928–1937.
Munos R (2003) Error bounds for approximate policy iteration. ICML (AAAI Press, Washington, DC), 560–567.
Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(5).
Oroojlooyjadid A, Nazari M, Snyder LV, Takáč M (2022) A deep Q-network for the beer game: Deep reinforcement learning for inventory optimization. Manufacturing Service Oper. Management 24(1):285–304.
Özer Ö, Xiong H (2008) Stock positioning and performance estimation for distribution systems with service constraints. IIE Trans. 40(12):1141–1157.
Pirhooshyaran M, Snyder LV (2020) Simultaneous decision making for stochastic multi-echelon inventory optimization with deep neural networks as decision makers. Preprint, submitted June 10, https://arxiv.org/abs/2006.05608.
Powell WB (2007) Approximate Dynamic Programming: Solving the Curses of Dimensionality, vol. 703 (John Wiley & Sons, Hoboken, NJ).
Qi M, Shi Y, Qi Y, Ma C, Yuan R, Wu D, Shen ZJ (2023) A practical end-to-end inventory management model with deep learning. Management Sci. 69(2):759–773.
Raffin A, Hill A, Ernestus M, Gleave A, Kanervisto A, Dormann N (2019) Stable Baselines3. https://github.com/DLR-RM/stable-baselines3.
Rong Y, Atan Z, Snyder LV (2017) Heuristics for base-stock levels in multi-echelon distribution networks. Production Oper. Management 26(9):1760–1777.
Ryu M, Chow Y, Anderson R, Tjandraatmadja C, Boutilier C (2019) CAQL: Continuous action Q-learning. Proc. Internat. Conf. on Learn. Representations (Vancouver).
Scarf H (1960) The optimality of (S, s) policies in the dynamic inventory problem. Optimal Pricing, Inflation, and the Cost of Price Adjustment (MIT Press, Cambridge, MA), 49–56.
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. Preprint, submitted July 20, https://arxiv.org/abs/1707.06347.
Shapiro A (2003) Monte Carlo sampling methods. Handbook Oper. Res. Management Sci. 10:353–425.
Shapiro A, Dentcheva D, Ruszczyński A (2014) Lectures on Stochastic Programming: Modeling and Theory (SIAM, Philadelphia).
Sheopuri A, Janakiraman G, Seshadri S (2010) New policies for the stochastic inventory control problem with two supply sources. Oper. Res. 58(3):734–745.
Stockheim T, Schwind M, Koenig W (2003) A reinforcement learning approach for supply chain management. Proc. 1st Eur. Workshop Multi-Agent Systems (Oxford, UK).
Sultana NN, Meisheri H, Baniwal V, Nath S, Ravindran B, Khadilkar H (2020) Reinforcement learning for multi-product multi-node inventory management in supply chains. Preprint, submitted June 7, https://arxiv.org/abs/2006.04037.
Sun J, Van Mieghem JA (2019) Robust dual sourcing inventory management: Optimality of capped dual index policies and smoothing. Manufacturing Service Oper. Management 21(4):912–931.
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).
Tjandraatmadja C, Anderson R, Huchette J, Ma W, Patel KK, Vielma JP (2020) The convex relaxation barrier, revisited: Tightened single-neuron relaxations for neural network verification. Adv. Neural Inform. Processing Systems 33:21675–21686.
Topaloglu H (2009) Using Lagrangian relaxation to compute capacity-dependent bid prices in network revenue management. Oper. Res. 57(3):637–649.
van Heeswijk W, La Poutré H (2019) Approximate dynamic programming with neural networks in linear discrete action spaces. Preprint, submitted February 26, https://arxiv.org/abs/1902.09855.
Van Roy B, Bertsekas DP, Lee Y, Tsitsiklis JN (1997) A neuro-dynamic programming approach to retailer inventory management. Proc. 36th IEEE Conf. Decision Control, vol. 4 (IEEE, New York), 4052–4057.
Veeraraghavan S, Scheller-Wolf A (2008) Now or later: A simple policy for effective dual sourcing in capacitated systems. Oper. Res. 56(4):850–864.
Xin L (2021) Understanding the performance of capped base-stock policies in lost-sales inventory models. Oper. Res. 69(1):61–70.
Xu S, Panwar SS, Kodialam M, Lakshman T (2020) Deep neural network approximated dynamic programming for combinatorial optimization. Proc. Conf. AAAI Artificial Intelligence 34:1684–1691.
Yarotsky D (2017) Error bounds for approximations with deep ReLU networks. Neural Networks 94:103–114.
Young L (2022) Companies face rising supply chain costs amid inventory challenges. Accessed October 7, 2024, https://www.wsj.com/articles/companies-face-rising-supply-chain-costs-amid-inventory-challenges-11655829235.
Yu C, Liu J, Nemati S (2019) Reinforcement learning in healthcare: A survey. Preprint, submitted August 22, https://arxiv.org/abs/1908.08796.
Zhang D, Adelman D (2009) An approximate dynamic programming approach to network revenue management with customer choice. Transportation Sci. 43(3):381–394.
Zipkin P (2008a) Old and new methods for lost-sales inventory systems. Oper. Res. 56(5):1256–1263.
Zipkin P (2008b) On the structure of lost-sales inventory models. Oper. Res. 56(4):937–944.

Manufacturing & Service Operations Management

Volume 27, Issue 2

March-April 2025

Pages 339-678, C2

Article Information

Received: December 08, 2022
Accepted: November 08, 2024
Published Online: January 06, 2025

Cite as

Pavithra Harsha, Ashish Jagmohan, Jayant Kalagnanam, Brian Quanz, Divya Singhvi (2025) Deep Policy Iteration with Integer Programming for Inventory Management. Manufacturing & Service Operations Management 27(2):369-388.

https://doi.org/10.1287/msom.2022.0617
