A Primal-Dual Approach to Constrained Markov Decision Processes with Applications to Queue Scheduling and Inventory Management

Yi Chen
Yi Chen
[email protected]
Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology, Hong Kong
Search for more papers by this author
,
Jing Dong
Corresponding Author
Jing Dong
[email protected]
https://orcid.org/0000-0001-6387-4088
Graduate School of Business, Columbia University, New York, New York 10027
Search for more papers by this author
,
Zhaoran Wang
Zhaoran Wang
[email protected]
https://orcid.org/0000-0002-1824-2580
Department of Industrial Engineering & Management Sciences, Northwestern University, Evanston, Illinois 60208
Search for more papers by this author
,
Chuheng Zhang
Chuheng Zhang
[email protected]
Microsoft Research, Beijing, China
Search for more papers by this author

Yi Chen

[email protected]

Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology, Hong Kong

Search for more papers by this author

Jing Dong

Corresponding Author

Jing Dong

[email protected]

https://orcid.org/0000-0001-6387-4088

Graduate School of Business, Columbia University, New York, New York 10027

Search for more papers by this author

Zhaoran Wang

[email protected]

https://orcid.org/0000-0002-1824-2580

Department of Industrial Engineering & Management Sciences, Northwestern University, Evanston, Illinois 60208

Search for more papers by this author

Chuheng Zhang

[email protected]

Microsoft Research, Beijing, China

Search for more papers by this author

Published Online:20 May 2025https://doi.org/10.1287/mnsc.2022.03736

References

Adelman D, Mersereau AJ (2008) Relaxations of weakly coupled stochastic dynamic programs. Oper. Res. 56(3):712–727.Link, Google Scholar
Altman E (1994) Denumerable constrained Markov decision processes and finite approximations. Math. Oper. Res. 19(1):169–191.Link, Google Scholar
Altman E (1999) Constrained Markov Decision Processes, vol. 7 (Chapman and Hall/CRC, Boca Raton, FL).Google Scholar
Altman E, Shwartz A (1991) Markov decision problems and state-action frequencies. SIAM J. Control Optim. 29(4):786–809.Crossref, Google Scholar
Antos A, Szepesvári C, Munos R (2007) Fitted Q-iteration in continuous action-space MDPs. Platt J, Koller D, Singer Y, Roweis S, eds. Advances in Neural Information Processing Systems, vol. 20 (Curran Associates, Inc., Red Hook, NY).Google Scholar
Asmussen S, Glynn PW (2007) Stochastic Simulation: Algorithms and Analysis, vol. 57 (Springer, New York).Crossref, Google Scholar
Balseiro SR, Brown DB, Chen C (2021) Dynamic pricing of relocating resources in large networks. Management Sci. 67(7):4075–4094.Google Scholar
Bertsekas DP (2011) Dynamic Programming and Optimal Control, 3rd ed., vol. ii (Athena Scientific, Belmont, MA).Google Scholar
Bertsekas DP (2015) Convex Optimization Algorithms (Athena Scientific, Belmont, MA).Google Scholar
Bertsimas D, Mišić VV (2016) Decomposable Markov decision processes: A fluid optimization approach. Oper. Res. 64(6):1537–1555.Link, Google Scholar
Bertsimas D, Orlin JB (1994) A technique for speeding up the solution of the Lagrangian dual. Math. Programming 63(1–3):23–45.Crossref, Google Scholar
Bhatnagar S, Lakshmanan K (2012) An online actor–critic algorithm with function approximation for constrained Markov decision processes. J. Optim. Theory Appl. 153(3):688–708.Crossref, Google Scholar
Borkar VS (2005) An actor-critic algorithm for constrained Markov decision processes. Systems Control Lett. 54(3):207–213.Crossref, Google Scholar
Boutilier C, Lu T (2016) Budget allocation using weakly coupled, constrained Markov decision processes. Proc. 32nd Conf. Uncertainty Artificial Intelligence (AUAI Press, Jersey City, NJ).Google Scholar
Boyd S, Boyd SP, Vandenberghe L (2004) Convex Optimization (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Brown DB, Smith JE (2020) Index policies and performance bounds for dynamic selection problems. Management Sci. 66(7):3029–3050.Link, Google Scholar
Bubeck S (2015) Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, vol. 8 (Now Publishers Inc., Hanover, MA), 231–357.Crossref, Google Scholar
Caramanis C, Dimitrov NB, Morton DP (2014) Efficient algorithms for budget-constrained Markov decision processes. IEEE Trans. Automatic Control 59(10):2813–2817.Crossref, Google Scholar
Chen Y, Wang M (2016) Stochastic primal-dual methods and sample complexity of reinforcement learning. Preprint, submitted December 8, https://arxiv.org/abs/1612.02516.Google Scholar
Chen J, Dong J, Shi P (2020) A survey on skill-based routing with applications to service operations management. Queueing Systems 96:53–82.Crossref, Google Scholar
Chow Y, Ghavamzadeh M, Janson L, Pavone M (2017) Risk-constrained reinforcement learning with percentile risk criteria. J. Machine Learn. Res. 18(1):6070–6120.Google Scholar
Chow Y, Nachum O, Duenez-Guzman E, Ghavamzadeh M (2018) A Lyapunov-based approach to safe reinforcement learning. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates Inc., Red Hook, NY), 8092–8101.Google Scholar
Dai J, Gluzman M (2020) Queueing network controls via deep reinforcement learning. Preprint, submitted July 31, https://arxiv.org/abs/2008.01644.Google Scholar
Dai J, Lin W (2008) Asymptotic optimality of maximum pressure policies in stochastic processing networks. Ann. Appl. Probab. 18(6):2239–2299.Crossref, Google Scholar
Dai J, Shi P (2019) Inpatient overflow: An approximate dynamic programming approach. Manufacturing Service Oper. Management 21(4):894–911.Link, Google Scholar
De Farias DP, Van Roy B (2003) The linear programming approach to approximate dynamic programming. Oper. Res. 51(6):850–865.Link, Google Scholar
Ding D, Jovanović MR (2022) Policy gradient primal-dual mirror descent for constrained MDPs with large state spaces. 2022 IEEE 61st Conf. Decision Control (IEEE, Piscataway, NJ), 4892–4897.Google Scholar
Dong J, Shi P, Zheng F, Jin X (2019) Off-service placement in inpatient ward network: Resource pooling versus service slowdown. Working paper, Columbia Business School, New York.Google Scholar
Du S, Lee J, Tian Y, Singh A, Poczos B (2018) Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Internat. Conf. Machine Learn. (PMLR, New York), 1339–1348.Google Scholar
Fan J, Wang Z, Xie Y, Yang Z (2020) A theoretical analysis of deep q-learning. Learn. Dynam. Control (PMLR, New York), 486–489.Google Scholar
Farahmand A, Ghavamzadeh M, Szepesvári C, Mannor S (2016) Regularized policy iteration with nonparametric function spaces. J. Machine Learn. Res. 17(1):4809–4874.Google Scholar
Gattami A (2019) Reinforcement learning for multi-objective and constrained Markov decision processes. Preprint, submitted January 23, https://arxiv.org/abs/1901.08978.Google Scholar
Gijsbrechts J, Boute RN, Van Mieghem JA, Zhang D (2022) Can deep reinforcement learning improve inventory management? Performance on lost sales, dual-sourcing, and multi-echelon problems. Manufacturing Service Oper. Management 24(3):1349–1368.Link, Google Scholar
Girard C, Green LV, Lewis ME, Xie J (2023) Two-class constrained optimization with applications to queueing control. Naval Res. Logist. 75(3):397–422.Crossref, Google Scholar
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. Proc. 13th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 249–256.Google Scholar
Gu H, Guo X, Wei X, Xu R (2021) Mean-field multi-agent reinforcement learning: A decentralized network approach. Preprint, submitted August 5, https://arxiv.org/abs/2108.02731.Google Scholar
Hayes CF, Rădulescu R, Bargiacchi E, Källström J, Macfarlane M, Reymond M, Verstraeten T, et al. (2022) A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents Multi-Agent Systems 36(1):1–59.Crossref, Google Scholar
Iglehart DL (1963) Optimality of (s, s) policies in the infinite horizon dynamic inventory problem. Management Sci. 9(2):259–267.Link, Google Scholar
Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. ICML, vol. 2 (Morgan Kaufmann Publishers Inc., San Francisco, CA), 267–274.Google Scholar
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. Preprint, submitted December 22, https://arxiv.org/abs/1412.6980.Google Scholar
Konda V, Tsitsiklis J (1999) Actor-critic algorithms. Solla S, Leen T, Müller K, eds. Advances in Neural Information Processing Systems, vol. 12 (MIT Press, Cambridge, MA).Google Scholar
Lan G (2023) Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Math. Programming 198(1):1059–1106.Crossref, Google Scholar
Le HM, Voloshin C, Yue Y (2019) Batch policy learning under constraints. Preprint, submitted March 20, https://arxiv.org/abs/1903.08738.Google Scholar
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444.Crossref, Google Scholar
Lee D, He N (2019) Stochastic primal-dual Q-learning algorithm for discounted MDPs. 2019 Amer. Control Conf. (IEEE, Piscataway, NJ), 4897–4902.Google Scholar
Lin Q, Nadarajah S, Soheili N (2020) Revisiting approximate linear programming: Constraint-violation learning with applications to inventory control and energy storage. Management Sci. 66(4):1544–1562.Link, Google Scholar
Liu B, Cai Q, Yang Z, Wang Z (2019) Neural proximal/trust region policy optimization attains globally optimal policy. Preprint, submitted June 25, https://arxiv.org/abs/1906.10306.Google Scholar
Mandelbaum A, Stolyar AL (2004) Scheduling flexible servers with convex delay costs: Heavy-traffic optimality of the generalized cμ-rule. Oper. Res. 52(6):836–855.Link, Google Scholar
Mannor S, Shimkin N (2001) The steering approach for multi-criteria reinforcement learning. Dietterich T, Becker S, Ghahramani Z, eds. Advances in Neural Information Processing Systems, vol. 14 (MIT Press, Cambridge, MA).Google Scholar
Miryoosefi S, Brantley K, Daume H III, Dudik M, Schapire RE (2019) Reinforcement learning with convex constraints. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates Inc., Red Hook, NY), 14093–14102.Google Scholar
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. Preprint, submitted December 19, https://arxiv.org/abs/1312.5602.Google Scholar
Moallemi CC, Kumar S, Van Roy B (2008) Approximate and data-driven dynamic programming for queueing networks. Unpublished manuscript.Google Scholar
Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(27):815–857.Google Scholar
Nadarajah S, Margot F, Secomandi N (2015) Relaxations of approximate linear programs for the real option management of commodity storage. Management Sci. 61(12):3054–3076.Link, Google Scholar
Natarajan S, Tadepalli P (2005) Dynamic preferences in multi-criteria reinforcement learning. Proc. 22nd Internat. Conf. Machine Learn. (ACM, New York), 601–608.Google Scholar
Nedić A, Ozdaglar A (2009) Subgradient methods for saddle-point problems. J. Optim. Theory Appl. 142(1):205–228.Crossref, Google Scholar
Neely MJ (2011) Online fractional programming for Markov decision systems. Forty-ninth Annual Allerton Conf. Comm. Control Comput. (IEEE, Piscataway, NJ), 353–360.Google Scholar
Nemirovski A (2004) Prox-method with rate of convergence o (1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1):229–251.Crossref, Google Scholar
Nocedal J, Wright SJ (1999) Numerical Optimization (Springer, New York).Crossref, Google Scholar
Oroojlooyjadid A, Nazari M, Snyder LV, Takáč M (2022) A deep Q-network for the beer game: Deep reinforcement learning for inventory optimization. Manufacturing Service Oper. Management 24(1):285–304.Link, Google Scholar
Pakiman P, Nadarajah S, Soheili N, Lin Q (2020) Self-guided approximate linear programs. Preprint, submitted January 9, https://arxiv.org/abs/2001.02798.Google Scholar
Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
Qi M, Shi Y, Qi Y, Ma C, Yuan R, Wu D, Shen ZJ (2023) A practical end-to-end inventory management model with deep learning. Management Sci. 69(2):759–773.Link, Google Scholar
Sam T, Chen Y, Yu CL (2022) Overcoming the long horizon barrier for sample-efficient reinforcement learning with latent low-rank structure. Preprint, submitted June 7, https://arxiv.org/abs/2206.03569.Google Scholar
Scarf H (1960) The optimality of (S, s) policies in the dynamic inventory problem. Sheshinski E, Wiess Y, eds. Optimal Pricing, Inflation, and the Cost of Price Adjustment (The MIT Press, Cambridge, MA), 49–56.Google Scholar
Schweitzer PJ, Seidmann A (1985) Generalized polynomial approximations in Markovian decision processes. J. Math. Anal. Appl. 110(2):568–582.Crossref, Google Scholar
Singh SP, Cohn D (1998) How to dynamically merge Markov decision processes. Jordan M, Kearns M, Solla S, eds. Advances in Neural Information Processing Systems, vol. 10 (MIT Press, Cambridge, MA), 1057–1063.Google Scholar
Singh R, Kumar P (2018) Throughput optimal decentralized scheduling of multihop networks with end-to-end deadline constraints: Unreliable links. IEEE Trans. Automatic Control 64(1):127–142.Crossref, Google Scholar
Sion M (1958) On general minimax theorems. Pacific J. Math. 8(1):171–176.Crossref, Google Scholar
Song H, Tucker AL, Graue R, Moravick S, Yang JJ (2020) Capacity pooling in hospitals: The hidden consequences of off-service placement. Management Sci. 66(9):3825–3842.Link, Google Scholar
Sun P, Wang K, Zipkin P (2014) Quadratic approximation of cost functions in lost sales and perishable inventory control problems. Unpublished manuscript.Google Scholar
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
Tesauro G, Das R, Chan H, Kephart J, Levine D, Rawson F, Lefurgy C (2007) Managing power consumption and performance of computing systems using reinforcement learning. Platt J, Koller D, Singer Y, Roweis S, eds. Advances in Neural Information Processing Systems, vol. 20 (Curran Associates Inc., Red Hook, NY).Google Scholar
Tessler C, Mankowitz DJ, Mannor S (2018) Reward constrained policy optimization. Preprint, submitted May 28, https://arxiv.org/abs/1805.11074.Google Scholar
Tiapkin D, Gasnikov A (2022) Primal-dual stochastic mirror descent for mdps. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 9723–9740.Google Scholar
Topaloglu H (2009) Using Lagrangian relaxation to compute capacity-dependent bid prices in network revenue management. Oper. Res. 57(3):637–649.Link, Google Scholar
Topaloglu H, Kunnumkal S (2006) Approximate dynamic programming methods for an inventory allocation problem under uncertainty. Naval Res. Logist. 53(8):822–841.Crossref, Google Scholar
Turken N, Tan Y, Vakharia AJ, Wang L, Wang R, Yenipazarli A (2012) The multi-product newsvendor problem: Review, extensions, and directions for future research. Handbook of Newsvendor Problems (Springer, New York), 3–39.Crossref, Google Scholar
Van Roy B, Bertsekas DP, Lee Y, Tsitsiklis JN (1997) A neuro-dynamic programming approach to retailer inventory management. Proc. 36th IEEE Conf. Decision Control, vol. 4 (IEEE, Piscataway, NJ), 4052–4057.Google Scholar
Veatch MH (2005) Approximate dynamic programming for networks: Fluid models and constraint reduction. Unpublished manuscript.Google Scholar
Veinott AF Jr, Wagner HM (1965) Computing optimal (s, s) inventory policies. Management Sci. 11(5):525–552.Link, Google Scholar
Yang R, Sun X, Narasimhan K (2019) A generalized algorithm for multi-objective reinforcement learning and policy adaptation. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates Inc., Red Hook, NY).Google Scholar
Yu H, Neely M, Wei X (2017) Online convex optimization with stochastic constraints. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 30 (Curran Associates Inc., Red Hook, NY).Google Scholar
Zhang X, Frazier PI (2022) Near-optimality for infinite-horizon restless bandits with many arms. Preprint, submitted March 29, https://arxiv.org/abs/2203.15853.Google Scholar
Zhou D, Chen J, Gu Q (2020) Provable multi-objective reinforcement learning with generative models. Preprint, submitted November 19, https://arxiv.org/abs/2011.10134.Google Scholar

Volume 72, Issue 2

February 2026

Pages 783-1726, iv-vi

Article Information

Supplemental Material

Metrics

Information

Received:December 02, 2022
Accepted:August 12, 2024
Published Online:May 20, 2025

Cite as

Yi Chen, Jing Dong, Zhaoran Wang, Chuheng Zhang (2025) A Primal-Dual Approach to Constrained Markov Decision Processes with Applications to Queue Scheduling and Inventory Management. Management Science 72(2):955-988.

https://doi.org/10.1287/mnsc.2022.03736

Keywords

Acknowledgments

The authors thank the associate editor and referees for their valuable comments and suggestions that helped improve the paper.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

A Primal-Dual Approach to Constrained Markov Decision Processes with Applications to Queue Scheduling and Inventory Management

References

Volume 72, Issue 2

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News