A Primal-Dual Approach to Constrained Markov Decision Processes with Applications to Queue Scheduling and Inventory Management

Published Online:https://doi.org/10.1287/mnsc.2022.03736

References

  • Adelman D, Mersereau AJ (2008) Relaxations of weakly coupled stochastic dynamic programs. Oper. Res. 56(3):712–727.LinkGoogle Scholar
  • Altman E (1994) Denumerable constrained Markov decision processes and finite approximations. Math. Oper. Res. 19(1):169–191.LinkGoogle Scholar
  • Altman E (1999) Constrained Markov Decision Processes, vol. 7 (Chapman and Hall/CRC, Boca Raton, FL).Google Scholar
  • Altman E, Shwartz A (1991) Markov decision problems and state-action frequencies. SIAM J. Control Optim. 29(4):786–809.CrossrefGoogle Scholar
  • Antos A, Szepesvári C, Munos R (2007) Fitted Q-iteration in continuous action-space MDPs. Platt J, Koller D, Singer Y, Roweis S, eds. Advances in Neural Information Processing Systems, vol. 20 (Curran Associates, Inc., Red Hook, NY).Google Scholar
  • Asmussen S, Glynn PW (2007) Stochastic Simulation: Algorithms and Analysis, vol. 57 (Springer, New York).CrossrefGoogle Scholar
  • Balseiro SR, Brown DB, Chen C (2021) Dynamic pricing of relocating resources in large networks. Management Sci. 67(7):4075–4094.Google Scholar
  • Bertsekas DP (2011) Dynamic Programming and Optimal Control, 3rd ed., vol. ii (Athena Scientific, Belmont, MA).Google Scholar
  • Bertsekas DP (2015) Convex Optimization Algorithms (Athena Scientific, Belmont, MA).Google Scholar
  • Bertsimas D, Mišić VV (2016) Decomposable Markov decision processes: A fluid optimization approach. Oper. Res. 64(6):1537–1555.LinkGoogle Scholar
  • Bertsimas D, Orlin JB (1994) A technique for speeding up the solution of the Lagrangian dual. Math. Programming 63(1–3):23–45.CrossrefGoogle Scholar
  • Bhatnagar S, Lakshmanan K (2012) An online actor–critic algorithm with function approximation for constrained Markov decision processes. J. Optim. Theory Appl. 153(3):688–708.CrossrefGoogle Scholar
  • Borkar VS (2005) An actor-critic algorithm for constrained Markov decision processes. Systems Control Lett. 54(3):207–213.CrossrefGoogle Scholar
  • Boutilier C, Lu T (2016) Budget allocation using weakly coupled, constrained Markov decision processes. Proc. 32nd Conf. Uncertainty Artificial Intelligence (AUAI Press, Jersey City, NJ).Google Scholar
  • Boyd S, Boyd SP, Vandenberghe L (2004) Convex Optimization (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Brown DB, Smith JE (2020) Index policies and performance bounds for dynamic selection problems. Management Sci. 66(7):3029–3050.LinkGoogle Scholar
  • Bubeck S (2015) Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, vol. 8 (Now Publishers Inc., Hanover, MA), 231–357.CrossrefGoogle Scholar
  • Caramanis C, Dimitrov NB, Morton DP (2014) Efficient algorithms for budget-constrained Markov decision processes. IEEE Trans. Automatic Control 59(10):2813–2817.CrossrefGoogle Scholar
  • Chen Y, Wang M (2016) Stochastic primal-dual methods and sample complexity of reinforcement learning. Preprint, submitted December 8, https://arxiv.org/abs/1612.02516.Google Scholar
  • Chen J, Dong J, Shi P (2020) A survey on skill-based routing with applications to service operations management. Queueing Systems 96:53–82.CrossrefGoogle Scholar
  • Chow Y, Ghavamzadeh M, Janson L, Pavone M (2017) Risk-constrained reinforcement learning with percentile risk criteria. J. Machine Learn. Res. 18(1):6070–6120.Google Scholar
  • Chow Y, Nachum O, Duenez-Guzman E, Ghavamzadeh M (2018) A Lyapunov-based approach to safe reinforcement learning. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates Inc., Red Hook, NY), 8092–8101.Google Scholar
  • Dai J, Gluzman M (2020) Queueing network controls via deep reinforcement learning. Preprint, submitted July 31, https://arxiv.org/abs/2008.01644.Google Scholar
  • Dai J, Lin W (2008) Asymptotic optimality of maximum pressure policies in stochastic processing networks. Ann. Appl. Probab. 18(6):2239–2299.CrossrefGoogle Scholar
  • Dai J, Shi P (2019) Inpatient overflow: An approximate dynamic programming approach. Manufacturing Service Oper. Management 21(4):894–911.LinkGoogle Scholar
  • De Farias DP, Van Roy B (2003) The linear programming approach to approximate dynamic programming. Oper. Res. 51(6):850–865.LinkGoogle Scholar
  • Ding D, Jovanović MR (2022) Policy gradient primal-dual mirror descent for constrained MDPs with large state spaces. 2022 IEEE 61st Conf. Decision Control (IEEE, Piscataway, NJ), 4892–4897.Google Scholar
  • Dong J, Shi P, Zheng F, Jin X (2019) Off-service placement in inpatient ward network: Resource pooling versus service slowdown. Working paper, Columbia Business School, New York.Google Scholar
  • Du S, Lee J, Tian Y, Singh A, Poczos B (2018) Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Internat. Conf. Machine Learn. (PMLR, New York), 1339–1348.Google Scholar
  • Fan J, Wang Z, Xie Y, Yang Z (2020) A theoretical analysis of deep q-learning. Learn. Dynam. Control (PMLR, New York), 486–489.Google Scholar
  • Farahmand A, Ghavamzadeh M, Szepesvári C, Mannor S (2016) Regularized policy iteration with nonparametric function spaces. J. Machine Learn. Res. 17(1):4809–4874.Google Scholar
  • Gattami A (2019) Reinforcement learning for multi-objective and constrained Markov decision processes. Preprint, submitted January 23, https://arxiv.org/abs/1901.08978.Google Scholar
  • Gijsbrechts J, Boute RN, Van Mieghem JA, Zhang D (2022) Can deep reinforcement learning improve inventory management? Performance on lost sales, dual-sourcing, and multi-echelon problems. Manufacturing Service Oper. Management 24(3):1349–1368.LinkGoogle Scholar
  • Girard C, Green LV, Lewis ME, Xie J (2023) Two-class constrained optimization with applications to queueing control. Naval Res. Logist. 75(3):397–422.CrossrefGoogle Scholar
  • Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. Proc. 13th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 249–256.Google Scholar
  • Gu H, Guo X, Wei X, Xu R (2021) Mean-field multi-agent reinforcement learning: A decentralized network approach. Preprint, submitted August 5, https://arxiv.org/abs/2108.02731.Google Scholar
  • Hayes CF, Rădulescu R, Bargiacchi E, Källström J, Macfarlane M, Reymond M, Verstraeten T, et al. (2022) A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents Multi-Agent Systems 36(1):1–59.CrossrefGoogle Scholar
  • Iglehart DL (1963) Optimality of (s, s) policies in the infinite horizon dynamic inventory problem. Management Sci. 9(2):259–267.LinkGoogle Scholar
  • Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. ICML, vol. 2 (Morgan Kaufmann Publishers Inc., San Francisco, CA), 267–274.Google Scholar
  • Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. Preprint, submitted December 22, https://arxiv.org/abs/1412.6980.Google Scholar
  • Konda V, Tsitsiklis J (1999) Actor-critic algorithms. Solla S, Leen T, Müller K, eds. Advances in Neural Information Processing Systems, vol. 12 (MIT Press, Cambridge, MA).Google Scholar
  • Lan G (2023) Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Math. Programming 198(1):1059–1106.CrossrefGoogle Scholar
  • Le HM, Voloshin C, Yue Y (2019) Batch policy learning under constraints. Preprint, submitted March 20, https://arxiv.org/abs/1903.08738.Google Scholar
  • LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444.CrossrefGoogle Scholar
  • Lee D, He N (2019) Stochastic primal-dual Q-learning algorithm for discounted MDPs. 2019 Amer. Control Conf. (IEEE, Piscataway, NJ), 4897–4902.Google Scholar
  • Lin Q, Nadarajah S, Soheili N (2020) Revisiting approximate linear programming: Constraint-violation learning with applications to inventory control and energy storage. Management Sci. 66(4):1544–1562.LinkGoogle Scholar
  • Liu B, Cai Q, Yang Z, Wang Z (2019) Neural proximal/trust region policy optimization attains globally optimal policy. Preprint, submitted June 25, https://arxiv.org/abs/1906.10306.Google Scholar
  • Mandelbaum A, Stolyar AL (2004) Scheduling flexible servers with convex delay costs: Heavy-traffic optimality of the generalized cμ-rule. Oper. Res. 52(6):836–855.LinkGoogle Scholar
  • Mannor S, Shimkin N (2001) The steering approach for multi-criteria reinforcement learning. Dietterich T, Becker S, Ghahramani Z, eds. Advances in Neural Information Processing Systems, vol. 14 (MIT Press, Cambridge, MA).Google Scholar
  • Miryoosefi S, Brantley K, Daume H III, Dudik M, Schapire RE (2019) Reinforcement learning with convex constraints. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates Inc., Red Hook, NY), 14093–14102.Google Scholar
  • Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. Preprint, submitted December 19, https://arxiv.org/abs/1312.5602.Google Scholar
  • Moallemi CC, Kumar S, Van Roy B (2008) Approximate and data-driven dynamic programming for queueing networks. Unpublished manuscript.Google Scholar
  • Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(27):815–857.Google Scholar
  • Nadarajah S, Margot F, Secomandi N (2015) Relaxations of approximate linear programs for the real option management of commodity storage. Management Sci. 61(12):3054–3076.LinkGoogle Scholar
  • Natarajan S, Tadepalli P (2005) Dynamic preferences in multi-criteria reinforcement learning. Proc. 22nd Internat. Conf. Machine Learn. (ACM, New York), 601–608.Google Scholar
  • Nedić A, Ozdaglar A (2009) Subgradient methods for saddle-point problems. J. Optim. Theory Appl. 142(1):205–228.CrossrefGoogle Scholar
  • Neely MJ (2011) Online fractional programming for Markov decision systems. Forty-ninth Annual Allerton Conf. Comm. Control Comput. (IEEE, Piscataway, NJ), 353–360.Google Scholar
  • Nemirovski A (2004) Prox-method with rate of convergence o (1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1):229–251.CrossrefGoogle Scholar
  • Nocedal J, Wright SJ (1999) Numerical Optimization (Springer, New York).CrossrefGoogle Scholar
  • Oroojlooyjadid A, Nazari M, Snyder LV, Takáč M (2022) A deep Q-network for the beer game: Deep reinforcement learning for inventory optimization. Manufacturing Service Oper. Management 24(1):285–304.LinkGoogle Scholar
  • Pakiman P, Nadarajah S, Soheili N, Lin Q (2020) Self-guided approximate linear programs. Preprint, submitted January 9, https://arxiv.org/abs/2001.02798.Google Scholar
  • Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • Qi M, Shi Y, Qi Y, Ma C, Yuan R, Wu D, Shen ZJ (2023) A practical end-to-end inventory management model with deep learning. Management Sci. 69(2):759–773.LinkGoogle Scholar
  • Sam T, Chen Y, Yu CL (2022) Overcoming the long horizon barrier for sample-efficient reinforcement learning with latent low-rank structure. Preprint, submitted June 7, https://arxiv.org/abs/2206.03569.Google Scholar
  • Scarf H (1960) The optimality of (S, s) policies in the dynamic inventory problem. Sheshinski E, Wiess Y, eds. Optimal Pricing, Inflation, and the Cost of Price Adjustment (The MIT Press, Cambridge, MA), 49–56.Google Scholar
  • Schweitzer PJ, Seidmann A (1985) Generalized polynomial approximations in Markovian decision processes. J. Math. Anal. Appl. 110(2):568–582.CrossrefGoogle Scholar
  • Singh SP, Cohn D (1998) How to dynamically merge Markov decision processes. Jordan M, Kearns M, Solla S, eds. Advances in Neural Information Processing Systems, vol. 10 (MIT Press, Cambridge, MA), 1057–1063.Google Scholar
  • Singh R, Kumar P (2018) Throughput optimal decentralized scheduling of multihop networks with end-to-end deadline constraints: Unreliable links. IEEE Trans. Automatic Control 64(1):127–142.CrossrefGoogle Scholar
  • Sion M (1958) On general minimax theorems. Pacific J. Math. 8(1):171–176.CrossrefGoogle Scholar
  • Song H, Tucker AL, Graue R, Moravick S, Yang JJ (2020) Capacity pooling in hospitals: The hidden consequences of off-service placement. Management Sci. 66(9):3825–3842.LinkGoogle Scholar
  • Sun P, Wang K, Zipkin P (2014) Quadratic approximation of cost functions in lost sales and perishable inventory control problems. Unpublished manuscript.Google Scholar
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • Tesauro G, Das R, Chan H, Kephart J, Levine D, Rawson F, Lefurgy C (2007) Managing power consumption and performance of computing systems using reinforcement learning. Platt J, Koller D, Singer Y, Roweis S, eds. Advances in Neural Information Processing Systems, vol. 20 (Curran Associates Inc., Red Hook, NY).Google Scholar
  • Tessler C, Mankowitz DJ, Mannor S (2018) Reward constrained policy optimization. Preprint, submitted May 28, https://arxiv.org/abs/1805.11074.Google Scholar
  • Tiapkin D, Gasnikov A (2022) Primal-dual stochastic mirror descent for mdps. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 9723–9740.Google Scholar
  • Topaloglu H (2009) Using Lagrangian relaxation to compute capacity-dependent bid prices in network revenue management. Oper. Res. 57(3):637–649.LinkGoogle Scholar
  • Topaloglu H, Kunnumkal S (2006) Approximate dynamic programming methods for an inventory allocation problem under uncertainty. Naval Res. Logist. 53(8):822–841.CrossrefGoogle Scholar
  • Turken N, Tan Y, Vakharia AJ, Wang L, Wang R, Yenipazarli A (2012) The multi-product newsvendor problem: Review, extensions, and directions for future research. Handbook of Newsvendor Problems (Springer, New York), 3–39.CrossrefGoogle Scholar
  • Van Roy B, Bertsekas DP, Lee Y, Tsitsiklis JN (1997) A neuro-dynamic programming approach to retailer inventory management. Proc. 36th IEEE Conf. Decision Control, vol. 4 (IEEE, Piscataway, NJ), 4052–4057.Google Scholar
  • Veatch MH (2005) Approximate dynamic programming for networks: Fluid models and constraint reduction. Unpublished manuscript.Google Scholar
  • Veinott AF Jr, Wagner HM (1965) Computing optimal (s, s) inventory policies. Management Sci. 11(5):525–552.LinkGoogle Scholar
  • Yang R, Sun X, Narasimhan K (2019) A generalized algorithm for multi-objective reinforcement learning and policy adaptation. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates Inc., Red Hook, NY).Google Scholar
  • Yu H, Neely M, Wei X (2017) Online convex optimization with stochastic constraints. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 30 (Curran Associates Inc., Red Hook, NY).Google Scholar
  • Zhang X, Frazier PI (2022) Near-optimality for infinite-horizon restless bandits with many arms. Preprint, submitted March 29, https://arxiv.org/abs/2203.15853.Google Scholar
  • Zhou D, Chen J, Gu Q (2020) Provable multi-objective reinforcement learning with generative models. Preprint, submitted November 19, https://arxiv.org/abs/2011.10134.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.