A Primal-Dual Approach to Constrained Markov Decision Processes with Applications to Queue Scheduling and Inventory Management
References
- (2008) Relaxations of weakly coupled stochastic dynamic programs. Oper. Res. 56(3):712–727.Link, Google Scholar
- (1994) Denumerable constrained Markov decision processes and finite approximations. Math. Oper. Res. 19(1):169–191.Link, Google Scholar
- (1999) Constrained Markov Decision Processes, vol. 7 (Chapman and Hall/CRC, Boca Raton, FL).Google Scholar
- (1991) Markov decision problems and state-action frequencies. SIAM J. Control Optim. 29(4):786–809.Crossref, Google Scholar
- (2007) Fitted Q-iteration in continuous action-space MDPs. Platt J, Koller D, Singer Y, Roweis S, eds. Advances in Neural Information Processing Systems, vol. 20 (Curran Associates, Inc., Red Hook, NY).Google Scholar
- (2007) Stochastic Simulation: Algorithms and Analysis, vol. 57 (Springer, New York).Crossref, Google Scholar
- Balseiro SR, Brown DB, Chen C (2021) Dynamic pricing of relocating resources in large networks. Management Sci. 67(7):4075–4094.Google Scholar
- (2011) Dynamic Programming and Optimal Control, 3rd ed., vol. ii (Athena Scientific, Belmont, MA).Google Scholar
- (2015) Convex Optimization Algorithms (Athena Scientific, Belmont, MA).Google Scholar
- (2016) Decomposable Markov decision processes: A fluid optimization approach. Oper. Res. 64(6):1537–1555.Link, Google Scholar
- (1994) A technique for speeding up the solution of the Lagrangian dual. Math. Programming 63(1–3):23–45.Crossref, Google Scholar
- (2012) An online actor–critic algorithm with function approximation for constrained Markov decision processes. J. Optim. Theory Appl. 153(3):688–708.Crossref, Google Scholar
- (2005) An actor-critic algorithm for constrained Markov decision processes. Systems Control Lett. 54(3):207–213.Crossref, Google Scholar
- (2016) Budget allocation using weakly coupled, constrained Markov decision processes. Proc. 32nd Conf. Uncertainty Artificial Intelligence (AUAI Press, Jersey City, NJ).Google Scholar
- (2004) Convex Optimization (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- (2020) Index policies and performance bounds for dynamic selection problems. Management Sci. 66(7):3029–3050.Link, Google Scholar
- (2015) Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, vol. 8 (Now Publishers Inc., Hanover, MA), 231–357.Crossref, Google Scholar
- (2014) Efficient algorithms for budget-constrained Markov decision processes. IEEE Trans. Automatic Control 59(10):2813–2817.Crossref, Google Scholar
- (2016) Stochastic primal-dual methods and sample complexity of reinforcement learning. Preprint, submitted December 8, https://arxiv.org/abs/1612.02516.Google Scholar
- (2020) A survey on skill-based routing with applications to service operations management. Queueing Systems 96:53–82.Crossref, Google Scholar
- (2017) Risk-constrained reinforcement learning with percentile risk criteria. J. Machine Learn. Res. 18(1):6070–6120.Google Scholar
- (2018) A Lyapunov-based approach to safe reinforcement learning. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates Inc., Red Hook, NY), 8092–8101.Google Scholar
- (2020) Queueing network controls via deep reinforcement learning. Preprint, submitted July 31, https://arxiv.org/abs/2008.01644.Google Scholar
- (2008) Asymptotic optimality of maximum pressure policies in stochastic processing networks. Ann. Appl. Probab. 18(6):2239–2299.Crossref, Google Scholar
- (2019) Inpatient overflow: An approximate dynamic programming approach. Manufacturing Service Oper. Management 21(4):894–911.Link, Google Scholar
- (2003) The linear programming approach to approximate dynamic programming. Oper. Res. 51(6):850–865.Link, Google Scholar
- (2022) Policy gradient primal-dual mirror descent for constrained MDPs with large state spaces. 2022 IEEE 61st Conf. Decision Control (IEEE, Piscataway, NJ), 4892–4897.Google Scholar
- (2019) Off-service placement in inpatient ward network: Resource pooling versus service slowdown. Working paper, Columbia Business School, New York.Google Scholar
- (2018) Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Internat. Conf. Machine Learn. (PMLR, New York), 1339–1348.Google Scholar
- (2020) A theoretical analysis of deep q-learning. Learn. Dynam. Control (PMLR, New York), 486–489.Google Scholar
- (2016) Regularized policy iteration with nonparametric function spaces. J. Machine Learn. Res. 17(1):4809–4874.Google Scholar
- (2019) Reinforcement learning for multi-objective and constrained Markov decision processes. Preprint, submitted January 23, https://arxiv.org/abs/1901.08978.Google Scholar
- (2022) Can deep reinforcement learning improve inventory management? Performance on lost sales, dual-sourcing, and multi-echelon problems. Manufacturing Service Oper. Management 24(3):1349–1368.Link, Google Scholar
- (2023) Two-class constrained optimization with applications to queueing control. Naval Res. Logist. 75(3):397–422.Crossref, Google Scholar
- (2010) Understanding the difficulty of training deep feedforward neural networks. Proc. 13th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 249–256.Google Scholar
- (2021) Mean-field multi-agent reinforcement learning: A decentralized network approach. Preprint, submitted August 5, https://arxiv.org/abs/2108.02731.Google Scholar
- (2022) A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents Multi-Agent Systems 36(1):1–59.Crossref, Google Scholar
- (1963) Optimality of (s, s) policies in the infinite horizon dynamic inventory problem. Management Sci. 9(2):259–267.Link, Google Scholar
- (2002) Approximately optimal approximate reinforcement learning. ICML, vol. 2 (Morgan Kaufmann Publishers Inc., San Francisco, CA), 267–274.Google Scholar
- (2014) Adam: A method for stochastic optimization. Preprint, submitted December 22, https://arxiv.org/abs/1412.6980.Google Scholar
- (1999) Actor-critic algorithms. Solla S, Leen T, Müller K, eds. Advances in Neural Information Processing Systems, vol. 12 (MIT Press, Cambridge, MA).Google Scholar
- (2023) Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Math. Programming 198(1):1059–1106.Crossref, Google Scholar
- (2019) Batch policy learning under constraints. Preprint, submitted March 20, https://arxiv.org/abs/1903.08738.Google Scholar
- (2015) Deep learning. Nature 521(7553):436–444.Crossref, Google Scholar
- (2019) Stochastic primal-dual Q-learning algorithm for discounted MDPs. 2019 Amer. Control Conf. (IEEE, Piscataway, NJ), 4897–4902.Google Scholar
- (2020) Revisiting approximate linear programming: Constraint-violation learning with applications to inventory control and energy storage. Management Sci. 66(4):1544–1562.Link, Google Scholar
- (2019) Neural proximal/trust region policy optimization attains globally optimal policy. Preprint, submitted June 25, https://arxiv.org/abs/1906.10306.Google Scholar
- (2004) Scheduling flexible servers with convex delay costs: Heavy-traffic optimality of the generalized cμ-rule. Oper. Res. 52(6):836–855.Link, Google Scholar
- (2001) The steering approach for multi-criteria reinforcement learning. Dietterich T, Becker S, Ghahramani Z, eds. Advances in Neural Information Processing Systems, vol. 14 (MIT Press, Cambridge, MA).Google Scholar
- (2019) Reinforcement learning with convex constraints. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates Inc., Red Hook, NY), 14093–14102.Google Scholar
- (2013) Playing Atari with deep reinforcement learning. Preprint, submitted December 19, https://arxiv.org/abs/1312.5602.Google Scholar
- (2008) Approximate and data-driven dynamic programming for queueing networks. Unpublished manuscript.Google Scholar
- (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(27):815–857.Google Scholar
- (2015) Relaxations of approximate linear programs for the real option management of commodity storage. Management Sci. 61(12):3054–3076.Link, Google Scholar
- (2005) Dynamic preferences in multi-criteria reinforcement learning. Proc. 22nd Internat. Conf. Machine Learn. (ACM, New York), 601–608.Google Scholar
- (2009) Subgradient methods for saddle-point problems. J. Optim. Theory Appl. 142(1):205–228.Crossref, Google Scholar
- (2011) Online fractional programming for Markov decision systems. Forty-ninth Annual Allerton Conf. Comm. Control Comput. (IEEE, Piscataway, NJ), 353–360.Google Scholar
- (2004) Prox-method with rate of convergence o (1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1):229–251.Crossref, Google Scholar
- (1999) Numerical Optimization (Springer, New York).Crossref, Google Scholar
- (2022) A deep Q-network for the beer game: Deep reinforcement learning for inventory optimization. Manufacturing Service Oper. Management 24(1):285–304.Link, Google Scholar
- (2020) Self-guided approximate linear programs. Preprint, submitted January 9, https://arxiv.org/abs/2001.02798.Google Scholar
- (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
- (2023) A practical end-to-end inventory management model with deep learning. Management Sci. 69(2):759–773.Link, Google Scholar
- (2022) Overcoming the long horizon barrier for sample-efficient reinforcement learning with latent low-rank structure. Preprint, submitted June 7, https://arxiv.org/abs/2206.03569.Google Scholar
- Scarf H (1960) The optimality of (S, s) policies in the dynamic inventory problem. Sheshinski E, Wiess Y, eds. Optimal Pricing, Inflation, and the Cost of Price Adjustment (The MIT Press, Cambridge, MA), 49–56.Google Scholar
- (1985) Generalized polynomial approximations in Markovian decision processes. J. Math. Anal. Appl. 110(2):568–582.Crossref, Google Scholar
- (1998) How to dynamically merge Markov decision processes. Jordan M, Kearns M, Solla S, eds. Advances in Neural Information Processing Systems, vol. 10 (MIT Press, Cambridge, MA), 1057–1063.Google Scholar
- (2018) Throughput optimal decentralized scheduling of multihop networks with end-to-end deadline constraints: Unreliable links. IEEE Trans. Automatic Control 64(1):127–142.Crossref, Google Scholar
- (1958) On general minimax theorems. Pacific J. Math. 8(1):171–176.Crossref, Google Scholar
- (2020) Capacity pooling in hospitals: The hidden consequences of off-service placement. Management Sci. 66(9):3825–3842.Link, Google Scholar
- (2014) Quadratic approximation of cost functions in lost sales and perishable inventory control problems. Unpublished manuscript.Google Scholar
- (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
- (2007) Managing power consumption and performance of computing systems using reinforcement learning. Platt J, Koller D, Singer Y, Roweis S, eds. Advances in Neural Information Processing Systems, vol. 20 (Curran Associates Inc., Red Hook, NY).Google Scholar
- (2018) Reward constrained policy optimization. Preprint, submitted May 28, https://arxiv.org/abs/1805.11074.Google Scholar
- (2022) Primal-dual stochastic mirror descent for mdps. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 9723–9740.Google Scholar
- (2009) Using Lagrangian relaxation to compute capacity-dependent bid prices in network revenue management. Oper. Res. 57(3):637–649.Link, Google Scholar
- (2006) Approximate dynamic programming methods for an inventory allocation problem under uncertainty. Naval Res. Logist. 53(8):822–841.Crossref, Google Scholar
- (2012) The multi-product newsvendor problem: Review, extensions, and directions for future research. Handbook of Newsvendor Problems (Springer, New York), 3–39.Crossref, Google Scholar
- (1997) A neuro-dynamic programming approach to retailer inventory management. Proc. 36th IEEE Conf. Decision Control, vol. 4 (IEEE, Piscataway, NJ), 4052–4057.Google Scholar
- (2005) Approximate dynamic programming for networks: Fluid models and constraint reduction. Unpublished manuscript.Google Scholar
- (1965) Computing optimal (s, s) inventory policies. Management Sci. 11(5):525–552.Link, Google Scholar
- (2019) A generalized algorithm for multi-objective reinforcement learning and policy adaptation. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates Inc., Red Hook, NY).Google Scholar
- (2017) Online convex optimization with stochastic constraints. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 30 (Curran Associates Inc., Red Hook, NY).Google Scholar
- (2022) Near-optimality for infinite-horizon restless bandits with many arms. Preprint, submitted March 29, https://arxiv.org/abs/2203.15853.Google Scholar
- (2020) Provable multi-objective reinforcement learning with generative models. Preprint, submitted November 19, https://arxiv.org/abs/2011.10134.Google Scholar

