Learning in Structured MDPs with Convex Cost Functions: Improved Regret Bounds for Inventory Management

Shipra Agrawal
[email protected]
https://orcid.org/0000-0003-4486-3871
Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027
Randy Jia
[email protected]
https://orcid.org/0000-0002-7101-9572
Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027

Published Online: 25 Mar 2022
https://doi.org/10.1287/opre.2022.2263

References

Agarwal A, Foster DP, Hsu DJ, Kakade SM, Rakhlin A (2011) Stochastic convex optimization with bandit feedback. Taylor JS, Zemel RS, Bartlett PL, Pereira FCN, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS 2011), Granada, Spain, 1035–1043.
Agrawal S, Jia R (2017) Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems 30 (NIPS 2017, Long Beach, CA), 1184–1194.
Bartlett PL, Tewari A (2009) REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Bilmes JA, Ng AY, eds. Proc. Twenty-Fifth Conf. Uncertainty Artificial Intelligence, Montreal, QC, Canada (AUAI Press, Arlington, VA), 35–42.
Bartók G, Foster DP, Pál D, Rakhlin A, Szepesvári C (2014) Partial monitoring—classification, regret bounds, and algorithms. Math. Oper. Res. 39(4):967–997.
Besbes O, Muharremoglu A (2013) On implications of demand censoring in the newsvendor problem. Management Sci. 59(6):1407–1424.
Besbes O, Gur Y, Zeevi A (2015) Non-stationary stochastic optimization. Oper. Res. 63(5):1227–1244.
Bijvank M, Vis IF (2011) Lost-sales inventory theory: A review. Eur. J. Oper. Res. 215(1):1–13.
Huh WT, Rusmevichientong P (2009) A nonparametric asymptotic analysis of inventory planning with censored demand. Math. Oper. Res. 34(1):103–123.
Huh WT, Janakiraman G, Muckstadt JA, Rusmevichientong P (2009a) An adaptive algorithm for finding the optimal base-stock policy in lost sales inventory systems with censored demand. Math. Oper. Res. 34(2):397–416.
Huh WT, Janakiraman G, Muckstadt JA, Rusmevichientong P (2009b) Asymptotic optimality of order-up-to policies in lost sales inventory systems. Management Sci. 55(3):404–420.
Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(Apr):1563–1600.
Janakiraman G, Roundy RO (2004) Lost-sales problems with stochastic lead times: Convexity results for base-stock policies. Oper. Res. 52(5):795–803.
Lee HL, Cohen MA (1983) A note on the convexity of performance measures of M/M/c queueing systems. J. Appl. Probab. 20(4):920–923.
Lugosi G, Markakis MG, Neu G (2017) On the hardness of inventory management with censored demand data. Preprint, submitted October 16, https://arxiv.org/abs/1710.05739.
Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).
Shanthikumar JG, Yao DD (1987) Optimal server allocation in a system of multi-server stations. Management Sci. 33(9):1173–1180.
Tewari A, Bartlett PL (2008) Optimistic linear programming gives logarithmic regret for irreducible MDPs. Platt JC, Koller D, Singer Y, Roweis ST, eds. Proc. Twenty-First Annual Conf. Adv. Neural Inform. Processing Systems (NIPS 2007, Vancouver, British Columbia, Canada) (Curran Associates, Inc.), 1505–1512.
Weber RR (1980) Note—On the marginal benefit of adding servers to G/GI/m queues. Management Sci. 26(9):946–951.
Zhang H, Chao X, Shi C (2020) Closing the gap: A learning algorithm for the lost-sales inventory system with lead times. Management Sci. 66(5):1962–1980.
Zipkin P (2000) Foundations of Inventory Management (McGraw-Hill, Boston).
Zipkin P (2008) Old and new methods for lost-sales inventory systems. Oper. Res. 56(5):1256–1263.

Volume 70, Issue 3

May-June 2022

Pages iii-viii, 1293-1952, C2-C3

Article Information

Received: July 31, 2019
Accepted: December 21, 2021
Published Online: March 25, 2022

Cite as

Shipra Agrawal, Randy Jia (2022) Learning in Structured MDPs with Convex Cost Functions: Improved Regret Bounds for Inventory Management. Operations Research 70(3):1646-1664.

https://doi.org/10.1287/opre.2022.2263
