A First-Order Approach to Accelerated Value Iteration

Vineet Goyal
Vineet Goyal
[email protected]
https://orcid.org/0000-0001-6719-3212
Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027;
Search for more papers by this author
,
Julien Grand-Clément
Corresponding Author
Julien Grand-Clément
[email protected]
https://orcid.org/0000-0002-2864-8779
Department of Information Systems and Operations Management, Ecole des hautes études commerciales (HEC) de Paris, 78350 Jouy-en-Josas, France
Search for more papers by this author

Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027;

Corresponding Author

Julien Grand-Clément

Department of Information Systems and Operations Management, Ecole des hautes études commerciales (HEC) de Paris, 78350 Jouy-en-Josas, France

Search for more papers by this author

Published Online:24 Mar 2022https://doi.org/10.1287/opre.2022.2269

References

Akian M, Gaubert S, Grand-Clément J, Guillaud J (2019) The operator approach to entropy games. Theory Comput. Systems 63(5):1089–1130.Crossref, Google Scholar
Archibald TW, McKinnon KIM, Thomas LC (1995) On the generation of Markov decision processes. J. Oper. Res. Soc. 46(3):354–361.Crossref, Google Scholar
Asadi K, Littman ML (2017) An alternative softmax operator for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR), 243–252.Google Scholar
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1):183–202.Crossref, Google Scholar
Blondel VD, Nesterov Y (2005) Computationally efficient approximations of the joint spectral radius. SIAM J. Matrix Anal. Appl. 27(1):256–272.Crossref, Google Scholar
Boucherie RJ, Van Dijk NM, eds. (2017) Markov Decision Processes in Practice (Springer, Cham, Switzerland).Crossref, Google Scholar
Brémaud P (2013) Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, vol. 31 (Springer Science & Business Media, New York).Google Scholar
Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI gym. Preprint, submitted June 5, https://arxiv.org/abs/1606.01540.Google Scholar
Cogill R, Peng C (2013) Reversible Markov decision processes with an average-reward criterion. SIAM J. Control Optim. 51(1):402–418.Crossref, Google Scholar
Cordwell S, Gonzalez Y, Tulabandhula T (2015) Markov Decision Process (MDP) toolbox for Python. Accessed January 4, 2022, https://github.com/sawcordwell/pymdptoolbox.Google Scholar
Cvetković A, Protasov VY (2020) The greedy strategy for optimizing the Perron eigenvalue. Math. Programming, 10.1007/s10107-020-01585-z.Crossref, Google Scholar
de Farias DP, Van Roy B (2003) The linear programming approach to approximate dynamic programming. Oper. Res. 51(6):850–865.Link, Google Scholar
Filar JA, Tolwinski B (1991) On the algorithm of Pollatschek and Avi-ltzhak. Raghaven TES, Ferguson TS, Parthasarathy T, Vrieze OJ, eds. Stochastic Games and Related Topics: In Honor of Professor L. S. Shapley (Springer, Cham, Switzerland), 59–70.Crossref, Google Scholar
Geist M, Scherrer B (2018) Anderson acceleration for reinforcement learning. Preprint, submitted September 25, https://arxiv.org/abs/1809.09501.Google Scholar
Ghadimi E, Feyzmahdavian HR, Johansson M (2015) Global convergence of the heavy-ball method for convex optimization. 2015 Eur. Control Conf. (ECC) (IEEE), 310–315.Google Scholar
Goh J, Bayati M, Stefanos A Zenios SS, Moore D (2018) Data uncertainty in Markov chains: Application to cost-effectiveness analyses of medical innovations. Oper. Res. 66(3):697–715.Link, Google Scholar
Goldstein T, O’Donoghue B, Setzer S, Baraniuk R (2014) Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3):1588–1623.Crossref, Google Scholar
Goyal V, Grand-Clément J (2018) Robust Markov decision process: Beyond rectangularity. Preprint, submitted November 6, https://arxiv.org/abs/1811.00215v3.Google Scholar
Grand-Clément J (2021) From convex optimization to MDPs: A review of first-order, second-order and quasi-Newton methods for MDPs. Preprint, submitted November 25, https://arxiv.org/abs/2104.10677.Google Scholar
Grand-Clément J, Chan CW, Goyal V, Escobar G (2020) Robust policies for proactive ICU transfers. Preprint, submitted February 14, https://arxiv.org/abs/2002.06247v1.Google Scholar
Herzberg M, Yechiali U (1996) A k-step look-ahead analysis of value iteration algorithms for Markov decision processes. Eur. J. Oper. Res. 88(3):622–636.Crossref, Google Scholar
Ho CP, Petrik M, Wiesemann W (2018) Fast Bellman updates for robust MDPs. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. (ICML), vol. 80 (PMLR), 1979–1988.Google Scholar
Iutzeler F, Hendrickx JM (2019) A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optim. Methods Software 34(2):383–405.Crossref, Google Scholar
Iyengar G (2005) Robust dynamic programming. Math. Oper. Res. 30(2):257–280.Link, Google Scholar
Jian Q, Fruit R, Pirotta M, Lazaric A (2019) Exploration bonus for regret minimization in discrete and continuous average-reward MDPs. Wallach H, Larochelle H, Beygelzimer A, d’ Alché-Buc F, Fox E, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 32 (Curran Associates, Inc.), 4890–4899.Google Scholar
Jungers R (2009) The Joint Spectral Radius: Theory and Applications, vol. 385 (Springer Science & Business Media, New York).Crossref, Google Scholar
Krasnoselskii M (1955) Two observations about the method of successive approximations. Uspekhi Mat. Nauk 10:123–127.Google Scholar
Kushner H, Kleinman A (1971) Accelerated procedures for the solution of discrete Markov control problems. IEEE Trans. Automatic Control 16(2):147–152.Crossref, Google Scholar
Lee JM (2013) Introduction to Smooth Manifolds (Springer, Cham, Switzerland).Google Scholar
Nesterov Y (1983) A method for solving the convex programming problem with convergence rate O(1/k2). Dokl. Akad. Nauk SSSR 269:543–547.Google Scholar
Nesterov Y (2013) Introductory Lectures on Convex Optimization: A Basic Course, vol. 87 (Springer Science & Business Media, New York).Google Scholar
O’Donoghue B, Candes E (2015) Adaptive restart for accelerated gradient schemes. Foundations Comput. Math. 15(3):715–732.Crossref, Google Scholar
O’Donoghue B, Munos R, Kavukcuoglu K, Mnih V (2016) Combining policy gradient and Q-learning. arXiv preprint arXiv: 1611.01626.Google Scholar
Petrik M (2010) Optimization-Based Approximate Dynamic Programming (University of Massachusetts, Amherst, MA).Google Scholar
Piot B, Geist M, Pietquin O (2016) Difference of convex functions programming applied to control with expert data. Preprint, submitted November 5, https://arxiv.org/abs/1611.01626v1.Google Scholar
Polyak BT (1964) Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5):1–17.Crossref, Google Scholar
Porteus EL, Totten JC (1978) Accelerated computation of the expected discounted return in a Markov chain. Oper. Res. 26(2):350–358.Link, Google Scholar
Possingham H, Tuck G (1997) Application of stochastic dynamic programming to optimal fire management of a spatially structured threatened species. Proc. Internat. Congress Model. Simulation, MODSIM, vol. 2, 813–817.Google Scholar
Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, New York).Google Scholar
Puterman ML, Brumelle SL (1979) On the convergence of policy iteration in stationary dynamic programming. Math. Oper. Res. 4(1):60–69.Link, Google Scholar
Scherrer B (2016) Improved and generalized upper bounds on the complexity of policy iteration. Math. Oper. Res. 41(3):758–774.Link, Google Scholar
Scherrer B, Ghavamzadeh M, Gabillon V, Lesner B, Geist M (2015) Approximate modified policy iteration and its application to the game of Tetris. J. Machine Learn. Res. 16(49):1629–1676.Google Scholar
Shlakhter O, Lee C-G, Khmelev D, Jaber N (2010) Acceleration operators in the value iteration algorithms for Markov decision processes. Oper. Res. 58(1):193–202.Link, Google Scholar
Sidford A, Wang M, Wu X, Ye Y (2018) Variance reduced value iteration and faster algorithms for solving Markov decision processes. Proc. Twenty-Ninth Annual ACM-SIAM Sympos. Discrete Algorithms (SIAM), 770–787.Google Scholar
Tang Y, Rowland M, Munos R, Valko M (2021) Taylor expansion of discount factors. Preprint, submitted June 14, https://arxiv.org/abs/2106.06170.Google Scholar
Tarbouriech J, Lazaric A (2019) Active exploration in Markov decision processes. Chaudhuri K, Masashi S, eds. Proc. 22nd Internat. Conf. Artificial Intelligence Statist., vol. 89 (PMLR), 974–982.Google Scholar
Tetali P (1991) Random walks and the effective resistance of networks. J. Theoret. Probab. 4(1):101–109.Crossref, Google Scholar
Thodoroff P, Durand A, Pineau J, Precup D (2018) Temporal regularization for Markov decision process. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 31 (Curran Associates, Inc.), 1779–1789.Google Scholar
Tsitsiklis JN, Blondel VD (1997) The Lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate. Math. Control Signals Systems 10(1):31–40.Crossref, Google Scholar
Valentine FA (1945) A Lipschitz condition preserving extension for a vector function. Amer. J. Math. 67(1):83–93.Crossref, Google Scholar
Vieillard N, Pietquin O, Geist M (2019) On connections between constrained optimization and reinforcement learning. Adv. Neural Inform. Processing Systems 2019—Optim. Foundations Reinforcement Learn. Workshop.Google Scholar
Wainwright MJ (2019) Stochastic approximation with cone-contractive operators: Sharp L∞-bounds for Q-learning. Preprint, submitted June 24, https://arxiv.org/abs/1905.06265.Google Scholar
Wiesemann W, Kuhn D, Rustem B (2013) Robust Markov decision processes. Oper. Res. 38(1):153–183.Abstract, Google Scholar
Ye Y (2011) The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math. Oper. Res. 36(4):593–603.Link, Google Scholar
Zhang J, O’Donoghue B, Boyd S (2020) Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM J. Optim. 30(4):3170–3197.Crossref, Google Scholar

Volume 71, Issue 2

March-April 2023

Pages iii-vi, 397-790, C2-C3

Article Information

Metrics

Information

Received:June 30, 2020
Accepted:December 30, 2021
Published Online:March 24, 2022

Cite as

Vineet Goyal, Julien Grand-Clément (2022) A First-Order Approach to Accelerated Value Iteration. Operations Research 71(2):517-535.

https://doi.org/10.1287/opre.2022.2269

Keywords

Acknowledgments

The authors thank an anonymous associate editor and three referees for the detailed comments that helped improve the manuscript.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

A First-Order Approach to Accelerated Value Iteration

References

Volume 71, Issue 2

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News