A First-Order Approach to Accelerated Value Iteration

Published Online:https://doi.org/10.1287/opre.2022.2269

References

  • Akian M, Gaubert S, Grand-Clément J, Guillaud J (2019) The operator approach to entropy games. Theory Comput. Systems 63(5):1089–1130.CrossrefGoogle Scholar
  • Archibald TW, McKinnon KIM, Thomas LC (1995) On the generation of Markov decision processes. J. Oper. Res. Soc. 46(3):354–361.CrossrefGoogle Scholar
  • Asadi K, Littman ML (2017) An alternative softmax operator for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR), 243–252.Google Scholar
  • Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1):183–202.CrossrefGoogle Scholar
  • Blondel VD, Nesterov Y (2005) Computationally efficient approximations of the joint spectral radius. SIAM J. Matrix Anal. Appl. 27(1):256–272.CrossrefGoogle Scholar
  • Boucherie RJ, Van Dijk NM, eds. (2017) Markov Decision Processes in Practice (Springer, Cham, Switzerland).CrossrefGoogle Scholar
  • Brémaud P (2013) Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, vol. 31 (Springer Science & Business Media, New York).Google Scholar
  • Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI gym. Preprint, submitted June 5, https://arxiv.org/abs/1606.01540.Google Scholar
  • Cogill R, Peng C (2013) Reversible Markov decision processes with an average-reward criterion. SIAM J. Control Optim. 51(1):402–418.CrossrefGoogle Scholar
  • Cordwell S, Gonzalez Y, Tulabandhula T (2015) Markov Decision Process (MDP) toolbox for Python. Accessed January 4, 2022, https://github.com/sawcordwell/pymdptoolbox.Google Scholar
  • Cvetković A, Protasov VY (2020) The greedy strategy for optimizing the Perron eigenvalue. Math. Programming, 10.1007/s10107-020-01585-z.CrossrefGoogle Scholar
  • de Farias DP, Van Roy B (2003) The linear programming approach to approximate dynamic programming. Oper. Res. 51(6):850–865.LinkGoogle Scholar
  • Filar JA, Tolwinski B (1991) On the algorithm of Pollatschek and Avi-ltzhak. Raghaven TES, Ferguson TS, Parthasarathy T, Vrieze OJ, eds. Stochastic Games and Related Topics: In Honor of Professor L. S. Shapley (Springer, Cham, Switzerland), 59–70.CrossrefGoogle Scholar
  • Geist M, Scherrer B (2018) Anderson acceleration for reinforcement learning. Preprint, submitted September 25, https://arxiv.org/abs/1809.09501.Google Scholar
  • Ghadimi E, Feyzmahdavian HR, Johansson M (2015) Global convergence of the heavy-ball method for convex optimization. 2015 Eur. Control Conf. (ECC) (IEEE), 310–315.Google Scholar
  • Goh J, Bayati M, Stefanos A Zenios SS, Moore D (2018) Data uncertainty in Markov chains: Application to cost-effectiveness analyses of medical innovations. Oper. Res. 66(3):697–715.LinkGoogle Scholar
  • Goldstein T, O’Donoghue B, Setzer S, Baraniuk R (2014) Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3):1588–1623.CrossrefGoogle Scholar
  • Goyal V, Grand-Clément J (2018) Robust Markov decision process: Beyond rectangularity. Preprint, submitted November 6, https://arxiv.org/abs/1811.00215v3.Google Scholar
  • Grand-Clément J (2021) From convex optimization to MDPs: A review of first-order, second-order and quasi-Newton methods for MDPs. Preprint, submitted November 25, https://arxiv.org/abs/2104.10677.Google Scholar
  • Grand-Clément J, Chan CW, Goyal V, Escobar G (2020) Robust policies for proactive ICU transfers. Preprint, submitted February 14, https://arxiv.org/abs/2002.06247v1.Google Scholar
  • Herzberg M, Yechiali U (1996) A k-step look-ahead analysis of value iteration algorithms for Markov decision processes. Eur. J. Oper. Res. 88(3):622–636.CrossrefGoogle Scholar
  • Ho CP, Petrik M, Wiesemann W (2018) Fast Bellman updates for robust MDPs. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. (ICML), vol. 80 (PMLR), 1979–1988.Google Scholar
  • Iutzeler F, Hendrickx JM (2019) A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optim. Methods Software 34(2):383–405.CrossrefGoogle Scholar
  • Iyengar G (2005) Robust dynamic programming. Math. Oper. Res. 30(2):257–280.LinkGoogle Scholar
  • Jian Q, Fruit R, Pirotta M, Lazaric A (2019) Exploration bonus for regret minimization in discrete and continuous average-reward MDPs. Wallach H, Larochelle H, Beygelzimer A, d’ Alché-Buc F, Fox E, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 32 (Curran Associates, Inc.), 4890–4899.Google Scholar
  • Jungers R (2009) The Joint Spectral Radius: Theory and Applications, vol. 385 (Springer Science & Business Media, New York).CrossrefGoogle Scholar
  • Krasnoselskii M (1955) Two observations about the method of successive approximations. Uspekhi Mat. Nauk 10:123–127.Google Scholar
  • Kushner H, Kleinman A (1971) Accelerated procedures for the solution of discrete Markov control problems. IEEE Trans. Automatic Control 16(2):147–152.CrossrefGoogle Scholar
  • Lee JM (2013) Introduction to Smooth Manifolds (Springer, Cham, Switzerland).Google Scholar
  • Nesterov Y (1983) A method for solving the convex programming problem with convergence rate O(1/k2). Dokl. Akad. Nauk SSSR 269:543–547.Google Scholar
  • Nesterov Y (2013) Introductory Lectures on Convex Optimization: A Basic Course, vol. 87 (Springer Science & Business Media, New York).Google Scholar
  • O’Donoghue B, Candes E (2015) Adaptive restart for accelerated gradient schemes. Foundations Comput. Math. 15(3):715–732.CrossrefGoogle Scholar
  • O’Donoghue B, Munos R, Kavukcuoglu K, Mnih V (2016) Combining policy gradient and Q-learning. arXiv preprint arXiv: 1611.01626.Google Scholar
  • Petrik M (2010) Optimization-Based Approximate Dynamic Programming (University of Massachusetts, Amherst, MA).Google Scholar
  • Piot B, Geist M, Pietquin O (2016) Difference of convex functions programming applied to control with expert data. Preprint, submitted November 5, https://arxiv.org/abs/1611.01626v1.Google Scholar
  • Polyak BT (1964) Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5):1–17.CrossrefGoogle Scholar
  • Porteus EL, Totten JC (1978) Accelerated computation of the expected discounted return in a Markov chain. Oper. Res. 26(2):350–358.LinkGoogle Scholar
  • Possingham H, Tuck G (1997) Application of stochastic dynamic programming to optimal fire management of a spatially structured threatened species. Proc. Internat. Congress Model. Simulation, MODSIM, vol. 2, 813–817.Google Scholar
  • Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, New York).Google Scholar
  • Puterman ML, Brumelle SL (1979) On the convergence of policy iteration in stationary dynamic programming. Math. Oper. Res. 4(1):60–69.LinkGoogle Scholar
  • Scherrer B (2016) Improved and generalized upper bounds on the complexity of policy iteration. Math. Oper. Res. 41(3):758–774.LinkGoogle Scholar
  • Scherrer B, Ghavamzadeh M, Gabillon V, Lesner B, Geist M (2015) Approximate modified policy iteration and its application to the game of Tetris. J. Machine Learn. Res. 16(49):1629–1676.Google Scholar
  • Shlakhter O, Lee C-G, Khmelev D, Jaber N (2010) Acceleration operators in the value iteration algorithms for Markov decision processes. Oper. Res. 58(1):193–202.LinkGoogle Scholar
  • Sidford A, Wang M, Wu X, Ye Y (2018) Variance reduced value iteration and faster algorithms for solving Markov decision processes. Proc. Twenty-Ninth Annual ACM-SIAM Sympos. Discrete Algorithms (SIAM), 770–787.Google Scholar
  • Tang Y, Rowland M, Munos R, Valko M (2021) Taylor expansion of discount factors. Preprint, submitted June 14, https://arxiv.org/abs/2106.06170.Google Scholar
  • Tarbouriech J, Lazaric A (2019) Active exploration in Markov decision processes. Chaudhuri K, Masashi S, eds. Proc. 22nd Internat. Conf. Artificial Intelligence Statist., vol. 89 (PMLR), 974–982.Google Scholar
  • Tetali P (1991) Random walks and the effective resistance of networks. J. Theoret. Probab. 4(1):101–109.CrossrefGoogle Scholar
  • Thodoroff P, Durand A, Pineau J, Precup D (2018) Temporal regularization for Markov decision process. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 31 (Curran Associates, Inc.), 1779–1789.Google Scholar
  • Tsitsiklis JN, Blondel VD (1997) The Lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate. Math. Control Signals Systems 10(1):31–40.CrossrefGoogle Scholar
  • Valentine FA (1945) A Lipschitz condition preserving extension for a vector function. Amer. J. Math. 67(1):83–93.CrossrefGoogle Scholar
  • Vieillard N, Pietquin O, Geist M (2019) On connections between constrained optimization and reinforcement learning. Adv. Neural Inform. Processing Systems 2019—Optim. Foundations Reinforcement Learn. Workshop.Google Scholar
  • Wainwright MJ (2019) Stochastic approximation with cone-contractive operators: Sharp L∞-bounds for Q-learning. Preprint, submitted June 24, https://arxiv.org/abs/1905.06265.Google Scholar
  • Wiesemann W, Kuhn D, Rustem B (2013) Robust Markov decision processes. Oper. Res. 38(1):153–183.AbstractGoogle Scholar
  • Ye Y (2011) The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math. Oper. Res. 36(4):593–603.LinkGoogle Scholar
  • Zhang J, O’Donoghue B, Boyd S (2020) Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM J. Optim. 30(4):3170–3197.CrossrefGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.