Randomized Linear Programming Solves the Markov Decision Problem in Nearly Linear (Sometimes Sublinear) Time

Published Online:https://doi.org/10.1287/moor.2019.1000

References

  • [1] Azar MG, Munos R, Kappen HJ (2013) Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.CrossrefGoogle Scholar
  • [2] Bellman R (1957) Dynamic Programming (Princeton University Press, Princeton, NJ).Google Scholar
  • [3] Benveniste A, Métivier M, Priouret P (2012) Adaptive Algorithms and Stochastic Approximations, vol. 22 (Springer Science & Business Media, Berlin).Google Scholar
  • [4] Bertsekas DP (1995) Dynamic Programming and Optimal Control, vol. 1 (Athena Scientific, Belmont, MA).Google Scholar
  • [5] Bertsekas DP (2013) Abstract Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
  • [6] Bertsekas DP, Tsitsiklis JN (1995) Neuro-dynamic programming: An overview. Proc. 34th IEEE Conf. Decision Control, vol. 1 (Institute of Electrical and Electronics Engineers, Piscataway, NJ), 560–564.Google Scholar
  • [7] Borkar VS (2008) Stochastic Approximation: A Dynamical Systems Viewpoint (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [8] Bringmann K, Panagiotou K (2012) Efficient sampling methods for discrete distributions. Czumaj A, Mehlhorn K, Pitts A, Wattenhofer R, eds. Automata, Languages, and Programming—ICALP 2012, Lecture Notes in Computer Science, vol. 7391 (Springer, Berlin, Heidelberg), 133–144.CrossrefGoogle Scholar
  • [9] Chen Y, Wang M (2016) Stochastic primal–dual methods and sample complexity of reinforcement learning. Preprint, submitted December 8, https://arxiv.org/abs/1612.02516.Google Scholar
  • [10] Chen Y, Wang M (2017) Lower bound on the computational complexity of discounted Markov decision problems. Preprint, submitted May 20, https://arxiv.org/abs/1705.07312.Google Scholar
  • [11] Clarkson KL, Hazan E, Woodruff DP (2012) Sublinear optimization for machine learning. J. ACM 59(5):23.CrossrefGoogle Scholar
  • [12] Cover TM, Thomas JA (2012) Elements of Information Theory (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • [13] Dantzig G (2016) Linear Programming and Extensions (Princeton University Press, Princeton, NJ).Google Scholar
  • [14] de Farias DP, Van Roy B (2003) The linear programming approach to approximate dynamic programming. Oper. Res. 51(6):850–865.LinkGoogle Scholar
  • [15] De Ghellinck G (1960) Les problemes de decisions sequentielles. Cahiers du Centre d’Études de Recherche Opérationnelle 2(2):161–179.Google Scholar
  • [16] d’Epenoux F (1963) A probabilistic production and inventory problem. Management Sci. 10(1):98–108.LinkGoogle Scholar
  • [17] Feinberg EA, Huang J (2014) The value iteration algorithm is not strongly polynomial for discounted dynamic programming. Oper. Res. Lett. 42(2):130–131.CrossrefGoogle Scholar
  • [18] Howard RA (1960) Dynamic Programming and Markov Processes (MIT Press, Cambridge, MA).Google Scholar
  • [19] Juditsky A, Nemirovski A, Tauvel C (2011) Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems 1(1):17–58.LinkGoogle Scholar
  • [20] Kakade SM (2003) On the sample complexity of reinforcement learning. Unpublished PhD thesis, University of London, London.Google Scholar
  • [21] Kearns MJ, Singh SP (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Kearns MJ, Solla SA, Cohn DA, eds. Advances in Neural Information Processing Systems, vol. 11 (MIT Press, Cambridge, MA), 996–1002.Google Scholar
  • [22] Kivinen J, Warmuth MK (1997) Exponentiated gradient vs. gradient descent for linear predictors. Inform. Comput. 132(1):1–63.CrossrefGoogle Scholar
  • [23] Kushner HJ, Yin G (2003) Stochastic Approximation and Recursive Algorithms and Applications (Springer, Berlin).Google Scholar
  • [24] Lee YT, Sidford A (2014) Path finding methods for linear programming: Solving linear programs in Õ(vrank) iterations and faster algorithms for maximum flow. Proc. IEEE 55th Annual Sympos. Foundations Comput. Sci. (Institute of Electrical and Electronics Engineers, Piscataway, NJ), 424–433.Google Scholar
  • [25] Lee YT, Sidford A (2015) Efficient inverse maintenance and faster algorithms for linear programming. Proc. IEEE 56th Annual Sympos. Foundations Comput. Sci. (Institute of Electrical and Electronics Engineers, Piscataway, NJ), 230–249.CrossrefGoogle Scholar
  • [26] Littlestone N (1988) Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learn. 2(4):285–318.CrossrefGoogle Scholar
  • [27] LittmanML Dean TL, Kaelbling LP (1995) On the complexity of solving Markov decision problems. Proc. 11th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers, Burlington, MA), 394–402.Google Scholar
  • [28] Mansour Y, Singh S (1999) On the complexity of policy iteration. Proc. 15th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers, Burlington, MA), 401–408.Google Scholar
  • [29] Nemirovski A, Rubinstein RY (2005) An efficient stochastic approximation algorithm for stochastic saddle point problems. Modeling Uncertainty (Springer, Berlin), 156–184.Google Scholar
  • [30] Nemirovski A, Juditsky A, Lan G, Shapiro A (2009) Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4):1574–1609.CrossrefGoogle Scholar
  • [31] Puterman LM (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (Wiley, Hoboken, NJ).Google Scholar
  • [32] Scherrer B (2013) Improved and generalized upper bounds on the complexity of policy iteration. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Red Hook, NY), 386–394.Google Scholar
  • [33] Sidford A, Wang M, Wu X, Ye Y (2018) Variance reduced value iteration and faster algorithms for solving Markov decision processes. Proc. 29th Annual ACM-SIAM Sympos. Discrete Algorithms (Society for Industrial and Applied Mathematics, Philadelphia), 770–787.CrossrefGoogle Scholar
  • [34] Sidford A, Wang M, Wu X, Yang LF, Ye Y (2018) Near-optimal time and sample complexities for solving discounted Markov decision process with a generative model. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Red Hook, NY), 5192–5202.Google Scholar
  • [35] Tseng P (1990) Solving h-horizon, stationary Markov decision problems in time proportional to log (h). Oper. Res. Lett. 9(5):287–297.CrossrefGoogle Scholar
  • [36] Wang M, Chen Y (2016) An online primal–dual method for discounted Markov decision processes. Proc. IEEE Conf. Decisions Control (Institute of Electrical and Electronics Engineers, Piscataway, NJ), 4516–4521.CrossrefGoogle Scholar
  • [37] Wong CK, Easton MC (1980) An efficient method for weighted sampling without replacement. SIAM J. Comput. 9(1):111–113.CrossrefGoogle Scholar
  • [38] Ye Y (2005) A new complexity result on solving the Markov decision problem. Math. Oper. Res. 30(3):733–749.LinkGoogle Scholar
  • [39] Ye Y (2011) The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math. Oper. Res. 36(4):593–603.LinkGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.