Exploiting the Structural Properties of the Underlying Markov Decision Problem in the Q-Learning Algorithm

Published Online: https://doi.org/10.1287/ijoc.1070.0240
