Performance Loss Bounds for Approximate Value Iteration with State Aggregation

Published Online:https://doi.org/10.1287/moor.1060.0188

References

  • Axsäter S. State aggregation in dynamic programming: An application to scheduling of independent jobs on parallel processors. Oper. Res. Lett. (1983) 2:171–176CrossrefGoogle Scholar
  • Barraquand J., Martineau D. Numerical valuation of high dimensional multivariate American securities. J. Financial Quant. Anal. (1997) 30(3):383–405CrossrefGoogle Scholar
  • Bean J. C., Birge J. R., Smith R. L. Aggregation in dynamic programming. Oper. Res. (1987) 35:215–220LinkGoogle Scholar
  • Bertsekas D. P. Convergence of discretization procedures in dynamic programming. IEEE Trans. Automat. Control (1975) 20:415–419CrossrefGoogle Scholar
  • Bertsekas D. P. A counterexample to temporal-difference learning. Neural Comput. (1994) 7:270–279CrossrefGoogle Scholar
  • Bertsekas D. P., Castañon D. A. Adaptive aggregation for infinite horizon dynamic programming. IEEE Trans. Automat. Control (1989) 34(6):589–598CrossrefGoogle Scholar
  • Bertsekas D. P., Ioffe S. Temporal differences-based policy iteration and applications in neuro-dynamic programming. (1996) . Technical report LIDS–P–2349, MIT Laboratory for Information and Decision Systems, Cambridge, MAGoogle Scholar
  • Bertsekas D. P., Shreve S. E.Stochastic Optimal Control: The Discrete Time Case (1996) (Athena Scientific, Belmont, MA) Google Scholar
  • Bertsekas D. P., Tsitsiklis J. N.Neuro-Dynamic Programming (1996) (Athena Scientific, Belmont, MA) Google Scholar
  • Bertsekas D. P., Borkar V., Nedic A., Si J., Barto A. G., Powell W. B., Wunsch D. C. Improved temporal difference methods with linear function approximation. Handbook of Learning and Approximate Dynamic Programming (2004) (IEEE Press and John Wiley & Sons, Boston, MA) Google Scholar
  • Birge J. R. Aggregation bounds in stochastic linear programming. Math. Programming (1985) 31:25–41CrossrefGoogle Scholar
  • Blackwell D. Discrete dynamic programming. Ann. Math. Statist. (1962) 33:719–726CrossrefGoogle Scholar
  • Boyan J. A., Bratko I., Dzeroski S. Least-squares temporal difference learning. Machine Learning: Proc. 16th Internat. Conf. (ICML) (1999) Cambridge, MAGoogle Scholar
  • Boyan J. A. Technical update: Least-squares temporal difference learning. Machine Learning (2002) 49:2–3CrossrefGoogle Scholar
  • Bradtke S. J., Barto A. G. Linear least-squares algorithms for temporal-difference learning. Machine Learning (1996) 33–57Google Scholar
  • Choi D. S., Van Roy B. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Machine Learning: Proc. 18th Internat. Conf. (2001) Palo Alto, CA(ICML)Google Scholar
  • Choi D. S., Van Roy B. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems (2006) 16(2)Google Scholar
  • Chow C. S., Tsitsiklis J. N. The complexity of dynamic programming. J. Complexity (1989) 5:466–488CrossrefGoogle Scholar
  • Chow C. S., Tsitsiklis J. N. An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Trans. Automat. Control (1991) 36(8):898–914CrossrefGoogle Scholar
  • de Farias D. P., Van Roy B. On the existence of fixed points for approximate value iteration and temporal-difference learning. J. Optim. Theory Appl. (2000) 105(3):589–608CrossrefGoogle Scholar
  • de Farias D. P., Van Roy B. Approximate dynamic programming via linear programming. Advances in Neural Information Processing Systems (2002) 14(MIT Press, Cambridge, MA) Google Scholar
  • de Farias D. P., Van Roy B. The linear programming approach to approximate dynamic programming. Oper. Res. (2003) 51(6):850–865LinkGoogle Scholar
  • Fox B. L. Discretizing dynamic programs. J. Optim. Theory Appl. (1973) 11:228–234CrossrefGoogle Scholar
  • Gordon G. J. Stable function approximation in dynamic programming. (1995) . Technical report CMU-CS-95-103, Carnegie Mellon University, Pittsburgh, PAGoogle Scholar
  • Gordon G. J. Stable function approximation in dynamic programming. Machine Learning: Proc. 12th Internat. Conf. (1995) San Francisco, CA(ICML)Google Scholar
  • Gordon G. J. Approximate solutions to Markov decision processes. (1999) . Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PAGoogle Scholar
  • Hinderer K., Puterman M. On approximate solutions of finite-stage dynamic programs. Dynamic Programming and Its Applications (1979) (Academic Press, New York) Google Scholar
  • Jaakkola T., Jordan M. I., Singh S. P. On the convergence of stochastic iterative dynamic programming algorithms. Neural Comput. (1994) 6:1185–1201CrossrefGoogle Scholar
  • Jaakkola T., Singh S. P., Jordan M. I. Reinforcement learning algorithms for partially observable Markov decision problems. Advances in Neural Information Processing Systems (1995) 7:345–352Google Scholar
  • Kushner H. J., Dupuis P. G.Numerical Methods for Stochastic Control Problems in Continuous Time (1992) (Springer-Verlag, New York) CrossrefGoogle Scholar
  • Lambert T., Epelman M. A., Smith R. L. Aggregation in stochastic dynamic programming. (2004) . Technical report 04-07, Department of Industrial and Operations Engineering, Ann Arbor, MIGoogle Scholar
  • Mendelssohn R. An iterative aggregation procedure for Markov decision processes. Oper. Res. (1982) 30:62–73LinkGoogle Scholar
  • Moore Andrew, Atkeson Chris. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning (1995) 21(3):199–233Google Scholar
  • Morin T., Puterman M. Computational advances in dynamic programming. Dynamic Programming and Its Applications. (1979) (Academic Press, New York) 202–207Google Scholar
  • Nedic A., Bertsekas D. P. Least-squares policy evaluation algorithms with linear function approximation. Discrete Event Dynam. Systems (2003) 13:79–110CrossrefGoogle Scholar
  • Rummery G. A. Problem solving with reinforcement learning. (1995) . Ph.D. thesis, Cambridge University, Cambridge, UKGoogle Scholar
  • Rummery G. A., Niranjan M. On-line Q-learning using connectionist systems. (1994) . Technical report CUED/F-INFENG/TR 166, Engineering Department, Cambridge UniversityGoogle Scholar
  • Rust J. Using randomization to break the curse of dimensionality. Econometrica (1997) 65(3):487–516CrossrefGoogle Scholar
  • Singh S. P., Yee R. C. An upper-bound on the loss from approximate optimal-value functions. Machine Learning (1994) 16(3):227–233Google Scholar
  • Sutton R. S. Temporal credit assignment in reinforcement learning. (1984) . Ph.D. thesis, University of Massachusetts Amherst, Amherst, MAGoogle Scholar
  • Sutton R. S. Learning to predict by the methods of temporal differences. Machine Learning (1988) 3:9–44CrossrefGoogle Scholar
  • Sutton R. S. On the virtues of linear learning and trajectory distributions. Proc. Workshop Value Function Approximation, Machine Learning Conf. (1995) Google Scholar
  • Sutton R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems (1996) 8(MIT Press, Cambridge, MA) Google Scholar
  • Sutton R. S., Barto A. G.Reinforcement Learning: An Introduction (1998) (MIT Press, Cambridge, MA) Google Scholar
  • Tsitsiklis J. N. Asynchronous stochastic approximation and Q-learning. Machine Learning (1994) 16:185–202CrossrefGoogle Scholar
  • Tsitsiklis J. N., Van Roy B. Feature-based methods for large scale dynamic programming. Machine Learning (1996) 22:59–94Google Scholar
  • Tsitsiklis J. N., Van Roy B. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control (1997) 42(5):674–690CrossrefGoogle Scholar
  • Tsitsiklis J. N., Van Roy B. Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems (1997) 9(MIT Press, Cambridge, MA) Google Scholar
  • Tsitsiklis J. N., Van Roy B. Average cost temporal-difference learning. Proc. IEEE Conf. Decision Control (1997) IEEESan Diego, CAGoogle Scholar
  • Tsitsiklis J. N., Van Roy B. Average cost temporal-difference learning. Automatica (1999) 35(11):1799–1808CrossrefGoogle Scholar
  • Tsitsiklis J. N., Van Roy B. On average versus discounted reward temporal-difference learning. Machine Learning (2002) 49:2–3CrossrefGoogle Scholar
  • Van Roy B. Learning and value function approximation in complex decision processes. (1998) . Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MAGoogle Scholar
  • Watkins C. J. C. H. Learning from delayed rewards. (1989) . Ph.D. thesis, Cambridge University, Cambridge, UKGoogle Scholar
  • Watkins C. J. C. H., Dayan P. Q-learning. Machine Learning (1992) 8:279–292CrossrefGoogle Scholar
  • Werbos P. J. Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook (1977) 22:25–38Google Scholar
  • Werbos P. J. Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks (1990) 3:179–189CrossrefGoogle Scholar
  • Whitt W. Approximations of dynamic programs I. Math. Oper. Res. (1978) 3(3):231–243LinkGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.