Performance Loss Bounds for Approximate Value Iteration with State Aggregation

Benjamin Van Roy
Benjamin Van Roy
[email protected]
Stanford University, Stanford, California 94305
Search for more papers by this author

Benjamin Van Roy

[email protected]

Stanford University, Stanford, California 94305

Search for more papers by this author

Published Online:1 May 2006https://doi.org/10.1287/moor.1060.0188

References

Axsäter S. State aggregation in dynamic programming: An application to scheduling of independent jobs on parallel processors. Oper. Res. Lett. (1983) 2:171–176Crossref, Google Scholar
Barraquand J., Martineau D. Numerical valuation of high dimensional multivariate American securities. J. Financial Quant. Anal. (1997) 30(3):383–405Crossref, Google Scholar
Bean J. C., Birge J. R., Smith R. L. Aggregation in dynamic programming. Oper. Res. (1987) 35:215–220Link, Google Scholar
Bertsekas D. P. Convergence of discretization procedures in dynamic programming. IEEE Trans. Automat. Control (1975) 20:415–419Crossref, Google Scholar
Bertsekas D. P. A counterexample to temporal-difference learning. Neural Comput. (1994) 7:270–279Crossref, Google Scholar
Bertsekas D. P., Castañon D. A. Adaptive aggregation for infinite horizon dynamic programming. IEEE Trans. Automat. Control (1989) 34(6):589–598Crossref, Google Scholar
Bertsekas D. P., Ioffe S. Temporal differences-based policy iteration and applications in neuro-dynamic programming. (1996) . Technical report LIDS–P–2349, MIT Laboratory for Information and Decision Systems, Cambridge, MAGoogle Scholar
Bertsekas D. P., Shreve S. E.Stochastic Optimal Control: The Discrete Time Case (1996) (Athena Scientific, Belmont, MA) Google Scholar
Bertsekas D. P., Tsitsiklis J. N.Neuro-Dynamic Programming (1996) (Athena Scientific, Belmont, MA) Google Scholar
Bertsekas D. P., Borkar V., Nedic A., Si J., Barto A. G., Powell W. B., Wunsch D. C. Improved temporal difference methods with linear function approximation. Handbook of Learning and Approximate Dynamic Programming (2004) (IEEE Press and John Wiley & Sons, Boston, MA) Google Scholar
Birge J. R. Aggregation bounds in stochastic linear programming. Math. Programming (1985) 31:25–41Crossref, Google Scholar
Blackwell D. Discrete dynamic programming. Ann. Math. Statist. (1962) 33:719–726Crossref, Google Scholar
Boyan J. A., Bratko I., Dzeroski S. Least-squares temporal difference learning. Machine Learning: Proc. 16th Internat. Conf. (ICML) (1999) Cambridge, MAGoogle Scholar
Boyan J. A. Technical update: Least-squares temporal difference learning. Machine Learning (2002) 49:2–3Crossref, Google Scholar
Bradtke S. J., Barto A. G. Linear least-squares algorithms for temporal-difference learning. Machine Learning (1996) 33–57Google Scholar
Choi D. S., Van Roy B. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Machine Learning: Proc. 18th Internat. Conf. (2001) Palo Alto, CA(ICML)Google Scholar
Choi D. S., Van Roy B. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems (2006) 16(2)Google Scholar
Chow C. S., Tsitsiklis J. N. The complexity of dynamic programming. J. Complexity (1989) 5:466–488Crossref, Google Scholar
Chow C. S., Tsitsiklis J. N. An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Trans. Automat. Control (1991) 36(8):898–914Crossref, Google Scholar
de Farias D. P., Van Roy B. On the existence of fixed points for approximate value iteration and temporal-difference learning. J. Optim. Theory Appl. (2000) 105(3):589–608Crossref, Google Scholar
de Farias D. P., Van Roy B. Approximate dynamic programming via linear programming. Advances in Neural Information Processing Systems (2002) 14(MIT Press, Cambridge, MA) Google Scholar
de Farias D. P., Van Roy B. The linear programming approach to approximate dynamic programming. Oper. Res. (2003) 51(6):850–865Link, Google Scholar
Fox B. L. Discretizing dynamic programs. J. Optim. Theory Appl. (1973) 11:228–234Crossref, Google Scholar
Gordon G. J. Stable function approximation in dynamic programming. (1995) . Technical report CMU-CS-95-103, Carnegie Mellon University, Pittsburgh, PAGoogle Scholar
Gordon G. J. Stable function approximation in dynamic programming. Machine Learning: Proc. 12th Internat. Conf. (1995) San Francisco, CA(ICML)Google Scholar
Gordon G. J. Approximate solutions to Markov decision processes. (1999) . Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PAGoogle Scholar
Hinderer K., Puterman M. On approximate solutions of finite-stage dynamic programs. Dynamic Programming and Its Applications (1979) (Academic Press, New York) Google Scholar
Jaakkola T., Jordan M. I., Singh S. P. On the convergence of stochastic iterative dynamic programming algorithms. Neural Comput. (1994) 6:1185–1201Crossref, Google Scholar
Jaakkola T., Singh S. P., Jordan M. I. Reinforcement learning algorithms for partially observable Markov decision problems. Advances in Neural Information Processing Systems (1995) 7:345–352Google Scholar
Kushner H. J., Dupuis P. G.Numerical Methods for Stochastic Control Problems in Continuous Time (1992) (Springer-Verlag, New York) Crossref, Google Scholar
Lambert T., Epelman M. A., Smith R. L. Aggregation in stochastic dynamic programming. (2004) . Technical report 04-07, Department of Industrial and Operations Engineering, Ann Arbor, MIGoogle Scholar
Mendelssohn R. An iterative aggregation procedure for Markov decision processes. Oper. Res. (1982) 30:62–73Link, Google Scholar
Moore Andrew, Atkeson Chris. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning (1995) 21(3):199–233Google Scholar
Morin T., Puterman M. Computational advances in dynamic programming. Dynamic Programming and Its Applications. (1979) (Academic Press, New York) 202–207Google Scholar
Nedic A., Bertsekas D. P. Least-squares policy evaluation algorithms with linear function approximation. Discrete Event Dynam. Systems (2003) 13:79–110Crossref, Google Scholar
Rummery G. A. Problem solving with reinforcement learning. (1995) . Ph.D. thesis, Cambridge University, Cambridge, UKGoogle Scholar
Rummery G. A., Niranjan M. On-line Q-learning using connectionist systems. (1994) . Technical report CUED/F-INFENG/TR 166, Engineering Department, Cambridge UniversityGoogle Scholar
Rust J. Using randomization to break the curse of dimensionality. Econometrica (1997) 65(3):487–516Crossref, Google Scholar
Singh S. P., Yee R. C. An upper-bound on the loss from approximate optimal-value functions. Machine Learning (1994) 16(3):227–233Google Scholar
Sutton R. S. Temporal credit assignment in reinforcement learning. (1984) . Ph.D. thesis, University of Massachusetts Amherst, Amherst, MAGoogle Scholar
Sutton R. S. Learning to predict by the methods of temporal differences. Machine Learning (1988) 3:9–44Crossref, Google Scholar
Sutton R. S. On the virtues of linear learning and trajectory distributions. Proc. Workshop Value Function Approximation, Machine Learning Conf. (1995) Google Scholar
Sutton R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems (1996) 8(MIT Press, Cambridge, MA) Google Scholar
Sutton R. S., Barto A. G.Reinforcement Learning: An Introduction (1998) (MIT Press, Cambridge, MA) Google Scholar
Tsitsiklis J. N. Asynchronous stochastic approximation and Q-learning. Machine Learning (1994) 16:185–202Crossref, Google Scholar
Tsitsiklis J. N., Van Roy B. Feature-based methods for large scale dynamic programming. Machine Learning (1996) 22:59–94Google Scholar
Tsitsiklis J. N., Van Roy B. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control (1997) 42(5):674–690Crossref, Google Scholar
Tsitsiklis J. N., Van Roy B. Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems (1997) 9(MIT Press, Cambridge, MA) Google Scholar
Tsitsiklis J. N., Van Roy B. Average cost temporal-difference learning. Proc. IEEE Conf. Decision Control (1997) IEEESan Diego, CAGoogle Scholar
Tsitsiklis J. N., Van Roy B. Average cost temporal-difference learning. Automatica (1999) 35(11):1799–1808Crossref, Google Scholar
Tsitsiklis J. N., Van Roy B. On average versus discounted reward temporal-difference learning. Machine Learning (2002) 49:2–3Crossref, Google Scholar
Van Roy B. Learning and value function approximation in complex decision processes. (1998) . Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MAGoogle Scholar
Watkins C. J. C. H. Learning from delayed rewards. (1989) . Ph.D. thesis, Cambridge University, Cambridge, UKGoogle Scholar
Watkins C. J. C. H., Dayan P. Q-learning. Machine Learning (1992) 8:279–292Crossref, Google Scholar
Werbos P. J. Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook (1977) 22:25–38Google Scholar
Werbos P. J. Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks (1990) 3:179–189Crossref, Google Scholar
Whitt W. Approximations of dynamic programs I. Math. Oper. Res. (1978) 3(3):231–243Link, Google Scholar

cover image Mathematics of Operations Research

Volume 31, Issue 2

May 2006

Pages 217-432

Article Information

Metrics

Information

Received:August 02, 2004
Published Online:May 01, 2006

Cite as

Benjamin Van Roy, (2006) Performance Loss Bounds for Approximate Value Iteration with State Aggregation. Mathematics of Operations Research 31(2):234-244.

https://doi.org/10.1287/moor.1060.0188

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Performance Loss Bounds for Approximate Value Iteration with State Aggregation

References

Volume 31, Issue 2

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News