Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization

Zheng Wen
Corresponding Author
Zheng Wen
[email protected]
Adobe Research, San Jose, California 95110
Search for more papers by this author
,
Benjamin Van Roy
Benjamin Van Roy
[email protected]
Stanford University, Stanford, California 94305
Search for more papers by this author

Zheng Wen

Corresponding Author

Zheng Wen

[email protected]

Adobe Research, San Jose, California 95110

Search for more papers by this author

Benjamin Van Roy

[email protected]

Stanford University, Stanford, California 94305

Search for more papers by this author

Published Online:2 Mar 2017https://doi.org/10.1287/moor.2016.0826

References

Abbasi-Yadkori Y, Szepesvári C (2011) Regret bounds for the adaptive control of linear quadratic systems. J. Machine Learn. Res.—Proc. Track 19:1–26.Google Scholar
Auer P, Ortner R (2006) Logarithmic online regret bounds for undiscounted reinforcement learning. Schölkopf PB, Platt JC, Hoffman T, eds. Adv. Neural Inform. Processing Systems (NIPS 2006), Vol. 19 (MIT Press, Cambridge, MA), 49–56.Google Scholar
Azar MG, Lazaric A, Brunskill E (2013) Regret bounds for reinforcement learning with policy advice. Daelemans W, Morik K, eds. Machine Learning and Knowledge Discovery in Databases (Springer, Berlin), 97–112.Crossref, Google Scholar
Bartlett PL, Tewari A (2009) REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Bilmes JA, Ng AY, eds. Proc. 25th Conf. Uncertainty Artificial Intelligence (UAI2009) (AUAI Press, Arlington, VA), 35–42.Google Scholar
Bertsekas DP, Tsitsiklis J (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
Brafman RI, Tennenholtz M (2002) R-max—A general polynomial time algorithm for near-optimal reinforcement learning. J. Machine Learn. Res. 3(October):213–231.Google Scholar
Gordon G (1995) Online fitted reinforcement learning. Touretzky DS, Mozer MC, Hasselmo ME, eds. Adv. Neural Inform. Processing Systems (NIPS 1995), Vol. 8 (MIT Press, Cambridge, MA), 1052–1058.Google Scholar
Ibrahimi M, Javanmard A, Van Roy B (2012) Efficient reinforcement learning for high dimensional linear quadratic systems. Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS 2012), Vol. 25 (Curran Associates, Red Hook, NY), 2636–2644.Google Scholar
Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(April):1563–1600.Google Scholar
Kakade S (2003) On the sample complexity of reinforcement learning. Ph.D. thesis, University College London, London.Google Scholar
Karmarkar N (1984) A new polynomial-time algorithm for linear programming. Combinatorica 4(4):373–396.Crossref, Google Scholar
Kearns MJ, Koller D (1999) Efficient reinforcement learning in factored MDPs. Proc. 16th Internat. Joint Conf. Artificial Intelligence, Vol. 2 (Morgan Kaufmann, San Francisco), 740–747.Google Scholar
Kearns MJ, Singh SP (2002) Near-optimal reinforcement learning in polynomial time. Machine Learn. 49(2–3):209–232.Crossref, Google Scholar
Lagoudakis MG, Parr R, Littman ML (2002) Least-squares methods in reinforcement learning for control. Vlahavas IP, Spyropoulos CD, eds. Methods and Applications of Artificial Intelligence (Springer, Berlin), 249–260.Crossref, Google Scholar
Lattimore T, Hutter M, Sunehag P (2013) The sample-complexity of general reinforcement learning. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learning, 28–36.Google Scholar
Li L, Littman M (2010) Reducing reinforcement learning to KWIK online regression. Ann. Math. Artificial Intelligence 58(2):319–325.Google Scholar
Li L, Littman ML, Walsh TJ, Strehl AL (2011) Knows what it knows: A framework for self-aware learning. Machine Learn. 82(3):399–443.Crossref, Google Scholar
Neu G, György A, Szepesvári C (2012) The adversarial stochastic shortest path problem with unknown transition probabilities. Lawrence N, Girolami M, eds. Proc. 15th Internat. Conf. Artificial Intelligence Statist., 805–813.Google Scholar
Neu G, Antos A, György A, Szepesvári C (2010) Online Markov decision processes under bandit feedback. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Adv. Neural Inform. Processing Systems (NIPS 2010), Vol. 23 (Curran Associates, Red Hook, NY), 1804–1812.Google Scholar
Ortner R, Ryabko D (2012) Online regret bounds for undiscounted continuous reinforcement learning. Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS 2012), Vol. 25 (Curran Associates, Red Hook, NY), 1763–1771.Google Scholar
Osband I, Russo D, Van Roy B (2013) (More) efficient reinforcement learning via posterior sampling. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS 2013), Vol. 26 (Curran Associates, Red Hook, NY), 3003–3011.Google Scholar
Pazis J, Parr R (2013) PAC optimal exploration in continuous space Markov decision processes. desJardins M, Littman ML, eds. Proc. 27th AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA).Google Scholar
Powell W (2007) Approximate Dynamic Programming: Solving the Curses of Dimensionality (John Wiley & Sons, Hoboken, NJ).Crossref, Google Scholar
Powell W, Ryzhov I (2011) Optimal Learning (John Wiley & Sons, Hoboken, NJ).Google Scholar
Rummery GA, Niranjan M (1994) On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University, Cambridge, UK.Google Scholar
Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
Ryzhov IO, Powell WB (2010) Approximate dynamic programming with correlated Bayesian beliefs. Viswanath P, Meyn S, eds. Proc. 2010 48th Annual Allerton Conf. Comm., Control, Comput. (Allerton) (IEEE, Piscataway, NJ), 1360–1367.Crossref, Google Scholar
Singh SP, Jaakkola T, Jordan MI (1995) Reinforcement learning with soft state aggregation. Tesauro G, Touretzky DS, Leen TK, eds. Adv. Neural Inform. Processing Systems (NIPS 1994), Vol. 7 (MIT Press, Cambridge, MA), 361–368.Google Scholar
Strehl AL (2007) Probably approximately correct (PAC) exploration in reinforcement learning. Ph.D. thesis, Rutgers, The State University of New Jersey, New Brunswick.Google Scholar
Strehl EL, Li L, Wiewiora E, Langford J, Littman ML (2006) PAC model-free reinforcement learning. Cohen W, Moore A, eds. Proc. 23rd Internat. Conf. Machine Learn. (ACM, New York), 881–888.Crossref, Google Scholar
Sutton R, Barto A (1998) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
Szepesvári C (2010) Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning (Morgan & Claypool Publishers, San Rafael, CA).Google Scholar
Tsitsiklis JN, Van Roy B (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1–3):59–94.Crossref, Google Scholar
Van Roy B (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.Link, Google Scholar
Osband I, Van Roy B, Wen Z (2016) Generalization and exploration via randomized value functions. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., 2377–2386.Google Scholar
Wang HO, Tanaka K, Griffin MF (1996) An approach to fuzzy control of nonlinear systems: Stability and design issues. IEEE Trans. Fuzzy Systems 4(1):14–23.Crossref, Google Scholar
Wen Z, Van Roy B (2013) Efficient exploration and value function generalization in deterministic systems. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS 2013), Vol. 26 (Curran Associates, Red Hook, NY), 3021–3029.Google Scholar
Whitehead SD (1991) Complexity and cooperation in Q-learning. Birnbaum LA, Collins GC, eds. Proc. 8th Internat. Workshop Machine Learn. (Morgan Kaufmann, San Francisco), 363–367.Crossref, Google Scholar

cover image Mathematics of Operations Research

Volume 42, Issue 3

August 2017

Pages 577-896

Article Information

Metrics

Information

Received:April 21, 2015
Accepted:August 06, 2016
Published Online:March 02, 2017

Cite as

Zheng Wen, Benjamin Van Roy (2017) Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization. Mathematics of Operations Research 42(3):762-782.

https://doi.org/10.1287/moor.2016.0826

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization

References

Volume 42, Issue 3

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News