Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization

Published Online:https://doi.org/10.1287/moor.2016.0826

References

  • Abbasi-Yadkori Y, Szepesvári C (2011) Regret bounds for the adaptive control of linear quadratic systems. J. Machine Learn. Res.—Proc. Track 19:1–26.Google Scholar
  • Auer P, Ortner R (2006) Logarithmic online regret bounds for undiscounted reinforcement learning. Schölkopf PB, Platt JC, Hoffman T, eds. Adv. Neural Inform. Processing Systems (NIPS 2006), Vol. 19 (MIT Press, Cambridge, MA), 49–56.Google Scholar
  • Azar MG, Lazaric A, Brunskill E (2013) Regret bounds for reinforcement learning with policy advice. Daelemans W, Morik K, eds. Machine Learning and Knowledge Discovery in Databases (Springer, Berlin), 97–112.CrossrefGoogle Scholar
  • Bartlett PL, Tewari A (2009) REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Bilmes JA, Ng AY, eds. Proc. 25th Conf. Uncertainty Artificial Intelligence (UAI2009) (AUAI Press, Arlington, VA), 35–42.Google Scholar
  • Bertsekas DP, Tsitsiklis J (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
  • Brafman RI, Tennenholtz M (2002) R-max—A general polynomial time algorithm for near-optimal reinforcement learning. J. Machine Learn. Res. 3(October):213–231.Google Scholar
  • Gordon G (1995) Online fitted reinforcement learning. Touretzky DS, Mozer MC, Hasselmo ME, eds. Adv. Neural Inform. Processing Systems (NIPS 1995), Vol. 8 (MIT Press, Cambridge, MA), 1052–1058.Google Scholar
  • Ibrahimi M, Javanmard A, Van Roy B (2012) Efficient reinforcement learning for high dimensional linear quadratic systems. Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS 2012), Vol. 25 (Curran Associates, Red Hook, NY), 2636–2644.Google Scholar
  • Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(April):1563–1600.Google Scholar
  • Kakade S (2003) On the sample complexity of reinforcement learning. Ph.D. thesis, University College London, London.Google Scholar
  • Karmarkar N (1984) A new polynomial-time algorithm for linear programming. Combinatorica 4(4):373–396.CrossrefGoogle Scholar
  • Kearns MJ, Koller D (1999) Efficient reinforcement learning in factored MDPs. Proc. 16th Internat. Joint Conf. Artificial Intelligence, Vol. 2 (Morgan Kaufmann, San Francisco), 740–747.Google Scholar
  • Kearns MJ, Singh SP (2002) Near-optimal reinforcement learning in polynomial time. Machine Learn. 49(2–3):209–232.CrossrefGoogle Scholar
  • Lagoudakis MG, Parr R, Littman ML (2002) Least-squares methods in reinforcement learning for control. Vlahavas IP, Spyropoulos CD, eds. Methods and Applications of Artificial Intelligence (Springer, Berlin), 249–260.CrossrefGoogle Scholar
  • Lattimore T, Hutter M, Sunehag P (2013) The sample-complexity of general reinforcement learning. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learning, 28–36.Google Scholar
  • Li L, Littman M (2010) Reducing reinforcement learning to KWIK online regression. Ann. Math. Artificial Intelligence 58(2):319–325.Google Scholar
  • Li L, Littman ML, Walsh TJ, Strehl AL (2011) Knows what it knows: A framework for self-aware learning. Machine Learn. 82(3):399–443.CrossrefGoogle Scholar
  • Neu G, György A, Szepesvári C (2012) The adversarial stochastic shortest path problem with unknown transition probabilities. Lawrence N, Girolami M, eds. Proc. 15th Internat. Conf. Artificial Intelligence Statist., 805–813.Google Scholar
  • Neu G, Antos A, György A, Szepesvári C (2010) Online Markov decision processes under bandit feedback. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Adv. Neural Inform. Processing Systems (NIPS 2010), Vol. 23 (Curran Associates, Red Hook, NY), 1804–1812.Google Scholar
  • Ortner R, Ryabko D (2012) Online regret bounds for undiscounted continuous reinforcement learning. Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS 2012), Vol. 25 (Curran Associates, Red Hook, NY), 1763–1771.Google Scholar
  • Osband I, Russo D, Van Roy B (2013) (More) efficient reinforcement learning via posterior sampling. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS 2013), Vol. 26 (Curran Associates, Red Hook, NY), 3003–3011.Google Scholar
  • Pazis J, Parr R (2013) PAC optimal exploration in continuous space Markov decision processes. desJardins M, Littman ML, eds. Proc. 27th AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA).Google Scholar
  • Powell W (2007) Approximate Dynamic Programming: Solving the Curses of Dimensionality (John Wiley & Sons, Hoboken, NJ).CrossrefGoogle Scholar
  • Powell W, Ryzhov I (2011) Optimal Learning (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • Rummery GA, Niranjan M (1994) On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University, Cambridge, UK.Google Scholar
  • Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.LinkGoogle Scholar
  • Ryzhov IO, Powell WB (2010) Approximate dynamic programming with correlated Bayesian beliefs. Viswanath P, Meyn S, eds. Proc. 2010 48th Annual Allerton Conf. Comm., Control, Comput. (Allerton) (IEEE, Piscataway, NJ), 1360–1367.CrossrefGoogle Scholar
  • Singh SP, Jaakkola T, Jordan MI (1995) Reinforcement learning with soft state aggregation. Tesauro G, Touretzky DS, Leen TK, eds. Adv. Neural Inform. Processing Systems (NIPS 1994), Vol. 7 (MIT Press, Cambridge, MA), 361–368.Google Scholar
  • Strehl AL (2007) Probably approximately correct (PAC) exploration in reinforcement learning. Ph.D. thesis, Rutgers, The State University of New Jersey, New Brunswick.Google Scholar
  • Strehl EL, Li L, Wiewiora E, Langford J, Littman ML (2006) PAC model-free reinforcement learning. Cohen W, Moore A, eds. Proc. 23rd Internat. Conf. Machine Learn. (ACM, New York), 881–888.CrossrefGoogle Scholar
  • Sutton R, Barto A (1998) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • Szepesvári C (2010) Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning (Morgan & Claypool Publishers, San Rafael, CA).Google Scholar
  • Tsitsiklis JN, Van Roy B (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1–3):59–94.CrossrefGoogle Scholar
  • Van Roy B (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.LinkGoogle Scholar
  • Osband I, Van Roy B, Wen Z (2016) Generalization and exploration via randomized value functions. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., 2377–2386.Google Scholar
  • Wang HO, Tanaka K, Griffin MF (1996) An approach to fuzzy control of nonlinear systems: Stability and design issues. IEEE Trans. Fuzzy Systems 4(1):14–23.CrossrefGoogle Scholar
  • Wen Z, Van Roy B (2013) Efficient exploration and value function generalization in deterministic systems. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS 2013), Vol. 26 (Curran Associates, Red Hook, NY), 3021–3029.Google Scholar
  • Whitehead SD (1991) Complexity and cooperation in Q-learning. Birnbaum LA, Collins GC, eds. Proc. 8th Internat. Workshop Machine Learn. (Morgan Kaufmann, San Francisco), 363–367.CrossrefGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.