Nonparametric Approximate Dynamic Programming via the Kernel Method

Published Online:https://doi.org/10.1287/stsy.2023.0107

References

  • Amit R, Meir R, Ciosek K (2020) Discount factor as a regularizer in reinforcement learning. Proc. 37th Internat. Conf. Machine Learning (PMLR), 269–278.Google Scholar
  • Barreto AMS, Precup D, Pineau J (2011) Reinforcement Learning Using Kernel-Based Stochastic Factorization, Advances in Neural Information Processing Systems, vol. 24 (MIT Press, Cambridge, MA), 720–728.Google Scholar
  • Bartlett PL, Mendelson S (2002) Rademacher and Gaussian complexities: Risk bounds and structural results. J. Machine Learning Res. 3:463–482.Google Scholar
  • Bethke B, How JP, Ozdaglar A (2023) Kernel-based reinforcement learning using Bellman residual elimination. J. Machine Learning Res. Forthcoming.Google Scholar
  • Chen RR, Meyn S (1998) Value iteration and optimization of multiclass queueing networks., 1998. Proc. 37th IEEE Conf. Decision and Control, vol. 1 (IEEE, Piscataway, NJ), 50–55.Google Scholar
  • Dai JG, Lin W (2005) Maximum pressure policies in stochastic processing networks. Oper. Res. 53(2):197–218.LinkGoogle Scholar
  • De Farias DP, Van Roy B (2000) On the existence of fixed points for approximate value iteration and temporal-difference learning. J. Optim. Theory Appl. 105(3):589–608.Google Scholar
  • de Farias DP, Van Roy B (2003) The linear programming approach to approximate dynamic programming. Oper. Res. 51(6):850–865.LinkGoogle Scholar
  • de Farias DP, Van Roy B (2004) On constraint sampling in the linear programming approach to approximate dynamic programming. Math. Oper. Res. 29:462–478.LinkGoogle Scholar
  • Desai VV, Farias VF, Moallemi CC (2012) Approximate dynamic programming via a smoothed linear program. Oper. Res. 60(3):655–674.LinkGoogle Scholar
  • Dewanto V, Gallagher M (2021) Examining average and discounted reward optimality criteria in reinforcement learning. Preprint, submitted July 3, https://arxiv.org/abs/2107.01348.Google Scholar
  • Dietterich TG, Wang X (2002) Batch Value Function Approximation via Support Vectors, Advances in Neural Information Processing Systems, vol. 14 (MIT Press, Cambridge, MA), 1491–1498.Google Scholar
  • Engel Y, Mannor S, Meir R (2003) Bayes meets Bellman: The Gaussian process approach to temporal difference learning. Proc. 20th Internat. Conf. Machine Learning (AAAI Press, Washington, DC), 154–161.Google Scholar
  • Ernst D, Geurts P, Wehenkel L (2005) Tree-based batch mode reinforcement learning. J. Machine Learning Res. 6:503–556.Google Scholar
  • Farahmand AM, Ghavamzadeh M, Mannor S, Szepesvári C (2009a) Regularized policy iteration. Adv. Neural Inform. Processing Systems 21:441–448.Google Scholar
  • Farahmand AM, Ghavamzadeh M, Szepesvári C, Mannor S (2009b) Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. 2009 Amer. Control Conf. ACC’09 (IEEE, Piscataway, NJ), 725–730.Google Scholar
  • Harrison JM, Wein LM (1989) Scheduling network of queues: Heavy traffic analysis of a simple open network. Queueing Systems 5:265–280.Google Scholar
  • Haskell WB, Jain R, Kalathil D (2016) Empirical dynamic programming. Math. Oper. Res. 41(2):402–429.LinkGoogle Scholar
  • Jiang N, Kulesza A, Singh S, Lewis R (2015) The dependence of effective planning horizon on model accuracy. Proc. 2015 Internat. Conf. Autonomous Agents Multiagent Systems (IFAAMAS, Richland, SC), 1181–1189.Google Scholar
  • Joachims T (1999) Advances in kernel methods: Support vector learning. Schölkopf B, Burges CJC, Smola AJ, eds. Making Large-Scale Support Vector Machine Learning Practical (MIT Press, Cambridge, MA), 169–184.Google Scholar
  • Kolter JZ, Ng AY (2009) Regularization and feature selection in least-squares temporal difference learning. ICML’09 Proc. 26th Annu. Internat. Conf. Machine Learning (Association for Computing Machinery, New York), 521–528.Google Scholar
  • Kumar PR, Seidman TI (1990) Dynamic instabilities and stabilization methods in distributed real-time scheduling of manufacturing systems. IEEE Trans. Automatic Control 35(3):289–298.Google Scholar
  • Kumar S, Muthuraman K (2004) A numerical method for solving singular stochastic control problems. Oper. Res. 52(4):563–582.LinkGoogle Scholar
  • Kushner HJ, Martins LF (1996) Heavy traffic analysis of a controlled multiclass queueing network via weak convergence methods. SIAM J. Control Optim. 34(5):1781–1797.Google Scholar
  • Luenberger DG (1997) Optimization by Vector Space Methods (John Wiley & Sons, Inc., New York).Google Scholar
  • Manne AS (1960) Linear programming and sequential decisions. Management Sci. 6(3):259–267.LinkGoogle Scholar
  • Martins LF, Shreve SE, Soner HM (1996) Heavy traffic convergence of a controlled multiclass queueing network. SIAM J. Control Optim. 34(6):2133–2171.Google Scholar
  • Moallemi CC, Kumar S, Van Roy B (2008) Approximate and data-driven dynamic programming for queueing networks. Working paper, Columbia University, New York.Google Scholar
  • Ormoneit D, Glynn P (2002) Kernel-based reinforcement learning in average cost problems. IEEE Trans. Automatic Control 47(10):1624–1636.Google Scholar
  • Ormoneit D, Sen S (2002) Tree-based batch mode reinforcement learning. Machine Learning 49(2):161–178.Google Scholar
  • Osuna E, Freund R, Girosi F (1997) An improved training algorithm for support vector machines. Neural Networks Signal Processing, Proc. 1997 IEEE Workshop (IEEE, Piscataway, NJ), 276–285.Google Scholar
  • Pazis J, Parr R (2011) Non-parametric approximate linear programming for MDPs. Proc. Twenty-Fifth AAAI Conf. Artificial Intelligence (AAAI Press, Washington, DC), 459–464.Google Scholar
  • Petrik M, Taylor G, Parr R, Zilberstein S (2010) Feature selection using regularization in approximate linear programs for Markov decision processes. ICML’10 Proc. 27th Annu. Internat. Conf. Machine Learning (Association for Computing Machinery, New York), 871–879.Google Scholar
  • Scholkopf B, Smola AJ (2001) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (The MIT Press, Cambridge, MA).Google Scholar
  • Schweitzer P, Seidman A (1985) Generalized polynomial approximations in Markovian decision processes. J. Math. Anal. Appl. 110:568–582.Google Scholar
  • Stolyar AL (2004) Maxweight scheduling in a generalized switch: State space collapse and workload minimization in heavy traffic. Ann. Appl. Probab. 14:1–53.Google Scholar
  • Tassiulas L, Ephremides A (1992) Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Trans. Automatic Control 37(12):1936–1948.Google Scholar
  • Van Roy B (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.LinkGoogle Scholar
  • Xu X, Hu D, Lu X (2007) Kernel-based least squares policy iteration for reinforcement learning. IEEE Trans. Neural Networks 18(4):973–992.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.