Amit R, Meir R, Ciosek K (2020) Discount factor as a regularizer in reinforcement learning. Proc. 37th Internat. Conf. Machine Learning (PMLR), 269–278.Google Scholar
Barreto AMS, Precup D, Pineau J (2011) Reinforcement Learning Using Kernel-Based Stochastic Factorization, Advances in Neural Information Processing Systems, vol. 24 (MIT Press, Cambridge, MA), 720–728.Google Scholar
Bartlett PL, Mendelson S (2002) Rademacher and Gaussian complexities: Risk bounds and structural results. J. Machine Learning Res. 3:463–482.Google Scholar
Bethke B, How JP, Ozdaglar A (2023) Kernel-based reinforcement learning using Bellman residual elimination. J. Machine Learning Res. Forthcoming.Google Scholar
Chen RR, Meyn S (1998) Value iteration and optimization of multiclass queueing networks., 1998. Proc. 37th IEEE Conf. Decision and Control, vol. 1 (IEEE, Piscataway, NJ), 50–55.Google Scholar
Dai JG, Lin W (2005) Maximum pressure policies in stochastic processing networks. Oper. Res. 53(2):197–218.Link, Google Scholar
De Farias DP, Van Roy B (2000) On the existence of fixed points for approximate value iteration and temporal-difference learning. J. Optim. Theory Appl. 105(3):589–608.Google Scholar
de Farias DP, Van Roy B (2003) The linear programming approach to approximate dynamic programming. Oper. Res. 51(6):850–865.Link, Google Scholar
de Farias DP, Van Roy B (2004) On constraint sampling in the linear programming approach to approximate dynamic programming. Math. Oper. Res. 29:462–478.Link, Google Scholar
Desai VV, Farias VF, Moallemi CC (2012) Approximate dynamic programming via a smoothed linear program. Oper. Res. 60(3):655–674.Link, Google Scholar
Dewanto V, Gallagher M (2021) Examining average and discounted reward optimality criteria in reinforcement learning. Preprint, submitted July 3, https://arxiv.org/abs/2107.01348.Google Scholar
Dietterich TG, Wang X (2002) Batch Value Function Approximation via Support Vectors, Advances in Neural Information Processing Systems, vol. 14 (MIT Press, Cambridge, MA), 1491–1498.Google Scholar
Engel Y, Mannor S, Meir R (2003) Bayes meets Bellman: The Gaussian process approach to temporal difference learning. Proc. 20th Internat. Conf. Machine Learning (AAAI Press, Washington, DC), 154–161.Google Scholar
Ernst D, Geurts P, Wehenkel L (2005) Tree-based batch mode reinforcement learning. J. Machine Learning Res. 6:503–556.Google Scholar
Farahmand AM, Ghavamzadeh M, Mannor S, Szepesvári C (2009a) Regularized policy iteration. Adv. Neural Inform. Processing Systems 21:441–448.Google Scholar
Farahmand AM, Ghavamzadeh M, Szepesvári C, Mannor S (2009b) Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. 2009 Amer. Control Conf. ACC’09 (IEEE, Piscataway, NJ), 725–730.Google Scholar
Harrison JM, Wein LM (1989) Scheduling network of queues: Heavy traffic analysis of a simple open network. Queueing Systems 5:265–280.Google Scholar
Haskell WB, Jain R, Kalathil D (2016) Empirical dynamic programming. Math. Oper. Res. 41(2):402–429.Link, Google Scholar
Jiang N, Kulesza A, Singh S, Lewis R (2015) The dependence of effective planning horizon on model accuracy. Proc. 2015 Internat. Conf. Autonomous Agents Multiagent Systems (IFAAMAS, Richland, SC), 1181–1189.Google Scholar
Joachims T (1999) Advances in kernel methods: Support vector learning. Schölkopf B, Burges CJC, Smola AJ, eds. Making Large-Scale Support Vector Machine Learning Practical (MIT Press, Cambridge, MA), 169–184.Google Scholar
Kolter JZ, Ng AY (2009) Regularization and feature selection in least-squares temporal difference learning. ICML’09 Proc. 26th Annu. Internat. Conf. Machine Learning (Association for Computing Machinery, New York), 521–528.Google Scholar
Kumar PR, Seidman TI (1990) Dynamic instabilities and stabilization methods in distributed real-time scheduling of manufacturing systems. IEEE Trans. Automatic Control 35(3):289–298.Google Scholar
Kumar S, Muthuraman K (2004) A numerical method for solving singular stochastic control problems. Oper. Res. 52(4):563–582.Link, Google Scholar
Kushner HJ, Martins LF (1996) Heavy traffic analysis of a controlled multiclass queueing network via weak convergence methods. SIAM J. Control Optim. 34(5):1781–1797.Google Scholar
Luenberger DG (1997) Optimization by Vector Space Methods (John Wiley & Sons, Inc., New York).Google Scholar
Manne AS (1960) Linear programming and sequential decisions. Management Sci. 6(3):259–267.Link, Google Scholar
Martins LF, Shreve SE, Soner HM (1996) Heavy traffic convergence of a controlled multiclass queueing network. SIAM J. Control Optim. 34(6):2133–2171.Google Scholar
Moallemi CC, Kumar S, Van Roy B (2008) Approximate and data-driven dynamic programming for queueing networks. Working paper, Columbia University, New York.Google Scholar
Ormoneit D, Glynn P (2002) Kernel-based reinforcement learning in average cost problems. IEEE Trans. Automatic Control 47(10):1624–1636.Google Scholar
Ormoneit D, Sen S (2002) Tree-based batch mode reinforcement learning. Machine Learning 49(2):161–178.Google Scholar
Osuna E, Freund R, Girosi F (1997) An improved training algorithm for support vector machines. Neural Networks Signal Processing, Proc. 1997 IEEE Workshop (IEEE, Piscataway, NJ), 276–285.Google Scholar
Pazis J, Parr R (2011) Non-parametric approximate linear programming for MDPs. Proc. Twenty-Fifth AAAI Conf. Artificial Intelligence (AAAI Press, Washington, DC), 459–464.Google Scholar
Petrik M, Taylor G, Parr R, Zilberstein S (2010) Feature selection using regularization in approximate linear programs for Markov decision processes. ICML’10 Proc. 27th Annu. Internat. Conf. Machine Learning (Association for Computing Machinery, New York), 871–879.Google Scholar
Scholkopf B, Smola AJ (2001) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (The MIT Press, Cambridge, MA).Google Scholar
Schweitzer P, Seidman A (1985) Generalized polynomial approximations in Markovian decision processes. J. Math. Anal. Appl. 110:568–582.Google Scholar
Stolyar AL (2004) Maxweight scheduling in a generalized switch: State space collapse and workload minimization in heavy traffic. Ann. Appl. Probab. 14:1–53.Google Scholar
Tassiulas L, Ephremides A (1992) Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Trans. Automatic Control 37(12):1936–1948.Google Scholar
Van Roy B (2006) Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31(2):234–244.Link, Google Scholar
Xu X, Hu D, Lu X (2007) Kernel-based least squares policy iteration for reinforcement learning. IEEE Trans. Neural Networks 18(4):973–992.Google Scholar

Volume 13, Issue 3

September 2023

Pages 321-397

Article Information

Metrics

Information

Received:January 05, 2018
Accepted:January 09, 2023
Published Online:March 10, 2023

Cite as

Nikhil Bhat, Vivek F. Farias, Ciamac C. Moallemi, Andrew T. Zheng (2023) Nonparametric Approximate Dynamic Programming via the Kernel Method. Stochastic Systems 13(3):321-342.

https://doi.org/10.1287/stsy.2023.0107

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Nonparametric Approximate Dynamic Programming via the Kernel Method

References

Volume 13, Issue 3

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News