Two Time-Scale Stochastic Approximation with Controlled Markov Noise and Off-Policy Temporal-Difference Learning
Published Online: 13 Jul 2017
https://doi.org/10.1287/moor.2017.0855
References
- (1984) Differential Inclusions: Set-Valued Maps and Viability Theory (Springer, Berlin).
- (1999) Dynamics of stochastic approximation algorithms. Azéma J, Émery M, Ledoux M, Yor M, eds. Séminaire de probabilités XXXIII (Springer, Berlin), 1–68.
- (2005) Stochastic approximations and differential inclusions. SIAM J. Control Optim. 44(1):328–348.
- (1990) Adaptive Algorithms and Stochastic Approximation (Springer, New York).
- (1995) Probability Theory: An Advanced Course (Springer, New York).
- (1997) Stochastic approximation with two time scales. Systems Control Lett. 29(5):291–294.
- (2006) Stochastic approximation with “controlled Markov noise.” Systems Control Lett. 55(2):139–145.
- (2008) Stochastic Approximation: A Dynamic Systems Viewpoint (Cambridge University Press, Cambridge, UK).
- (2012) Linear off-policy actor-critic. Proc. 29th Internat. Conf. Machine Learning, ICML ’12 (Omnipress, Madison, WI).
- (2003) Linear stochastic approximation driven by slowly varying Markov chains. Systems Control Lett. 50(2):95–102.
- (2003) On actor-critic algorithms. SIAM J. Control Optim. 42(4):1143–1166.
- (1990) Stochastic approximations for finite state Markov chains. Stochastic Processes Their Appl. 35(1):27–45.
- (2011) Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta, Alberta, Canada.
- (2005) Basis function adaptation in temporal difference reinforcement learning. Ann. Oper. Res. 134(1):215–238.
- (1984) Applications of a Kushner and Clark lemma to general classes of stochastic algorithms. IEEE Trans. Inform. Theory 30(2):140–151.
- (1976) Principles of Mathematical Analysis, 3rd ed. (McGraw-Hill, New York).
- (2008) A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Koller D, Schuurmans D, Bengio Y, Bottou L, eds. Adv. Neural Inform. Processing Systems 21, NIPS ’08.
- (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. Pohoreckyj Danyluk A, Bottou L, Littman ML, eds. Proc. 26th Internat. Conf. Machine Learning, ICML ’09 (ACM, New York), 993–1000.
- (2004) Almost sure convergence of two time-scale stochastic approximation algorithms. Proc. 2004 Amer. Control Conf. (IEEE, Piscataway, NJ).
- (2015) Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema. Stochastic Processes Their Appl. 125(5):1715–1755.
- (2012) Least squares temporal difference methods: An analysis under general conditions. SIAM J. Control Optim. 50(6):3310–3343.
- (2016) Weak convergence properties of constrained emphatic temporal-difference learning with constant and slowly diminishing stepsize. J. Machine Learning Res. 17(220):1–58.

