Two Time-Scale Stochastic Approximation with Controlled Markov Noise and Off-Policy Temporal-Difference Learning

Prasenjit Karmakar (Corresponding Author)
[email protected]
http://orcid.org/0000-0001-6895-2364
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India

Shalabh Bhatnagar
[email protected]
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India

Published Online: 13 Jul 2017
https://doi.org/10.1287/moor.2017.0855

Mathematics of Operations Research

Volume 43, Issue 1

February 2018

Pages 1-346, C2

Article Information

Received: April 13, 2015
Accepted: February 07, 2017
Published Online: July 13, 2017

Cite as

Prasenjit Karmakar, Shalabh Bhatnagar (2017) Two Time-Scale Stochastic Approximation with Controlled Markov Noise and Off-Policy Temporal-Difference Learning. Mathematics of Operations Research 43(1):130-151.

https://doi.org/10.1287/moor.2017.0855
