Markov Decision Processes with Arbitrary Reward Processes

Jia Yuan Yu
Department of Electrical and Computer Engineering, McGill University, Montréal, Québec H3A 2A7, Canada

Shie Mannor
Department of Electrical and Computer Engineering, McGill University, Montréal, Québec H3A 2A7, Canada, and Technion, Technion City, 32000 Haifa, Israel

Nahum Shimkin
Department of Electrical Engineering, Technion, Technion City, 32000 Haifa, Israel

Published Online: 6 Aug 2009
https://doi.org/10.1287/moor.1090.0397

Mathematics of Operations Research

Volume 34, Issue 3

August 2009

Pages 513-768

Received: August 22, 2007
Published Online: August 6, 2009

Cite as

Jia Yuan Yu, Shie Mannor, Nahum Shimkin (2009) Markov Decision Processes with Arbitrary Reward Processes. Mathematics of Operations Research 34(3):737–757.

https://doi.org/10.1287/moor.1090.0397
