The Knowledge Gradient Algorithm for a General Class of Online Learning Problems

Ilya O. Ryzhov
Ilya O. Ryzhov
[email protected]
Robert H. Smith School of Business, University of Maryland, College Park, Maryland 20742
Search for more papers by this author
,
Warren B. Powell
Warren B. Powell
[email protected]
Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544
Search for more papers by this author
,
Peter I. Frazier
Peter I. Frazier
[email protected]
Department of Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853
Search for more papers by this author

Ilya O. Ryzhov

[email protected]

Robert H. Smith School of Business, University of Maryland, College Park, Maryland 20742

Search for more papers by this author

Warren B. Powell

[email protected]

Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544

Search for more papers by this author

Peter I. Frazier

[email protected]

Department of Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853

Search for more papers by this author

Published Online:1 Feb 2012https://doi.org/10.1287/opre.1110.0999

References

Auer P., Cesa-Bianchi N., Fischer P. Finite-time analysis of the multiarmed bandit problem. Machine Learn. (2002) 47(2–3):235–256Crossref, Google Scholar
Berry D. A., Pearson L. M. Optimal designs for clinical trials with dichotomous responses. Statist. Medicine (1985) 4(4):497–508Crossref, Google Scholar
Bertsimas D., Niño-Mora J. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Oper. Res. (2000) 48(1):80–90Link, Google Scholar
Brezzi M., Lai T. L. Incomplete learning from endogenous data in dynamic allocation. Econometrica (2000) 68(6):1511–1516Crossref, Google Scholar
Brezzi M., Lai T. L. Optimal learning and experimentation in bandit problems. J. Econom. Dynam. Control (2002) 27(1):87–108Crossref, Google Scholar
Chick S. E., Gans N. Economic analysis of simulation selection problems. Management Sci. (2009) 55(3):421–437Link, Google Scholar
Chick S. E., Branke J., Schmidt C. Sequential sampling to myopically maximize the expected value of information. INFORMS J. Comput. (2010) 22(1):71–80Link, Google Scholar
DeGroot M. H.Optimal Statistical Decisions (1970) (John Wiley & Sons, Hoboken, NJ) Google Scholar
Duff M. Q-learning for bandit problems. Proc. 12th Internat. Conf. Machine Learn. (1995) (Morgan Kaufmann, Tahoe City, CA) 209–217Crossref, Google Scholar
Feldman D. Contributions to the two-armed bandit problem. Ann. Math. Statist. (1962) 33(3):847–856Crossref, Google Scholar
Frazier P. I., Powell W. B., Dayanik S. A knowledge gradient policy for sequential information collection. SIAM J. Control Optim. (2008) 47(5):2410–2439Crossref, Google Scholar
Frazier P. I., Powell W. B., Dayanik S. The knowledge-gradient policy for correlated normal rewards. INFORMS J. Comput. (2009) 21(4):599–613Link, Google Scholar
Ginebra J., Clayton M. K. Response surface bandits. J. Royal Statist. Soc. (1995) B57(4):771–784Google Scholar
Gittins J. C.Multi-Armed Bandit Allocation Indices (1989) (John Wiley & Sons, New York) Google Scholar
Gittins J. C., Jones D. M., Gani J. A dynamic allocation index for the sequential design of experiments. Progress in Statistics (1974) (North-Holland, Amsterdam) 244–266Google Scholar
Gittins J. C., Jones D. M. A dynamic allocation index for the discounted multiarmed bandit problem. Biometrika (1979) 66(3):561–565Crossref, Google Scholar
Goel A., Khanna S., Null B. The ratio index for budgeted learning, with applications. Proc. 19th Annual ACM-SIAM Sympos. Discrete Algorithms (2009) (SIAM, Philadelphia) 18–27Crossref, Google Scholar
Gupta S. S., Miescke K. J. Bayesian look ahead one stage sampling allocations for selecting the largest normal mean. Statist. Papers (1994) 35(1):169–177Crossref, Google Scholar
Gupta S. S., Miescke K. J. Bayesian look ahead one-stage sampling allocations for selection of the best population. J. Statist. Planning Inference (1996) 54(2):229–244Crossref, Google Scholar
Kaelbling L. P.Learning in Embedded Systems (1993) (MIT Press, Cambridge, MA) Crossref, Google Scholar
Katehakis M. N., Veinott A. F. The multi-armed bandit problem: Decomposition and computation. Math. Oper. Res. (1987) 12(2):262–268Link, Google Scholar
Keener R. Further contributions to the “two-armed bandit” problem. Ann. Statist. (1985) 13(1):418–422Crossref, Google Scholar
Lai T. L. Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. (1987) 15(3):1091–1114Crossref, Google Scholar
Lai T. L., Robbins H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. (1985) 6(1):4–22Crossref, Google Scholar
Mahajan A., Teneketzis D., Hero A. O., Castanon D. A., Cochran D., Kastella K. Multi-armed bandit problems. Foundations and Applications of Sensor Management (2008) (Springer, New York) 121–152Crossref, Google Scholar
Mersereau A. J., Rusmevichientong P., Tsitsiklis J. N. A structured multiarmed bandit problem and the greedy policy. Proc. 47th IEEE Conf. Decision and Control (2008) (IEEE, Piscataway, NJ) 4945–4950Crossref, Google Scholar
Mersereau A. J., Rusmevichientong P., Tsitsiklis J. N. A structured multiarmed bandit problem and the greedy policy. IEEE Trans. Automatic Control (2009) 54(12):2787–2802Crossref, Google Scholar
Pandey S., Chakrabarti D., Agarwal D. Multi-armed bandit problems with dependent arms. Proc. 24th Internat. Conf. Machine Learn. (2007) (New York)721–728ACM International Proceedings SeriesCrossref, Google Scholar
Powell W. B.Approximate Dynamic Programming: Solving the Curses of Dimensionality (2007) (John Wiley & Sons, New York) Crossref, Google Scholar
Ryzhov I. O., Powell W. B., Rosetti M. D., Hill R. R., Johansson B., Dunkin A., Ingalls R. G. A Monte Carlo knowledge gradient method for learning abatement potential of emissions reduction technologies. Proc. 2009 Winter Simulation Conf. (2009a) (IEEE, Piscataway, NJ) 1492–1502Crossref, Google Scholar
Ryzhov I. O., Powell W. B. The knowledge gradient algorithm for online subset selection. Proc. 2009 IEEE Sympos. Adaptive Dynam. Programming and Reinforcement Learn. (2009b) (IEEE, Piscataway, NJ) 137–144Crossref, Google Scholar
Ryzhov I. O., Powell W. B. Information collection on a graph. Oper. Res. (2011) 59(1):188–201Link, Google Scholar
Steele J. M.Stochastic Calculus and Financial Applications (2000) (Springer, New York) Google Scholar
Sutton R. S., Barto A. G.Reinforcement Learning (1998) (MIT Press, Cambridge, MA) Google Scholar
Tesauro G., Galperin G. R., Mozer M. C., Jordan M. I., Pesche T. On-line policy improvement using Monte Carlo search. Advances in Neural Information Processing Systems (1996) 9(MIT Press, Cambridge, MA) 1068–1074Google Scholar
Tewari A., Bartlett P. L., Platt J. C., Koller D., Singer Y., Roweis S. Optimistic linear programming gives logarithmic regret for irreducible MDPs. Advances in Neural Information Processing Systems (2007) 20(MIT Press, Cambridge, MA) 1505–1512Google Scholar
Vermorel J., Mohri M. Multi-armed bandit algorithms and empirical evaluation. Proc. 16th Eur. Conf. Machine Learn. (2005) (Springer-Verlag, Berlin) 437–448Crossref, Google Scholar
Washburn R., Hero A. O., Castanon D. A., Cochran D., Kastella K. Applications of multi-armed bandits to sensor management. Foundations and Applications of Sensor Management (2008) (Springer, New York) 153–176Crossref, Google Scholar
Whittle P. Multi-armed bandits and the Gittins index. J. Royal Statist. Soc. (1980) B42(2):143–149Google Scholar
Yao Y., Ho H., Ing C., Lai T. Some results on the Gittins index for a normal reward process. Time Series and Related Topics: In Memory of Ching-Zong Wei (2006) (Institute of Mathematical Statistics, Beachwood, OH) 284–294Crossref, Google Scholar

Volume 60, Issue 1

January-February 2012

Pages iii-248

Article Information

Supplemental Material

Metrics

Information

Received:September 01, 2009
Accepted:June 01, 2011
Published Online:February 01, 2012

Cite as

Ilya O. Ryzhov, Warren B. Powell, Peter I. Frazier, (2012) The Knowledge Gradient Algorithm for a General Class of Online Learning Problems. Operations Research 60(1):180-195.

https://doi.org/10.1287/opre.1110.0999

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

The Knowledge Gradient Algorithm for a General Class of Online Learning Problems

References

Volume 60, Issue 1

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News