The Knowledge Gradient Algorithm for a General Class of Online Learning Problems

Published Online:https://doi.org/10.1287/opre.1110.0999

References

  • Auer P., Cesa-Bianchi N., Fischer P. Finite-time analysis of the multiarmed bandit problem. Machine Learn. (2002) 47(2–3):235–256CrossrefGoogle Scholar
  • Berry D. A., Pearson L. M. Optimal designs for clinical trials with dichotomous responses. Statist. Medicine (1985) 4(4):497–508CrossrefGoogle Scholar
  • Bertsimas D., Niño-Mora J. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Oper. Res. (2000) 48(1):80–90LinkGoogle Scholar
  • Brezzi M., Lai T. L. Incomplete learning from endogenous data in dynamic allocation. Econometrica (2000) 68(6):1511–1516CrossrefGoogle Scholar
  • Brezzi M., Lai T. L. Optimal learning and experimentation in bandit problems. J. Econom. Dynam. Control (2002) 27(1):87–108CrossrefGoogle Scholar
  • Chick S. E., Gans N. Economic analysis of simulation selection problems. Management Sci. (2009) 55(3):421–437LinkGoogle Scholar
  • Chick S. E., Branke J., Schmidt C. Sequential sampling to myopically maximize the expected value of information. INFORMS J. Comput. (2010) 22(1):71–80LinkGoogle Scholar
  • DeGroot M. H.Optimal Statistical Decisions (1970) (John Wiley & Sons, Hoboken, NJ) Google Scholar
  • Duff M. Q-learning for bandit problems. Proc. 12th Internat. Conf. Machine Learn. (1995) (Morgan Kaufmann, Tahoe City, CA) 209–217CrossrefGoogle Scholar
  • Feldman D. Contributions to the two-armed bandit problem. Ann. Math. Statist. (1962) 33(3):847–856CrossrefGoogle Scholar
  • Frazier P. I., Powell W. B., Dayanik S. A knowledge gradient policy for sequential information collection. SIAM J. Control Optim. (2008) 47(5):2410–2439CrossrefGoogle Scholar
  • Frazier P. I., Powell W. B., Dayanik S. The knowledge-gradient policy for correlated normal rewards. INFORMS J. Comput. (2009) 21(4):599–613LinkGoogle Scholar
  • Ginebra J., Clayton M. K. Response surface bandits. J. Royal Statist. Soc. (1995) B57(4):771–784Google Scholar
  • Gittins J. C.Multi-Armed Bandit Allocation Indices (1989) (John Wiley & Sons, New York) Google Scholar
  • Gittins J. C., Jones D. M., Gani J. A dynamic allocation index for the sequential design of experiments. Progress in Statistics (1974) (North-Holland, Amsterdam) 244–266Google Scholar
  • Gittins J. C., Jones D. M. A dynamic allocation index for the discounted multiarmed bandit problem. Biometrika (1979) 66(3):561–565CrossrefGoogle Scholar
  • Goel A., Khanna S., Null B. The ratio index for budgeted learning, with applications. Proc. 19th Annual ACM-SIAM Sympos. Discrete Algorithms (2009) (SIAM, Philadelphia) 18–27CrossrefGoogle Scholar
  • Gupta S. S., Miescke K. J. Bayesian look ahead one stage sampling allocations for selecting the largest normal mean. Statist. Papers (1994) 35(1):169–177CrossrefGoogle Scholar
  • Gupta S. S., Miescke K. J. Bayesian look ahead one-stage sampling allocations for selection of the best population. J. Statist. Planning Inference (1996) 54(2):229–244CrossrefGoogle Scholar
  • Kaelbling L. P.Learning in Embedded Systems (1993) (MIT Press, Cambridge, MA) CrossrefGoogle Scholar
  • Katehakis M. N., Veinott A. F. The multi-armed bandit problem: Decomposition and computation. Math. Oper. Res. (1987) 12(2):262–268LinkGoogle Scholar
  • Keener R. Further contributions to the “two-armed bandit” problem. Ann. Statist. (1985) 13(1):418–422CrossrefGoogle Scholar
  • Lai T. L. Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. (1987) 15(3):1091–1114CrossrefGoogle Scholar
  • Lai T. L., Robbins H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. (1985) 6(1):4–22CrossrefGoogle Scholar
  • Mahajan A., Teneketzis D., Hero A. O., Castanon D. A., Cochran D., Kastella K. Multi-armed bandit problems. Foundations and Applications of Sensor Management (2008) (Springer, New York) 121–152CrossrefGoogle Scholar
  • Mersereau A. J., Rusmevichientong P., Tsitsiklis J. N. A structured multiarmed bandit problem and the greedy policy. Proc. 47th IEEE Conf. Decision and Control (2008) (IEEE, Piscataway, NJ) 4945–4950CrossrefGoogle Scholar
  • Mersereau A. J., Rusmevichientong P., Tsitsiklis J. N. A structured multiarmed bandit problem and the greedy policy. IEEE Trans. Automatic Control (2009) 54(12):2787–2802CrossrefGoogle Scholar
  • Pandey S., Chakrabarti D., Agarwal D. Multi-armed bandit problems with dependent arms. Proc. 24th Internat. Conf. Machine Learn. (2007) (New York)721–728ACM International Proceedings SeriesCrossrefGoogle Scholar
  • Powell W. B.Approximate Dynamic Programming: Solving the Curses of Dimensionality (2007) (John Wiley & Sons, New York) CrossrefGoogle Scholar
  • Ryzhov I. O., Powell W. B., Rosetti M. D., Hill R. R., Johansson B., Dunkin A., Ingalls R. G. A Monte Carlo knowledge gradient method for learning abatement potential of emissions reduction technologies. Proc. 2009 Winter Simulation Conf. (2009a) (IEEE, Piscataway, NJ) 1492–1502CrossrefGoogle Scholar
  • Ryzhov I. O., Powell W. B. The knowledge gradient algorithm for online subset selection. Proc. 2009 IEEE Sympos. Adaptive Dynam. Programming and Reinforcement Learn. (2009b) (IEEE, Piscataway, NJ) 137–144CrossrefGoogle Scholar
  • Ryzhov I. O., Powell W. B. Information collection on a graph. Oper. Res. (2011) 59(1):188–201LinkGoogle Scholar
  • Steele J. M.Stochastic Calculus and Financial Applications (2000) (Springer, New York) Google Scholar
  • Sutton R. S., Barto A. G.Reinforcement Learning (1998) (MIT Press, Cambridge, MA) Google Scholar
  • Tesauro G., Galperin G. R., Mozer M. C., Jordan M. I., Pesche T. On-line policy improvement using Monte Carlo search. Advances in Neural Information Processing Systems (1996) 9(MIT Press, Cambridge, MA) 1068–1074Google Scholar
  • Tewari A., Bartlett P. L., Platt J. C., Koller D., Singer Y., Roweis S. Optimistic linear programming gives logarithmic regret for irreducible MDPs. Advances in Neural Information Processing Systems (2007) 20(MIT Press, Cambridge, MA) 1505–1512Google Scholar
  • Vermorel J., Mohri M. Multi-armed bandit algorithms and empirical evaluation. Proc. 16th Eur. Conf. Machine Learn. (2005) (Springer-Verlag, Berlin) 437–448CrossrefGoogle Scholar
  • Washburn R., Hero A. O., Castanon D. A., Cochran D., Kastella K. Applications of multi-armed bandits to sensor management. Foundations and Applications of Sensor Management (2008) (Springer, New York) 153–176CrossrefGoogle Scholar
  • Whittle P. Multi-armed bandits and the Gittins index. J. Royal Statist. Soc. (1980) B42(2):143–149Google Scholar
  • Yao Y., Ho H., Ing C., Lai T. Some results on the Gittins index for a normal reward process. Time Series and Related Topics: In Memory of Ching-Zong Wei (2006) (Institute of Mathematical Statistics, Beachwood, OH) 284–294CrossrefGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.