Bayesian Incentive-Compatible Bandit Exploration

Published Online:https://doi.org/10.1287/opre.2019.1949

References

  • Agarwal A, Hsu D, Kale S, Langford J, Li L, Schapire R (2014) Taming the monster: A fast and simple algorithm for contextual bandits. 31st Internat. Conf. Machine Learn. (ICML), 1638–1646.Google Scholar
  • Alon N, Cesa-Bianchi N, Dekel O, Koren T (2015) Online learning with feedback graphs: Beyond bandits. 28th Conf. Learn Theory (COLT), 23–35.Google Scholar
  • Alon N, Cesa-Bianchi N, Gentile C, Mansour Y (2013) From bandits to experts: A tale of domination and independence. Advances in Neural Information Processing Systems (NIPS), vol. 27 (Curran Associates, Red Hook, NY), 1610–1618.Google Scholar
  • Arango J, Arts K, Braunschweiger P, Chadwick GL, Forster DG, Hansen K (2012) Good Clinical Practice Guide (CITI Program, Miami).Google Scholar
  • Athey S, Segal I (2013) An efficient dynamic mechanism. Econometrica 81(6):2463–2485.Google Scholar
  • Audibert JY, Bubeck S (2010) Regret bounds and minimax policies under partial monitoring. J. Machine Learn. Res. 11:2785–2836.Google Scholar
  • Audibert J-Y, Bubeck S, Lugosi G (2011) Minimax policies for combinatorial prediction games. 24th Conf. Learn. Theory (COLT), 107–132.Google Scholar
  • Auer P, Cesa-Bianchi N, Fischer P (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.CrossrefGoogle Scholar
  • Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002b) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Google Scholar
  • Babaioff M, Kleinberg R, Slivkins A (2015a) Truthful mechanisms with implicit payment computation. J. ACM 62(2):Article 10.Google Scholar
  • Babaioff M, Sharma, Y, Slivkins A (2014) Characterizing truthful multi-armed bandit mechanisms. SIAM J. Comput. 43(1):194–230.Google Scholar
  • Babaioff M, Dughmi S, Kleinberg RD, Slivkins A (2015b) Dynamic pricing with limited supply. ACM Trans. Econom. Comput. 3(1):4.Google Scholar
  • Badanidiyuru A, Kleinberg R, Slivkins A (2018) Bandits with knapsacks. J. ACM 65(3):Article 13.Google Scholar
  • Bahar G, Smorodinsky R, Tennenholtz M (2016) Economic recommendation systems. 16th ACM Conf. Electronic Commerce (EC) (ACM, New York), 757.Google Scholar
  • Bartók G, Foster DP, Pál D, Rakhlin A, Szepesvári C (2014) Partial monitoring - classification, regret bounds, and algorithms. Math. Oper. Res. 39(4):967–997.LinkGoogle Scholar
  • Bastani H, Bayati M, Khosravi K (2017) Mostly exploration-free algorithms for contextual bandits. Preprint, submitted April 28, https://arxiv.org/abs/1704.09011.Google Scholar
  • Bergemann D, Morris S (2016) Information design, Bayesian persuasion and Bayes correlated equilibrium. Amer. Econom. Rev. 106(5):586–591.Google Scholar
  • Bergemann D, Välimäki J (2010) The dynamic pivot mechanism. Econometrica 78(2):771–789.Google Scholar
  • Besbes O, Zeevi A (2009) Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Oper. Res. 57(6):1407–1420.LinkGoogle Scholar
  • Bimpikis K, Papanastasiou Y, Savva N (2018) Crowdsourcing exploration. Management Sci. 64(4):1477–1973.Google Scholar
  • Bolton P, Harris C (1999) Strategic experimentation. Econometrica 67(2):349–374.CrossrefGoogle Scholar
  • Bubeck S, Cesa-Bianchi N (2012) Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. (Now Publishers, Boston).Google Scholar
  • Bubeck S, Munos R, Stoltz G (2011) Pure exploration in multi-armed bandit problems. Theoret. Comput. Sci. 412(19):1832–1852.CrossrefGoogle Scholar
  • Celis LE, Vishnoi NK (2017) Fair personalization. Preprint, submitted July 7, https://arxiv.org/abs/1707.02260.Google Scholar
  • Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Che Y-K, Hörner J (2019) Optimal design for social learning. Quart. J. Econom. 133(2):871–925.Google Scholar
  • Chen B, Frazier PI, Kempe D (2018) Incentivizing exploration by heterogeneous users. Conf. Learn Theory (COLT), 798–818.Google Scholar
  • Chouldechova A (2017) Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5(2):153–163.CrossrefGoogle Scholar
  • Chow S-C, Chang M (2008) Adaptive design methods in clinical trials – A review. Orphanet J. Rare Diseases 3:11.CrossrefGoogle Scholar
  • Detry, MA, Lewis RJ, Broglio KR, Connor JT, Berry SM, Berry DA (2012) Standards for the design, conduct, and evaluation of adaptive randomized clinical trials. Report, Patient-Centered Outcomes Research Institute (PCORI), Washington, DC.Google Scholar
  • Devanur N, Kakade SM (2009) The price of truthfulness for pay-per-click auctions. 10th ACM Conf. Electronic Commerce (EC), 99–106.Google Scholar
  • Doob JL (1949) Application of the theory of martingales. Le Calcul des Probabilités et ses Applications (Centre National de la Recherche Scientifique, Paris), 23–27.Google Scholar
  • Dudík M, Hsu D, Kale S, Karampatziakis N, Langford J, Reyzin L, Zhang T (2011) Efficient optimal leanring for contextual bandits. 27th Conf. Uncertainty Artificial Intelligence (UAI) (AUAI Press, Arlington, VA), 169–178.Google Scholar
  • Dwork C, Hardt M, Pitassi T, Reingold O, Zemel R (2012) Fairness through awareness. Innovations Theoret. Comput. Sci. Conf. (ITCS) (ACM, New York), 214–226.Google Scholar
  • Ely J, Frankel A, Kamenica E (2015) Suspense and Surprise. J. Political Econom. 123(1):215–260.CrossrefGoogle Scholar
  • Even-Dar E, Mannor S, Mansour Y (2002) PAC bounds for multi-armed bandit and Markov decision processes. 15th Conf. Learn. Theory (COLT) Lecture Notes in Computer Science, vol 2375 (Springer, Berlin, Heidelberg), 255–270.Google Scholar
  • Even-Dar E, Mannor S, Mansour Y (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Machine Learn. Res. 7:1079–1105.Google Scholar
  • Frazier P, Kempe D, Kleinberg JM, Kleinberg R (2014) Incentivizing exploration. ACM Conf. Econom. Comput. (ACM EC) (ACM, New York), 5–22.Google Scholar
  • Freidlin B, Simon R (2005) Adaptive signature design: An adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients. Clinical Cancer Res. 11(21):7872–7878.CrossrefGoogle Scholar
  • Freidlin B, Jiang W, Simon R (2007) Biomarker adaptive threshold design: A procedure for evaluating treatment with possible biomarker-defined subset effect. J. National Cancer Institute 99(13):1036–1043.CrossrefGoogle Scholar
  • Freidlin B, Jiang W, Simon R (2010) Cross-validated adaptive signature design. Clinical Cancer Res. 16(2):691–698.CrossrefGoogle Scholar
  • Freidlin B, Korn EL, Gray R, Martin A (2008) Multi-arm clinical trials of new agents: Some design considerations. Clinical Cancer Res. 14(14):43–68.CrossrefGoogle Scholar
  • Garimella S (2015) New oncology clinical trial designs: What works and what doesn’t? Evidence-Based Oncology 21(10):349–350.Google Scholar
  • Ghosal S (1996) A review of consistency and convergence rates of posterior distribution. Proc. Varanashi Sympos. Bayesian Inference, Varanasi, India.Google Scholar
  • Ghosh A, Hummel P (2013) Learning and incentives in user-generated content: multi-armed bandits with endogenous arms. Innovations in Theoretical Computer Science Conf. (ITCS) (ACM, New York), 233–246.Google Scholar
  • Gittins JC (1979) Bandit processes and dynamic allocation indices (with discussion). J. Royal Statist. Soc. B 41(2):148–177.Google Scholar
  • Gittins J, Glazebrook K, Weber R (2011) Multi-Armed Bandit Allocation Indices (John Wiley & Sons).CrossrefGoogle Scholar
  • Goel A, Khanna S, Null B (2009) The ratio index for budgeted learning, with applications. 20th ACM-SIAM Sympos. Discrete Algorithms (SODA) (SIAM, Philadelphia), 18–27.Google Scholar
  • Guha S, Munagala K (2007) Approximation algorithms for budgeted learning problems. 39th ACM Sympos. Theory of Computing (STOC) (ACM, New York), 104–113.Google Scholar
  • György A, Linder T, Lugosi G, Ottucsák G (2007) The on-line shortest path problem under partial monitoring. J. Machine Learn. Res. 8:2369–2403.Google Scholar
  • Hardt M, Price E, Srebro N (2016) Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems (NIPS), vol. 29 (Curran Associates, Red Hook, NY), 3315–3323.Google Scholar
  • Hellmich M (2001) Monitoring clinical trials with multiple arms. Biometrics 57(3):892–898.CrossrefGoogle Scholar
  • Ho C-J, Slivkins A, Vaughan JW (2016) Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. J. Artificial Intelligence Res. 55:317–359.Google Scholar
  • Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58(301):13–30.CrossrefGoogle Scholar
  • Hörner J, Skrzypacz A (2015) Selling information. J. Political Econom. 24(6):1515–1562.Google Scholar
  • Immorlica N, Mao J, Slivkins A, Wu S (2019) Bayesian exploration with heterogenous agents. Internat. World Wide Web Conf. Committee (IW3C2), Geneva, Switzerland, 751–761.Google Scholar
  • Joseph M, Kearns M, Morgenstern JH, Roth A (2016) Fairness in learning: Classic and contextual bandits. Advances in Neural Information Processing Systems (NIPS), vol. 29 (Curran Associates, Red Hook, NY), 325–333..Google Scholar
  • Kakade SM, Lobel I, Nazerzadeh H (2013) Optimal dynamic mechanism design and the virtual-pivot mechanism. Oper. Res. 61(4):837–854.LinkGoogle Scholar
  • Kale S, Reyzin L, Schapire RE (2010) Non-stochastic bandit slate problems. Advances in Neural Information Processing Systems (NIPS), vol. 23 (Curran Associates, Red Hook, NY), 1054–1062.Google Scholar
  • Kamenica E, Gentzkow M (2011) Bayesian persuasion. Amer. Econom. Rev. 101(6):2590–2615.CrossrefGoogle Scholar
  • Kannan S, Kearns M, Morgenstern J, Pai M, Roth A, Vohra R, Wu ZS (2017) Fairness incentives for myopic agents. ACM Conf. Econom. Comput. (ACM EC) (ACM, New York), 369–386.Google Scholar
  • Kannan S, Morgenstern J, Roth A, Waggoner B, Wu ZS (2018) A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. Advances in Neural Information Processing Systems (NIPS), vol. 31 (Curran Associates, Red Hook, NY), 369–386.Google Scholar
  • Kearns M, Roth A, Wu ZS (2017) Meritocratic fairness for cross-population selection. Internat. Conf. Machine Learn. (ICML), 1828–1836.Google Scholar
  • Keller G, Rady S, Cripps M (2005) Strategic experimentation with exponential bandits. Econometrica 73(1):39–68.CrossrefGoogle Scholar
  • Kleinberg J, Mullainathan S, Raghavan M (2017) Inherent trade-offs in the fair determination of risk scores. Innovations Theoret. Comput. Sci. Conf. (ITCS), 43:1–43:23.Google Scholar
  • Kleinberg RD, Leighton FT (2003), The value of knowing a demand curve: Bounds on regret for online posted-price auctions. IEEE Sympos. Foundations Comput. Sci. (FOCS), (IEEE, Piscataway, NJ), 594–605.Google Scholar
  • Kleinberg RD, Waggoner B, Weyl EG (2016) Descending price optimally coordinates search. 17th ACM Conf. Econom. Comput. (ACM-EC) (ACM, New York), 23–24.Google Scholar
  • Kremer I, Mansour Y, Perry M (2014) Implementing the “wisdom of the crowd.” J. Political Econom. 122(5):988–1012.Google Scholar
  • Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.CrossrefGoogle Scholar
  • Langford J, Zhang T (2007) The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems (NIPS), vol. 20 (Curran Associates, Red Hook, NY), 325–333.Google Scholar
  • Liu Y, Radanovic G, Dimitrakakis C, Mandal D, Parkes DC (2017) Calibrated fairness in bandits. Preprint, submitted July 6, https://arxiv.org/abs/1707.01875.Google Scholar
  • Maitland ML, Schilsky RL (2011) Clinical trials in the era of personalized oncology. CA: Cancer J. Clinicians 61(6):365–381.CrossrefGoogle Scholar
  • Mannor S, Tsitsiklis JN (2004) The sample complexity of exploration in the multi-armed bandit problem. J. Machine Learn. Res. 5:623–648.Google Scholar
  • Mansour Y, Slivkins A, Syrgkanis V, Wu S (2016) Bayesian exploration: Incentivizing exploration in Bayesian games. Preprint, submitted February 24, https://arxiv.org/abs/1602.07570.Google Scholar
  • Maron O, Moore AW (1993) Hoeffding races: Accelerating model selection search for classification and function approximation. Advances in Neural Information Processing Systems (NIPS), vol. 6 (Curran Associates, Red Hook, NY), 59–66.Google Scholar
  • Maron O, Moore AW (1997) The racing algorithm: Model selection for lazy learners. Artificial Intelligence Rev. 11(1–5):193–225.CrossrefGoogle Scholar
  • Mitzenmacher M, Upfal E (2005) Probability and Computing: Randomized Algorithms and Probabilistic Analysis (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Parmar MKB, Carpenter J, Sydes MR (2014) More multiarm randomised trials of superiority are needed. Lancet 384(9940):283–284.CrossrefGoogle Scholar
  • Raghavan M, Slivkins A, Vaughan JW, Wu ZS (2018) The externalities of exploration and how data diversity helps exploitation. Conf. Learn. Theory (COLT), 1724–1738.Google Scholar
  • Rayo L, Segal I (2010) Optimal information disclosure. J. Political Econom. 118(5):949–987.CrossrefGoogle Scholar
  • Redig A, Jänne P (2015) Basket trials and the evolution of clinical trial design in an era of genomic medicine. J. Clinical Oncology 33(9):975–977.CrossrefGoogle Scholar
  • Schmit S, Riquelme C (2018) Human interaction with recommendation systems. Internat. Conf. Artificial Intelligence Statist. (AISTATS), 862–870.Google Scholar
  • Singla A, Krause A (2013) Truthful incentives in crowdsourcing tasks using regret minimization mechanisms. 22nd Internat. World Wide Web Conf. (WWW), Geneva, Switzerland, 1167–1178.Google Scholar
  • Slivkins A, Vaughan JW (2013) Online decision making in crowdsourcing markets: Theoretical challenges. SIGecom Exchanges 12(2):4–23.Google Scholar
  • Taneva I (2016) Information design. Working paper, University of Edinburgh, Edinburgh, UK.Google Scholar
  • Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3–4):285–294.CrossrefGoogle Scholar
  • Vaidyanathan G (2012) Redefining clinical trials: The age of personalized medicine. Cell 148(6):1079–1080.CrossrefGoogle Scholar
  • Wang C-C, Kulkarni SR, Poor HV (2005) Bandit problems with side observations. IEEE Trans. Automat. Control. 50(3):338–355.CrossrefGoogle Scholar
  • Wen Z, Kveton B, Ashkan A (2015) Efficient learning in large-scale combinatorial semi-bandits. 32nd Internat. Conf. Machine Learn. (ICML), 1113–1122.Google Scholar
  • Woodroofe M (1979) A one-armed bandit problem with a concomitant variable. J. Amer. Statist. Assoc. 74(368):799–806.CrossrefGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.