Speed Up the Cold-Start Learning in Two-Sided Bandits with Many Arms

Published Online:https://doi.org/10.1287/mnsc.2022.03394

References

  • Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Adv. Neural Inform. Processing Systems, vol. 24 (Curran Associates Inc., Red Hook, NY), 2312–2320.Google Scholar
  • Adams WJ, Yellen JL (1976) Commodity bundling and the burden of monopoly. Quart. J. Econom. 90(3):475–498.Google Scholar
  • Agrawal S, Devanur NR (2014) Bandits with concave rewards and convex knapsacks. Babaioff M, Conitzer V, Easley D, Nisan N, eds. Proc. 15th ACM Conf. Econom. Comput. (ACM, New York), 989–1006.Google Scholar
  • Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. Conf. Learn. Theory, vol. 39 (PMLR, New York), 1–26.Google Scholar
  • Agarwal A, Dahleh M, Shah D, Shen D (2023) Causal matrix completion. Neu G, Rosasco L, eds. Proc. 36th Ann. Conf. Learn. Theory (PMLR, New York), 3821–3826.Google Scholar
  • Allouah A, Bahamou A, Besbes O (2023) Optimal pricing with a single point. Management Sci. 69(10):5866–5882.LinkGoogle Scholar
  • Athey S, Bayati M, Doudchenko N, Imbens G, Khosravi K (2021) Matrix completion methods for causal panel data models. J. Amer. Statist. Assoc. 116(536):1716–1730.CrossrefGoogle Scholar
  • Auer P, Cesa-Bianchi N, Fischer P (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2):235–256.CrossrefGoogle Scholar
  • Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002b) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.CrossrefGoogle Scholar
  • Bajari P, Burdick B, Imbens GW, Masoero L, McQueen J, Richardson T, Rosen IM (2021) Multiple randomization designs. Preprint, submitted December 27, https://arxiv.org/abs/2112.13495.Google Scholar
  • Banerjee S, Sinclair SR, Tambe M, Xu L, Yu CL (2022) Artificial replay: A meta-algorithm for harnessing historical data in bandits. Preprint, submitted October 1, https://arxiv.org/abs/2210.00025.Google Scholar
  • Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.LinkGoogle Scholar
  • Bastani H, Harsha P, Perakis G, Singhvi D (2022) Learning personalized product recommendations with customer disengagement. Manufacturing Service Oper. Management 24(4):2010–2028.LinkGoogle Scholar
  • Bayati M, Hamidi N, Johari R, Khosravi K (2020) Unreasonable effectiveness of greedy algorithms in multi-armed bandit with many arms. Adv. Neural Inform. Processing Systems 33:1713–1723.Google Scholar
  • Berry DA, Chen RW, Zame A, Heath DC, Shepp LA (1997) Bandit problems with infinitely many arms. Ann. Statist. 25(5):2103–2116.Google Scholar
  • Besbes O, Mouchtaki O (2023) How big should your data really be? Data-driven newsvendor: Learning one sample at a time. Management Sci. 69(10):5848–5865.LinkGoogle Scholar
  • Bhargava HK (2022) The creator economy: Managing ecosystem supply, revenue sharing, and platform design. Management Sci. 68(7):5233–5251.LinkGoogle Scholar
  • Bonald T, Proutiere A (2013) Two-target algorithms for infinite-armed bandits with Bernoulli rewards. Adv. Neural Inform. Processing Systems 26:2184–2192.Google Scholar
  • Brooks S, Gelman A, Jones G, Meng X-L (2011) Handbook of Markov Chain Monte Carlo (CRC Press, Boca Raton, FL).CrossrefGoogle Scholar
  • Candes E, Recht B (2012) Exact matrix completion via convex optimization. Comm. ACM 55(6):111–119.CrossrefGoogle Scholar
  • Carpentier A, Valko M (2015) Simple regret for infinitely many armed bandits. Bach F, Blei D, eds. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 1133–1141.Google Scholar
  • Chaudhuri AR, Kalyanakrishnan S (2018) Quantile-Regret Minimisation in Infinitely Many-Armed Bandits (UAI).Google Scholar
  • Chen Y (2015) Incoherence-optimal matrix completion. IEEE Trans. Inform. Theory 61(5):2909–2923.CrossrefGoogle Scholar
  • Chen Y, Chi Y, Fan J, Ma C, Yan Y (2020) Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM J. Optim. 30(4):3098–3121.CrossrefGoogle Scholar
  • Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 208–214.Google Scholar
  • Dani V, Hayes TP, Kakade SM (2008) Stochastic linear optimization under bandit feedback. Servedio RA, Zhang T, eds. Proc. Conf. Learn. Theory (Omnipress, Madison, WI), 355–366.Google Scholar
  • Davenport MA, Romberg J (2016) An overview of low-rank matrix recovery from incomplete observations. IEEE J. Selected Topics Signal Processing 10(4):608–622.CrossrefGoogle Scholar
  • Farias V, Li AA, Peng T (2022) Uncertainty quantification for low-rank matrix completion with heterogeneous and sub-exponential noise. Camps-Valls G, Ruiz FJR, Guyon I, eds. Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1179–1189.Google Scholar
  • Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Adv. Neural Inform. Processing Systems 23:586–594.Google Scholar
  • Geng T, Lin X, Nair HS (2020) Online evaluation of audiences for targeted advertising via bandit experiments. Proc. AAAI Conf. Artificial Intelligence 34(1):13273–13279.CrossrefGoogle Scholar
  • Goldenshluger A, Zeevi A (2013) A linear response bandit problem. Stochastic Systems 3(1):230–261.LinkGoogle Scholar
  • Gopalan A, Mannor S, Mansour Y (2014) Thompson sampling for complex online problems. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 100–108.Google Scholar
  • Gupta V, Kallus N (2022) Data pooling in stochastic optimization. Management Sci. 68(3):1595–1615.LinkGoogle Scholar
  • Gupta V, Rusmevichientong P (2021) Small-data, large-scale linear optimization with uncertain objectives. Management Sci. 67(1):220–241.LinkGoogle Scholar
  • Hamidi N, Bayati M, Gupta K (2019) Personalizing many decisions with high-dimensional covariates. Adv. Neural Inform. Processing Systems 32:11469–11480.Google Scholar
  • Hastie T, Tibshirani R, Wainwright M (2015) Statistical Learning with Sparsity: The Lasso and Generalizations (Taylor & Francis, Boca Raton, FL).CrossrefGoogle Scholar
  • Johari R, Li H, Liskovich I, Weintraub GY (2022) Experimental design in two-sided platforms: An analysis of bias. Management Sci. 68(10):7069–7089.LinkGoogle Scholar
  • Jun K-S, Willett R, Wright S, Nowak R (2019) Bilinear bandits with low-rank structure. Chaudhuri K, Salakhutdinov R, eds. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 3163–3172.Google Scholar
  • Kaelbling LP (1993) Learning in Embedded Systems (MIT Press, Cambridge, MA).CrossrefGoogle Scholar
  • Kallus N, Udell M (2020) Dynamic assortment personalization in high dimensions. Oper. Res. 68(4):1020–1037.LinkGoogle Scholar
  • Katariya S, Kveton B, Szepesvari C, Vernade C, Wen Z (2017) Stochastic rank-1 bandits. Singh A, Zhu J, eds. Proc. Artificial Intelligence Statist. (PMLR, New York), 392–401.Google Scholar
  • Katehakis MN, Robbins H (1995) Sequential choice from several populations. Proc. Natl. Acad. Sci. USA 92(19):8584.CrossrefGoogle Scholar
  • Keshavan RH, Montanari A, Oh S (2010) Matrix completion from a few entries. IEEE Trans. Informs. Theory 56(6):2980–2998.CrossrefGoogle Scholar
  • Keskin NB, Li Y, Sunar N (2024) Data-driven clustering and feature-based retail electricity pricing with smart meters. Oper. Res. 73(5):2636–2660.LinkGoogle Scholar
  • Kveton B, Szepesvári C, Rao A, Wen Z, Abbasi-Yadkori Y, Muthukrishnan S (2017) Stochastic low-rank bandits. Preprint, submitted December 13, https://arxiv.org/abs/1712.04644.Google Scholar
  • Kveton B, Zaheer M, Szepesvari C, Li L, Ghavamzadeh M, Boutilier C (2020) Randomized exploration in generalized linear bandits. Proc. Internat. Conf. Artificial Intelligence Statist., 2066–2076.Google Scholar
  • Lai TL (1987) Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. 15(3):1091–1114.CrossrefGoogle Scholar
  • Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, MA).CrossrefGoogle Scholar
  • Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR, New York), 2071–2080.Google Scholar
  • Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web (ACM, New York), 661–670.Google Scholar
  • Lika B, Kolomvatsos K, Hadjiefthymiades S (2014) Facing the cold start problem in recommender systems. Expert Systems Appl. 41(4):2065–2073.CrossrefGoogle Scholar
  • Lops P, De Gemmis M, Semeraro G (2011) Content-based recommender systems: State of the art and trends. Ricci F, Rokach L, Shapira B, Kantor P, eds. Recommender Systems Handbook (Springer, Berlin), 73–105.CrossrefGoogle Scholar
  • Lu S (2019) Beyond A/B testing: Multi-armed bandit experiments. Medium (April 4), https://medium.com/data-science/beyond-a-b-testing-multi-armed-bandit-experiments-1493f709f804.Google Scholar
  • Lu Y, Meisami A, Tewari A (2021) Low-rank generalized linear bandit problems. Banerjee A, Fukumizu K, eds. Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 460–468.Google Scholar
  • Lu X, Wen Z, Kveton B (2018) Efficient online recommendation via low-rank ensemble sampling. Pera S, Rendle S, Vlachos M, eds. Proc. 12th ACM Conf. Recommender Systems (ACM, New York), 460–464.Google Scholar
  • McAfee RP, McMillan J, Whinston MD (1989) Multiproduct monopoly, commodity bundling, and correlation of values. Quart. J. Econom. 104(2):371–383.CrossrefGoogle Scholar
  • Mersereau AJ, Rusmevichientong P, Tsitsiklis JN (2009) A structured multiarmed bandit problem and the greedy policy. IEEE Trans. Automated Control 54(12):2787–2802.CrossrefGoogle Scholar
  • Miao S, Chen X, Chao X, Liu J, Zhang Y (2022) Context-based dynamic pricing with online clustering. Production Oper. Management 31(9):3559–3575.CrossrefGoogle Scholar
  • Nakamura A (2015) A UCB-like strategy of collaborative filtering. Bui TD, Ho TB, eds. Proc. Asian Conf. Machine Learn. (PMLR, New York), 315–329.Google Scholar
  • Pal S, Jain P (2023) Online low rank matrix completion. Proc. 11th Internat. Conf. Learn. Representations (ICLR, Appleton, WI).Google Scholar
  • Rusmevichientong P, Tsitsiklis JN (2010) Linearly parameterized bandits. Math. Oper. Res. 35(2):395–411.LinkGoogle Scholar
  • Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z (2018) A tutorial on thompson sampling. Foundations Trends® Machine Learn. 11(1):1–96. CrossrefGoogle Scholar
  • Sam T, Chen Y, Yu CL (2023) Overcoming the long horizon barrier for sample-efficient reinforcement learning with latent low-rank structure. Proc. ACM Measurment Anal. Comput. Systems 7(2):1–60.CrossrefGoogle Scholar
  • Schmalensee R (1984) Gaussian demand and commodity bundling. J. Bus. 57(1):S211–S230.CrossrefGoogle Scholar
  • Stigler GJ (1963) United States v. Loew’s Inc.: A note on block-booking. Supreme Court Rev. 1963:152–157.CrossrefGoogle Scholar
  • Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.CrossrefGoogle Scholar
  • Trinh C, Kaufmann E, Vernade C, Combes R (2020) Solving Bernoulli rank-one bandits with unimodal Thompson sampling. Algorithmic Learning Theory (PMLR, New York), 862–889.Google Scholar
  • Vershynin R (2010) Introduction to the non-asymptotic analysis of random matrices. Preprint, submitted November 12, https://arxiv.org/abs/1011.3027.Google Scholar
  • Volkovs M, Yu G, Poutanen T (2017) Dropoutnet: Addressing cold start in recommender systems. Adv. Neural Inform. Processing Systems 30:4957–4966.Google Scholar
  • Wang Y, Yves Audibert J, Munos R (2009) Algorithms for infinitely many-armed bandits. Koller D, Schuurmans D, Bengio Y, Bottou L, eds. Adv. Neural Inform. Processing Systems, vol. 21 (Curran Associates, Inc., Red Hook, NY), 1729–1736.Google Scholar
  • Xi X, Yu CL, Chen Y (2023) Entry-specific bounds for low-rank matrix completion under highly non-uniform sampling. Proc. IEEE Internat. Sympos. Inform. Theory (IEEE, Piscataway, NJ), 2625–2630.Google Scholar
  • Xu K, Bastani H (2021) Multitask learning and bandits via robust statistics. Management Sci. 71(9):7752–7773.Google Scholar
  • Ye Z, Zhang DJ, Zhang H, Zhang R, Chen X, Xu Z (2023) Cold start to improve market thickness on online advertising platforms: Data-driven algorithms and field experiments. Management Sci. 69(7):3838–3860.LinkGoogle Scholar
  • Zhang M, Tang J, Zhang X, Xue X (2014) Addressing cold start in recommender systems: A semi-supervised co-training algorithm. Berkovsky S, Bennett PN, Murdock V, Moffat A, eds. Proc. 37th Internat. ACM SIGIR Conf. Res. Development Informat. Retrieval (ACM, New York), 73–82.Google Scholar
  • Zhang DJ, Hu M, Liu X, Wu Y, Li Y (2022) Netease cloud music data. Manufacturing Service Oper. Management 24(1):275–284.LinkGoogle Scholar
  • Zhou J, Hao B, Wen Z, Zhang J, Sun WW (2025) Stochastic low-rank tensor bandits for multi-dimensional online decision making. J. Amer. Statist. Assoc. 120(549):198–211.Google Scholar
  • Zhu Z, Li X, Wang M, Zhang A (2022) Learning markov models via low-rank optimization. Oper. Res. 70(4):2384–2398.LinkGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.