Open Access

Collaborative Learning and Decision Making on Pricing and Recommendation: A Simple Framework for Planning

Junyu Cao
https://orcid.org/0000-0001-9235-1411
McCombs School of Business, The University of Texas at Austin, Austin, Texas 78712

Published Online: 11 Nov 2025

https://doi.org/10.1287/mnsc.2023.00320

Received: January 28, 2023
Accepted: April 29, 2025
Published Online: November 11, 2025

Cite as

Junyu Cao (2025) Collaborative Learning and Decision Making on Pricing and Recommendation: A Simple Framework for Planning. Management Science 0(0).

https://doi.org/10.1287/mnsc.2023.00320

Acknowledgments

The author thanks J. George Shanthikumar (department editor), the associate editor, and three anonymous referees for constructive review comments and Kai Yin from Expedia Group for insightful discussions.
