Thompson Sampling for the Multinomial Logit Bandit

Shipra Agrawal
Corresponding Author
Shipra Agrawal
[email protected]
https://orcid.org/0000-0003-4486-3871
Industrial Engineering and Operations Research, Columbia University, New York, New York 10027
Search for more papers by this author
,
Vashist Avadhanula
Vashist Avadhanula
[email protected]
https://orcid.org/0000-0002-3045-691X
Decision, Risk and Operations, Columbia Business School, New York, New York 10027
Search for more papers by this author
,
Vineet Goyal
Vineet Goyal
[email protected]
https://orcid.org/0000-0001-6719-3212
Industrial Engineering and Operations Research, Columbia University, New York, New York 10027
Search for more papers by this author
,
Assaf Zeevi
Assaf Zeevi
[email protected]
https://orcid.org/0000-0003-1075-6664
Decision, Risk and Operations, Columbia Business School, New York, New York 10027
Search for more papers by this author

Shipra Agrawal

Corresponding Author

Shipra Agrawal

[email protected]

https://orcid.org/0000-0003-4486-3871

Industrial Engineering and Operations Research, Columbia University, New York, New York 10027

Search for more papers by this author

Vashist Avadhanula

[email protected]

https://orcid.org/0000-0002-3045-691X

Decision, Risk and Operations, Columbia Business School, New York, New York 10027

Search for more papers by this author

Vineet Goyal

[email protected]

https://orcid.org/0000-0001-6719-3212

Industrial Engineering and Operations Research, Columbia University, New York, New York 10027

Search for more papers by this author

Assaf Zeevi

[email protected]

https://orcid.org/0000-0003-1075-6664

Decision, Risk and Operations, Columbia Business School, New York, New York 10027

Search for more papers by this author

Published Online:6 Mar 2025https://doi.org/10.1287/moor.2020.0096

References

[1] Abramowitz M, Stegun IA (1964) Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables (US Government Printing Office, Washington, DC).Google Scholar
[2] Agrawal S, Goyal N (2013) Further optimal regret bounds for Thompson sampling. Proc. Sixteenth Internat. Conf. Artificial Intelligence Statist. (AISTATS), vol. 31 (JMLR W&CP, Scotsdale, AZ), 99–107.Google Scholar
[3] Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. Proc. 30th Internat. Conf. Machine Learn., vol. 28 (JMLR W&CP, Atlanta), 127–135.Google Scholar
[4] Agrawal S, Avadhanula V, Goyal V, Zeevi A (2016) A near-optimal exploration-exploitation approach for assortment selection. Proc. 2016 ACM Conf. Econom. Comput. (EC) (Association for Computing Machinery, New York), 599–600.Google Scholar
[5] Auer P (2003) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3:397–422.Google Scholar
[6] Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47:235–256.Crossref, Google Scholar
[7] Avadhanula V, Bhandari J, Goyal V, Zeevi A (2016) On the tightness of an LP relaxation for rational optimization and its applications. Oper. Res. Lett. 44(5):612–617.Crossref, Google Scholar
[8] Ben-Akiva M, Lerman S (1985) Discrete Choice Analysis: Theory and Application to Travel Demand, vol. 9 (MIT Press, Cambridge, MA).Google Scholar
[9] Chen X, Krishnamurthy A, Wang Y (2023) Robust dynamic assortment optimization in the presence of outlier customers. Oper. Res. 72(3):999–1015.Google Scholar
[10] Chen X, Wang Y, Zhou Y (2020) Dynamic assortment optimization with changing contextual information. J. Machine Learn. Res. 21(1).Google Scholar
[11] Chen X, Wang Y, Zhou Y (2021) Dynamic assortment selection under the nested logit models. Production Oper. Management 30:85–102.Google Scholar
[12] Cheung W, Simchi-Levi D (2017) Thompson sampling for online personalized assortment optimization problems with multinomial logit choice models. Preprint, submitted November 27, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3075658.Google Scholar
[13] Davis J, Gallego G, Topaloglu H (2013) Assortment planning under the multinomial logit model with totally unimodular constraint structures. Technical report, Cornell University, Ithaca, NY.Google Scholar
[14] Feng Y, Caldentey R, Christopher TR (2021) Robust learning of consumer preferences. Oper. Res. 70(2):918–962.Google Scholar
[15] Gopalan A, Mannor S, Mansour Y (2014) Thompson sampling for complex online problems. Proc. 31st Internat. Conf. Machine Learn., vol. 32 (JMLR W&CP, Beijing), 100–108.Google Scholar
[16] Graepel T, Candela JQ, Borchert T, Herbrich R (2010) Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. Proc. 27th Internat. Conf. Machine Learn. (ICML) (Omnipress, Madison, WI), 13–20.Google Scholar
[17] Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An asymptotically optimal finite-time analysis. Proc. 23rd Internat. Conf. Algorithmic Learn. Theory (Springer-Verlag, Berlin, Heidelberg), 199–213.Google Scholar
[18] Korda N, Kaufmann E, Munos R (2013) Thompson sampling for 1-dimensional exponential family bandits. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY).Google Scholar
[19] Luce RD (1959) Individual Choice Behavior: A Theoretical Analysis (Wiley, New York).Google Scholar
[20] May BC, Korda N, Lee A, Leslie DS (2012) Optimistic Bayesian sampling in contextual-bandit problems. J. Machine Learn. Res. 13(1):2069–2106.Google Scholar
[21] McFadden D (1978) Modelling the Choice of Residential Location (Institute of Transportation Studies, University of California, Berkeley).Google Scholar
[22] Miao S, Chao X (2020) Dynamic joint assortment and pricing optimization with demand learning. Manufacturing Service Oper. Management 23(2):525–545.Google Scholar
[23] Miao S, Chao X (2019) Fast algorithms for online personalized assortment optimization in a big data regime. Preprint, submitted August 5, http://dx.doi.org/10.2139/ssrn.3432574.Google Scholar
[24] Oh M, Iyengar G (2019) Multinomial logit contextual bandits. Working paper, Columbia University, New York.Google Scholar
[25] Oh M, Iyengar G (2019) Thompson sampling for multinomial logit contextual bandits. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 3145–3155.Google Scholar
[26] Oliver C, Li L (2011) An empirical evaluation of Thompson sampling. Adv. Neural Inform. Processing Systems (NIPS), vol. 24 (Curran Associates, Inc., Red Hook, NY), 2249–2257.Google Scholar
[27] Plackett RL (1975) The analysis of permutations. J. Roy. Statist. Soc. Ser. C Appl. Statist. 24(2):193–202.Google Scholar
[28] Rusmevichientong P, Tsitsiklis JN (2010) Linearly parameterized bandits. Math. Oper. Res. 35(2):395–411.Link, Google Scholar
[29] Rusmevichientong P, Shen ZM, Shmoys DB (2010) Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Oper. Res. 58(6):1666–1680.Link, Google Scholar
[30] Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
[31] Russo D, Van Roy B, Kazerouni A, Osband I, Wen Z (2018) A Tutorial on Thompson Sampling (Now Foundations and Trends, Norwell, MA).Crossref, Google Scholar
[32] Saha A, Gopalan A (2019) Regret minimisation in multinomial logit bandits. Preprint, submitted March 1, https://arxiv.org/abs/1903.00543v1.Google Scholar
[33] Sauré D, Zeevi A (2013) Optimal dynamic assortment planning with demand learning. Manufacturing Service Oper. Management 15(3):387–404.Link, Google Scholar
[34] Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
[35] Train K (2003) Discrete Choice Methods with Simulation (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[36] Wang Y, Chen X, Zhou Y (2018) Near-optimal policies for dynamic multinomial logit assortment selection models. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 3101–3110.Google Scholar

cover image Mathematics of Operations Research

Volume 51, Issue 1

February 2026

Pages iv-viii, 1-851

Article Information

Metrics

Information

Received:April 02, 2020
Accepted:April 24, 2023
Published Online:March 06, 2025

Cite as

Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, Assaf Zeevi (2025) Thompson Sampling for the Multinomial Logit Bandit. Mathematics of Operations Research 51(1):568-590.

https://doi.org/10.1287/moor.2020.0096

Keywords

Acknowledgments

The abstract of the preliminary version of this work was published in the Proceedings of Conference on Learning Theory (COLT) 2017. The preprint of this document is available on arXiv at https://arxiv.org/pdf/1706.00977.pdf. The contents of this document are also included in the dissertation of Vashist Avadhanula, which is available at https://academiccommons.columbia.edu/doi/10.7916/d8-6c67-1n16.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Thompson Sampling for the Multinomial Logit Bandit

References

Volume 51, Issue 1

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News