A General Framework for Bandit Problems Beyond Cumulative Objectives

Asaf Cassel (Corresponding Author)
https://orcid.org/0000-0003-3566-6948
School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel

Shie Mannor
Faculty of Electrical and Computer Engineering and Faculty of Industrial Engineering and Management, Technion, Israel Institute of Technology, Haifa 3200003, Israel; Nvidia Research, Tel Aviv 6777506, Israel

Assaf Zeevi
https://orcid.org/0000-0003-1075-6664
Graduate School of Business, Columbia University, New York, New York 10027; Data Science Institute, Columbia University, New York, New York 10027

Published Online: 6 Jan 2023
https://doi.org/10.1287/moor.2022.1335

Abstract

The stochastic multiarmed bandit (MAB) problem is a common model for sequential decision problems. In the standard setup, a decision maker has to choose at every instant between several competing arms, each of which provides a scalar random variable, referred to as a “reward.” Nearly all research on this topic considers the total cumulative reward as the criterion of interest. This work focuses on other natural objectives that cannot be cast as a sum over rewards but are instead more involved functions of the reward stream. Unlike the case of cumulative criteria, in the problems we study here, the oracle policy, which knows the problem parameters a priori and is used to “center” the regret, is not trivial. We provide a systematic approach to such problems and derive general conditions under which the oracle policy is sufficiently tractable to facilitate the design of optimism-based (upper confidence bound) learning policies. These conditions elucidate an interesting interplay between the arm reward distributions and the performance metric. Our main findings are illustrated for several commonly used objectives, such as conditional value-at-risk, mean-variance trade-offs, Sharpe ratio, and more.
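To make concrete what it means for an objective not to be a sum over rewards, the following minimal sketch evaluates empirical versions of three of the objectives named above on two reward streams with the same cumulative reward. It is an illustration only, not the paper's method: the function names, the `risk_weight` and `alpha` parameters, and the specific empirical definitions (population variance, lower-tail CVaR as the mean of the worst `alpha` fraction of rewards) are assumptions chosen for this example, and the paper's exact definitions may differ.

```python
import numpy as np


def mean_variance(rewards, risk_weight=1.0):
    """Empirical mean minus risk_weight times the empirical (population) variance.

    Assumed form of a mean-variance trade-off; risk_weight is a hypothetical knob.
    """
    r = np.asarray(rewards, dtype=float)
    return r.mean() - risk_weight * r.var()


def sharpe_ratio(rewards):
    """Empirical mean divided by empirical standard deviation.

    Assumes the stream has nonzero variance.
    """
    r = np.asarray(rewards, dtype=float)
    return r.mean() / r.std()


def cvar(rewards, alpha=0.5):
    """Empirical lower-tail CVaR: the mean of the worst alpha fraction of rewards."""
    r = np.sort(np.asarray(rewards, dtype=float))
    k = max(1, int(np.ceil(alpha * r.size)))  # number of worst rewards to average
    return r[:k].mean()


# Two hypothetical reward streams with identical cumulative reward.
steady = [0.5, 1.5, 0.5, 1.5]
volatile = [0.0, 2.0, 0.0, 2.0]

for name, stream in [("steady", steady), ("volatile", volatile)]:
    print(name, sum(stream), mean_variance(stream), sharpe_ratio(stream), cvar(stream))
```

Both streams sum to the same total, so a cumulative criterion cannot tell them apart, yet each of the three criteria above ranks the steadier stream higher. Because these quantities depend on the whole stream rather than on each reward separately, the best policy even with known arm distributions need not be to play a single arm throughout.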

Funding: This work was partially funded by the Israel Science Foundation [Contract 2199/20] and by the European Community’s Seventh Framework Programme FP7/2007–2013 [Grant 306638 (Scaling Up Reinforcement Learning: Structure Learning, Skill Acquisition, and Reward Shaping)].

Mathematics of Operations Research

Volume 48, Issue 4

November 2023

Pages 1811-2382, C2

Article Information

Received: November 01, 2020
Accepted: October 11, 2022
Published Online: January 06, 2023

Cite as

Asaf Cassel, Shie Mannor, Assaf Zeevi (2023) A General Framework for Bandit Problems Beyond Cumulative Objectives. Mathematics of Operations Research 48(4):2196-2232.

https://doi.org/10.1287/moor.2022.1335

Acknowledgments

The authors thank Ron Amit, Guy Tennenholtz, Nir Baram, and Nadav Merlis for helpful discussions of this work. The authors also thank the anonymous reviewers for providing the examples in Section 6 and additional helpful comments that improved this work. A preliminary version of this work appeared at the Conference on Learning Theory 2018.
