[1] Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 24 (Curran Associates, Red Hook, NY), 2312–2320.Google Scholar
[2] Abbasi-Yadkori Y, Bartlett PL, Kanade V, Seldin Y, Szepesvari C (2013) Online learning in Markov decision processes with adversarially chosen transition probability distributions. Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Red Hook, NY), 2508–2516.Google Scholar
[3] Agarwal A, Agarwal S, Patil P (2021) Stochastic dueling bandits with adversarial corruption. Feldman V, Ligett K, Sabato S, eds. Proc. 32nd Internat. Conf. Algorithmic Learn. Theory, Proceedings of Machine Learning Research, vol. 132 (ML Research Press, New York), 217–248.Google Scholar
[4] Amir I, Attias I, Koren T, Livni R, Mansour Y (2020) Prediction with corrupted expert advice. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Red Hook, NY), 14315–14325.Google Scholar
[5] Audibert JY, Munos R, Szepesvári C (2009) Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410(19):1876–1902.Crossref, Google Scholar
[6] Auer P, Chiang C (2016) An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. Feldman V, Rakhlin A, Shamir O, eds. Proc. 29th Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 49 (ML Research Press, New York), 116–120.Google Scholar
[7] Auer P, Cesa-Bianchi N, Fisher P (2002) Finite-time regret bounds for the multi-armed bandit problems. Machine Learn. 47:2–3.Crossref, Google Scholar
[8] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
[9] Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn. (JMLR.org), 263–272.Google Scholar
[10] Bogunovic I, Krause A, Scarlett J (2020) Corruption-tolerant Gaussian process bandit optimization. Chiappa S, Calandra R, eds. Proc. 23rd Internat. Conf. Artificial Intelligence Statist, Proceedings of Machine Learning Research, vol. 108 (ML Research Press, New York), 1071–1081.Google Scholar
[11] Bubeck S, Slivkins A (2012) The best of both worlds: Stochastic and adversarial bandits. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 23 (ML Research Press, New York), 42.1–42.23.Google Scholar
[12] Chen X, Wang Y (2023) Robust dynamic pricing with demand learning in the presence of outlier customers. Oper. Res. 71(4):1362–1386.Link, Google Scholar
[13] Chen Y, Du SS, Jamieson K (2021) Improved corruption robust algorithms for episodic reinforcement learning. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (ML Research Press, New York), 1561–1570.Google Scholar
[14] Chen X, Krishnamurthy A, Wang Y (2023) Robust dynamic assortment optimization in the presence of outlier customers. Oper. Res., ePub ahead of print August 21, https://doi.org/10.1287/opre.2020.0281.Link, Google Scholar
[15] Cheung WC, Simchi-Levi D, Zhu R (2023) Nonstationary reinforcement learning: The blessing of (more) optimism. Management Sci. 69(10):5722–5739.Link, Google Scholar
[16] Dann C, Brunskill E (2015) Sample complexity of episodic fixed-horizon reinforcement learning. Cortes C, Lee DD, Sugiyama M, Garnett R, eds. Proc. 28th Internat. Conf. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2818–2826.Google Scholar
[17] Dann C, Lattimore T, Brunskill E (2017) Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Red Hook, NY), 5713–5723.Google Scholar
[18] Domingues OD, Ménard P, Pirotta M, Kaufmann E, Valko M (2021) A kernel-based approach to non-stationary reinforcement learning in metric spaces. Banerjee A, Fukumizu K, eds. Proc. 24th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 130 (ML Research Press, New York), 3538–3546.Google Scholar
[19] Du SS, Luo Y, Wang R, Zhang H (2019) Provably efficient Q-learning with function approximation via distribution shift error checking oracle. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Proc. 33rd Conf. Neural Information Processing Systems (Curran Associates, Red Hook, NY), 8058–8068.Google Scholar
[20] Even-Dar E, Kakade SM, Mansour Y (2009) Online Markov decision processes. Math. Oper. Res. 34(3):726–736.Link, Google Scholar
[21] Even-Dar E, Mannor S, Mansour Y (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res. 7:1079–1105.Google Scholar
[22] Golrezaei N, Manshadi VH, Schneider J, Sekar S (2023) Learning product rankings robust to fake users. Oper. Res. 71(4):1171–1196.Link, Google Scholar
[23] Gupta A, Koren T, Talwar K (2019) Better algorithms for stochastic bandits with adversarial corruptions. Beygelzimer A, Hsu D, eds. Proc. 32nd Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 99 (ML Research Press, New York), 1562–1578.Google Scholar
[24] He J, Zhou D, Gu Q (2021) Logarithmic regret for reinforcement learning with linear function approximation. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (ML Research Press, New York), 4171–4180.Google Scholar
[25] Jabbari S, Joseph M, Kearns M, Morgenstern J, Roth A (2017) Fairness in reinforcement learning. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 1617–1626.Google Scholar
[26] Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(April):1563–1600.Google Scholar
[27] Jin T, Luo H (2020) Simultaneously learning stochastic and adversarial episodic MDPs with known transition. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Red Hook, NY), 16557–16566.Google Scholar
[28] Jin T, Huang L, Luo H (2021) The best of both worlds: Stochastic and adversarial episodic MDPs with unknown transition. Annual Conf. Neural Inform. Processing Systems, Advances in Neural Information Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 20491–20502.Google Scholar
[29] Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Red Hook, NY), 4863–4873.Google Scholar
[30] Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Abernethy J, Agarwal S, eds. Proc. 33rd Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 125 (ML Research Press, New York), 2137–2143.Google Scholar
[31] Jin C, Jin T, Luo H, Sra S, Yu T (2020) Learning adversarial MDPs with bandit feedback and unknown transition. Daumé III H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (ML Research Press, New York), 4860–4869.Google Scholar
[32] Kakade SM (2003) On the sample complexity of reinforcement learning. Unpublished PhD thesis, University College London, London.Google Scholar
[33] Krishnamurthy A, Lykouris T, Podimata C, Schapire RE (2023) Contextual search in the presence of adversarial corruptions. Oper. Res. 71(4):1120–1135.Link, Google Scholar
[34] Li Y, Lou EY, Shan L (2019) Stochastic linear optimization with adversarial corruption. Preprint, submitted September 4, https://arxiv.org/abs/1909.02109.Google Scholar
[35] Lykouris T, Mirrokni V, Paes Leme R (2018) Stochastic bandits robust to adversarial corruptions. Diakonikolas I, Kempe D, eds. Proc. 50th ACM Annual Sympos. Theory Comput. (Association for Computing Machinery, New York), 114–122.Google Scholar
[36] Lykouris T, Simchowitz M, Slivkins A, Sun W (2020) Corruption robust exploration in episodic reinforcement learning. Preprint, submitted April 29, https://arxiv.org/pdf/1911.08689v2.Google Scholar
[37] Neu G, Olkhovskaya J (2021) Online learning in MDPs with linear function approximation and bandit feedback. Ranzato M, Beygelzimer A, Dauphin YN, Liang P, Vaughan JW, eds. Advances in Neural Information Processing Systems, vol. 34 (Curran Associates, Red Hook, NY), 10407–10417.Google Scholar
[38] Neu G, Pike-Burke C (2020) A unifying view of optimism in episodic reinforcement learning. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Red Hook, NY), 1392–1403.Google Scholar
[39] Neu G, Gyorgy A, Szepesvári C (2012) The adversarial stochastic shortest path problem with unknown transition probabilities. Lawrence ND, Girolami M, eds. Proc. 15th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 22 (ML Research Press, New York), 805–813.Google Scholar
[40] Neu G, Antos A, György A, Szepesvári C (2010) Online Markov decision processes under bandit feedback. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Advances in Neural Information Processing Systems, vol. 23 (Curran Associates, Red Hook, NY), 1804–1812.Google Scholar
[41] Rosenberg A, Mansour Y (2019) Online convex optimization in adversarial Markov decision processes. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 97 (ML Research Press, New York), 5478–5486.Google Scholar
[42] Russo D (2019) Worst-case regret bounds for exploration via randomized value functions. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Red Hook, NY), 14410–14420.Google Scholar
[43] Seldin Y, Lugosi G (2017) An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. Proc. 30th Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 65 (ML Research Press, New York), 1743–1759.Google Scholar
[44] Seldin Y, Slivkins A (2014) One practical algorithm for both stochastic and adversarial bandits. Xing EP, Jebara T, eds. Proc. 31th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 32 (JMLR.org), 1287–1295.Google Scholar
[45] Simchowitz M, Jamieson K (2019) Non-asymptotic gap-dependent regret bounds for tabular MDPs. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Red Hook, NY), 1153–1162.Google Scholar
[46] Wang Y, Wang R, Du SS, Krishnamurthy A (2021) Optimism in reinforcement learning with generalized linear function approximation. Proc. Ninth Internat. Conf. Learn. Representations (ICLR, Appleton, WI), 1–17.Google Scholar
[47] Wei C, Luo H (2018) More adaptive algorithms for adversarial bandits. Bubeck S, Perchet V, Rigollet P, eds. Proc. 31st Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 75 (ML Research Press, New York), 1263–1291.Google Scholar
[48] Wei CY, Luo H (2021) Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Belkin M, Kpotufe S, eds. Proc. 34th Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 134 (ML Research Press, New York), 4300–4354.Google Scholar
[49] Wei C, Dann C, Zimmert J (2022) A model selection approach for corruption robust reinforcement learning. Dasgupta S, Haghtalab N, eds. Internat. Conf. Algorithmic Learn. Theory, Proceedings of Machine Learning Research, vol. 167 (ML Research Press, New York), 1043–1096.Google Scholar
[50] Yang L, Wang M (2020) Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (ML Research Press, New York), 10746–10756.Google Scholar
[51] Zanette A, Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 97 (ML Research Press, New York), 7304–7312.Google Scholar
[52] Zhang X, Chen Y, Zhu X, Sun W (2021) Robust policy gradient against strong data corruption. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (ML Research Press, New York), 12391–12401.Google Scholar
[53] Zimin A, Neu G (2013) Online learning in episodic Markovian decision processes by relative entropy policy search. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Proc. 26th Internat. Conf. Neural Information Processing Systems (Curran Associates, Red Hook, NY), 1583–1591.Google Scholar
[54] Zimmert J, Seldin Y (2021) Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. J. Machine Learn. Res. 22(1):1310–1358.Google Scholar
[55] Zimmert J, Luo H, Wei C (2019) Beating stochastic and adversarial semi-bandits optimally and simultaneously. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 97 (ML Research Press, New York), 7683–7692.Google Scholar

cover image Mathematics of Operations Research

Volume 50, Issue 2

May 2025

Pages iii, 783-1583

Article Information

Supplemental Material

Metrics

Information

Received:August 15, 2021
Accepted:October 08, 2023
Published Online:May 23, 2024

Cite as

Thodoris Lykouris; , Max Simchowitz; , Aleksandrs Slivkins; , Wen Sun (2024) Corruption-Robust Exploration in Episodic Reinforcement Learning. Mathematics of Operations Research 50(2):1277-1304.

https://doi.org/10.1287/moor.2021.0202

Keywords

Acknowledgments

The authors thank Christina Lee Yu, Sean Sinclair, and Éva Tardos for useful discussions that helped improve the presentation of this paper, as well as the anonymous review teams at the 34th Annual Conference on Learning Theory and Mathematics of Operations Research for their valuable feedback. A previous version of this paper appeared on arXiv.org in November 2019, focusing on tabular RL (Theorem 2.1). The extension to linear MDPs (Theorem 2.2) was added in April 2020. A one-page abstract appeared in the Proceedings of the 34th Annual Conference on Learning Theory. All results were obtained while T. Lykouris and W. Sun were postdocs and M. Simchowitz was an intern at Microsoft Research Lab–New York City.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Corruption-Robust Exploration in Episodic Reinforcement Learning

References

Volume 50, Issue 2

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News