Corruption-Robust Exploration in Episodic Reinforcement Learning

Published Online:https://doi.org/10.1287/moor.2021.0202

References

  • [1] Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 24 (Curran Associates, Red Hook, NY), 2312–2320.Google Scholar
  • [2] Abbasi-Yadkori Y, Bartlett PL, Kanade V, Seldin Y, Szepesvari C (2013) Online learning in Markov decision processes with adversarially chosen transition probability distributions. Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Red Hook, NY), 2508–2516.Google Scholar
  • [3] Agarwal A, Agarwal S, Patil P (2021) Stochastic dueling bandits with adversarial corruption. Feldman V, Ligett K, Sabato S, eds. Proc. 32nd Internat. Conf. Algorithmic Learn. Theory, Proceedings of Machine Learning Research, vol. 132 (ML Research Press, New York), 217–248.Google Scholar
  • [4] Amir I, Attias I, Koren T, Livni R, Mansour Y (2020) Prediction with corrupted expert advice. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Red Hook, NY), 14315–14325.Google Scholar
  • [5] Audibert JY, Munos R, Szepesvári C (2009) Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410(19):1876–1902.CrossrefGoogle Scholar
  • [6] Auer P, Chiang C (2016) An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. Feldman V, Rakhlin A, Shamir O, eds. Proc. 29th Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 49 (ML Research Press, New York), 116–120.Google Scholar
  • [7] Auer P, Cesa-Bianchi N, Fisher P (2002) Finite-time regret bounds for the multi-armed bandit problems. Machine Learn. 47:2–3.CrossrefGoogle Scholar
  • [8] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.CrossrefGoogle Scholar
  • [9] Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn. (JMLR.org), 263–272.Google Scholar
  • [10] Bogunovic I, Krause A, Scarlett J (2020) Corruption-tolerant Gaussian process bandit optimization. Chiappa S, Calandra R, eds. Proc. 23rd Internat. Conf. Artificial Intelligence Statist, Proceedings of Machine Learning Research, vol. 108 (ML Research Press, New York), 1071–1081.Google Scholar
  • [11] Bubeck S, Slivkins A (2012) The best of both worlds: Stochastic and adversarial bandits. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 23 (ML Research Press, New York), 42.1–42.23.Google Scholar
  • [12] Chen X, Wang Y (2023) Robust dynamic pricing with demand learning in the presence of outlier customers. Oper. Res. 71(4):1362–1386.LinkGoogle Scholar
  • [13] Chen Y, Du SS, Jamieson K (2021) Improved corruption robust algorithms for episodic reinforcement learning. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (ML Research Press, New York), 1561–1570.Google Scholar
  • [14] Chen X, Krishnamurthy A, Wang Y (2023) Robust dynamic assortment optimization in the presence of outlier customers. Oper. Res., ePub ahead of print August 21, https://doi.org/10.1287/opre.2020.0281.LinkGoogle Scholar
  • [15] Cheung WC, Simchi-Levi D, Zhu R (2023) Nonstationary reinforcement learning: The blessing of (more) optimism. Management Sci. 69(10):5722–5739.LinkGoogle Scholar
  • [16] Dann C, Brunskill E (2015) Sample complexity of episodic fixed-horizon reinforcement learning. Cortes C, Lee DD, Sugiyama M, Garnett R, eds. Proc. 28th Internat. Conf. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2818–2826.Google Scholar
  • [17] Dann C, Lattimore T, Brunskill E (2017) Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Red Hook, NY), 5713–5723.Google Scholar
  • [18] Domingues OD, Ménard P, Pirotta M, Kaufmann E, Valko M (2021) A kernel-based approach to non-stationary reinforcement learning in metric spaces. Banerjee A, Fukumizu K, eds. Proc. 24th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 130 (ML Research Press, New York), 3538–3546.Google Scholar
  • [19] Du SS, Luo Y, Wang R, Zhang H (2019) Provably efficient Q-learning with function approximation via distribution shift error checking oracle. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Proc. 33rd Conf. Neural Information Processing Systems (Curran Associates, Red Hook, NY), 8058–8068.Google Scholar
  • [20] Even-Dar E, Kakade SM, Mansour Y (2009) Online Markov decision processes. Math. Oper. Res. 34(3):726–736.LinkGoogle Scholar
  • [21] Even-Dar E, Mannor S, Mansour Y (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res. 7:1079–1105.Google Scholar
  • [22] Golrezaei N, Manshadi VH, Schneider J, Sekar S (2023) Learning product rankings robust to fake users. Oper. Res. 71(4):1171–1196.LinkGoogle Scholar
  • [23] Gupta A, Koren T, Talwar K (2019) Better algorithms for stochastic bandits with adversarial corruptions. Beygelzimer A, Hsu D, eds. Proc. 32nd Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 99 (ML Research Press, New York), 1562–1578.Google Scholar
  • [24] He J, Zhou D, Gu Q (2021) Logarithmic regret for reinforcement learning with linear function approximation. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (ML Research Press, New York), 4171–4180.Google Scholar
  • [25] Jabbari S, Joseph M, Kearns M, Morgenstern J, Roth A (2017) Fairness in reinforcement learning. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 1617–1626.Google Scholar
  • [26] Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(April):1563–1600.Google Scholar
  • [27] Jin T, Luo H (2020) Simultaneously learning stochastic and adversarial episodic MDPs with known transition. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Red Hook, NY), 16557–16566.Google Scholar
  • [28] Jin T, Huang L, Luo H (2021) The best of both worlds: Stochastic and adversarial episodic MDPs with unknown transition. Annual Conf. Neural Inform. Processing Systems, Advances in Neural Information Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 20491–20502.Google Scholar
  • [29] Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Red Hook, NY), 4863–4873.Google Scholar
  • [30] Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Abernethy J, Agarwal S, eds. Proc. 33rd Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 125 (ML Research Press, New York), 2137–2143.Google Scholar
  • [31] Jin C, Jin T, Luo H, Sra S, Yu T (2020) Learning adversarial MDPs with bandit feedback and unknown transition. Daumé III H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (ML Research Press, New York), 4860–4869.Google Scholar
  • [32] Kakade SM (2003) On the sample complexity of reinforcement learning. Unpublished PhD thesis, University College London, London.Google Scholar
  • [33] Krishnamurthy A, Lykouris T, Podimata C, Schapire RE (2023) Contextual search in the presence of adversarial corruptions. Oper. Res. 71(4):1120–1135.LinkGoogle Scholar
  • [34] Li Y, Lou EY, Shan L (2019) Stochastic linear optimization with adversarial corruption. Preprint, submitted September 4, https://arxiv.org/abs/1909.02109.Google Scholar
  • [35] Lykouris T, Mirrokni V, Paes Leme R (2018) Stochastic bandits robust to adversarial corruptions. Diakonikolas I, Kempe D, eds. Proc. 50th ACM Annual Sympos. Theory Comput. (Association for Computing Machinery, New York), 114–122.Google Scholar
  • [36] Lykouris T, Simchowitz M, Slivkins A, Sun W (2020) Corruption robust exploration in episodic reinforcement learning. Preprint, submitted April 29, https://arxiv.org/pdf/1911.08689v2.Google Scholar
  • [37] Neu G, Olkhovskaya J (2021) Online learning in MDPs with linear function approximation and bandit feedback. Ranzato M, Beygelzimer A, Dauphin YN, Liang P, Vaughan JW, eds. Advances in Neural Information Processing Systems, vol. 34 (Curran Associates, Red Hook, NY), 10407–10417.Google Scholar
  • [38] Neu G, Pike-Burke C (2020) A unifying view of optimism in episodic reinforcement learning. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Red Hook, NY), 1392–1403.Google Scholar
  • [39] Neu G, Gyorgy A, Szepesvári C (2012) The adversarial stochastic shortest path problem with unknown transition probabilities. Lawrence ND, Girolami M, eds. Proc. 15th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 22 (ML Research Press, New York), 805–813.Google Scholar
  • [40] Neu G, Antos A, György A, Szepesvári C (2010) Online Markov decision processes under bandit feedback. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Advances in Neural Information Processing Systems, vol. 23 (Curran Associates, Red Hook, NY), 1804–1812.Google Scholar
  • [41] Rosenberg A, Mansour Y (2019) Online convex optimization in adversarial Markov decision processes. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 97 (ML Research Press, New York), 5478–5486.Google Scholar
  • [42] Russo D (2019) Worst-case regret bounds for exploration via randomized value functions. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Red Hook, NY), 14410–14420.Google Scholar
  • [43] Seldin Y, Lugosi G (2017) An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. Proc. 30th Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 65 (ML Research Press, New York), 1743–1759.Google Scholar
  • [44] Seldin Y, Slivkins A (2014) One practical algorithm for both stochastic and adversarial bandits. Xing EP, Jebara T, eds. Proc. 31th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 32 (JMLR.org), 1287–1295.Google Scholar
  • [45] Simchowitz M, Jamieson K (2019) Non-asymptotic gap-dependent regret bounds for tabular MDPs. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Red Hook, NY), 1153–1162.Google Scholar
  • [46] Wang Y, Wang R, Du SS, Krishnamurthy A (2021) Optimism in reinforcement learning with generalized linear function approximation. Proc. Ninth Internat. Conf. Learn. Representations (ICLR, Appleton, WI), 1–17.Google Scholar
  • [47] Wei C, Luo H (2018) More adaptive algorithms for adversarial bandits. Bubeck S, Perchet V, Rigollet P, eds. Proc. 31st Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 75 (ML Research Press, New York), 1263–1291.Google Scholar
  • [48] Wei CY, Luo H (2021) Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Belkin M, Kpotufe S, eds. Proc. 34th Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 134 (ML Research Press, New York), 4300–4354.Google Scholar
  • [49] Wei C, Dann C, Zimmert J (2022) A model selection approach for corruption robust reinforcement learning. Dasgupta S, Haghtalab N, eds. Internat. Conf. Algorithmic Learn. Theory, Proceedings of Machine Learning Research, vol. 167 (ML Research Press, New York), 1043–1096.Google Scholar
  • [50] Yang L, Wang M (2020) Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (ML Research Press, New York), 10746–10756.Google Scholar
  • [51] Zanette A, Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 97 (ML Research Press, New York), 7304–7312.Google Scholar
  • [52] Zhang X, Chen Y, Zhu X, Sun W (2021) Robust policy gradient against strong data corruption. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (ML Research Press, New York), 12391–12401.Google Scholar
  • [53] Zimin A, Neu G (2013) Online learning in episodic Markovian decision processes by relative entropy policy search. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Proc. 26th Internat. Conf. Neural Information Processing Systems (Curran Associates, Red Hook, NY), 1583–1591.Google Scholar
  • [54] Zimmert J, Seldin Y (2021) Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. J. Machine Learn. Res. 22(1):1310–1358.Google Scholar
  • [55] Zimmert J, Luo H, Wei C (2019) Beating stochastic and adversarial semi-bandits optimally and simultaneously. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 97 (ML Research Press, New York), 7683–7692.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.