Corruption-Robust Exploration in Episodic Reinforcement Learning
References
- [1] (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 24 (Curran Associates, Red Hook, NY), 2312–2320.Google Scholar
- [2] (2013) Online learning in Markov decision processes with adversarially chosen transition probability distributions. Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Red Hook, NY), 2508–2516.Google Scholar
- [3] (2021) Stochastic dueling bandits with adversarial corruption. Feldman V, Ligett K, Sabato S, eds. Proc. 32nd Internat. Conf. Algorithmic Learn. Theory, Proceedings of Machine Learning Research, vol. 132 (ML Research Press, New York), 217–248.Google Scholar
- [4] (2020) Prediction with corrupted expert advice. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Red Hook, NY), 14315–14325.Google Scholar
- [5] (2009) Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410(19):1876–1902.Crossref, Google Scholar
- [6] (2016) An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. Feldman V, Rakhlin A, Shamir O, eds. Proc. 29th Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 49 (ML Research Press, New York), 116–120.Google Scholar
- [7] (2002) Finite-time regret bounds for the multi-armed bandit problems. Machine Learn. 47:2–3.Crossref, Google Scholar
- [8] (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
- [9] (2017) Minimax regret bounds for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn. (JMLR.org), 263–272.Google Scholar
- [10] (2020) Corruption-tolerant Gaussian process bandit optimization. Chiappa S, Calandra R, eds. Proc. 23rd Internat. Conf. Artificial Intelligence Statist, Proceedings of Machine Learning Research, vol. 108 (ML Research Press, New York), 1071–1081.Google Scholar
- [11] (2012) The best of both worlds: Stochastic and adversarial bandits. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 23 (ML Research Press, New York), 42.1–42.23.Google Scholar
- [12] (2023) Robust dynamic pricing with demand learning in the presence of outlier customers. Oper. Res. 71(4):1362–1386.Link, Google Scholar
- [13] (2021) Improved corruption robust algorithms for episodic reinforcement learning. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (ML Research Press, New York), 1561–1570.Google Scholar
- [14] (2023) Robust dynamic assortment optimization in the presence of outlier customers. Oper. Res., ePub ahead of print August 21, https://doi.org/10.1287/opre.2020.0281.Link, Google Scholar
- [15] (2023) Nonstationary reinforcement learning: The blessing of (more) optimism. Management Sci. 69(10):5722–5739.Link, Google Scholar
- [16] (2015) Sample complexity of episodic fixed-horizon reinforcement learning. Cortes C, Lee DD, Sugiyama M, Garnett R, eds. Proc. 28th Internat. Conf. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2818–2826.Google Scholar
- [17] (2017) Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Red Hook, NY), 5713–5723.Google Scholar
- [18] (2021) A kernel-based approach to non-stationary reinforcement learning in metric spaces. Banerjee A, Fukumizu K, eds. Proc. 24th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 130 (ML Research Press, New York), 3538–3546.Google Scholar
- [19] (2019) Provably efficient Q-learning with function approximation via distribution shift error checking oracle. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Proc. 33rd Conf. Neural Information Processing Systems (Curran Associates, Red Hook, NY), 8058–8068.Google Scholar
- [20] (2009) Online Markov decision processes. Math. Oper. Res. 34(3):726–736.Link, Google Scholar
- [21] (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res. 7:1079–1105.Google Scholar
- [22] (2023) Learning product rankings robust to fake users. Oper. Res. 71(4):1171–1196.Link, Google Scholar
- [23] (2019) Better algorithms for stochastic bandits with adversarial corruptions. Beygelzimer A, Hsu D, eds. Proc. 32nd Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 99 (ML Research Press, New York), 1562–1578.Google Scholar
- [24] (2021) Logarithmic regret for reinforcement learning with linear function approximation. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (ML Research Press, New York), 4171–4180.Google Scholar
- [25] (2017) Fairness in reinforcement learning. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 1617–1626.Google Scholar
- [26] (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(April):1563–1600.Google Scholar
- [27] (2020) Simultaneously learning stochastic and adversarial episodic MDPs with known transition. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Red Hook, NY), 16557–16566.Google Scholar
- [28] (2021) The best of both worlds: Stochastic and adversarial episodic MDPs with unknown transition. Annual Conf. Neural Inform. Processing Systems, Advances in Neural Information Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 20491–20502.Google Scholar
- [29] (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Red Hook, NY), 4863–4873.Google Scholar
- [30] (2020) Provably efficient reinforcement learning with linear function approximation. Abernethy J, Agarwal S, eds. Proc. 33rd Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 125 (ML Research Press, New York), 2137–2143.Google Scholar
- [31] (2020) Learning adversarial MDPs with bandit feedback and unknown transition. Daumé III H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (ML Research Press, New York), 4860–4869.Google Scholar
- [32] (2003) On the sample complexity of reinforcement learning. Unpublished PhD thesis, University College London, London.Google Scholar
- [33] (2023) Contextual search in the presence of adversarial corruptions. Oper. Res. 71(4):1120–1135.Link, Google Scholar
- [34] (2019) Stochastic linear optimization with adversarial corruption. Preprint, submitted September 4, https://arxiv.org/abs/1909.02109.Google Scholar
- [35] (2018) Stochastic bandits robust to adversarial corruptions. Diakonikolas I, Kempe D, eds. Proc. 50th ACM Annual Sympos. Theory Comput. (Association for Computing Machinery, New York), 114–122.Google Scholar
- [36] (2020) Corruption robust exploration in episodic reinforcement learning. Preprint, submitted April 29, https://arxiv.org/pdf/1911.08689v2.Google Scholar
- [37] (2021) Online learning in MDPs with linear function approximation and bandit feedback. Ranzato M, Beygelzimer A, Dauphin YN, Liang P, Vaughan JW, eds. Advances in Neural Information Processing Systems, vol. 34 (Curran Associates, Red Hook, NY), 10407–10417.Google Scholar
- [38] (2020) A unifying view of optimism in episodic reinforcement learning. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates, Red Hook, NY), 1392–1403.Google Scholar
- [39] (2012) The adversarial stochastic shortest path problem with unknown transition probabilities. Lawrence ND, Girolami M, eds. Proc. 15th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 22 (ML Research Press, New York), 805–813.Google Scholar
- [40] (2010) Online Markov decision processes under bandit feedback. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Advances in Neural Information Processing Systems, vol. 23 (Curran Associates, Red Hook, NY), 1804–1812.Google Scholar
- [41] (2019) Online convex optimization in adversarial Markov decision processes. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 97 (ML Research Press, New York), 5478–5486.Google Scholar
- [42] (2019) Worst-case regret bounds for exploration via randomized value functions. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Red Hook, NY), 14410–14420.Google Scholar
- [43] (2017) An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. Proc. 30th Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 65 (ML Research Press, New York), 1743–1759.Google Scholar
- [44] (2014) One practical algorithm for both stochastic and adversarial bandits. Xing EP, Jebara T, eds. Proc. 31th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 32 (JMLR.org), 1287–1295.Google Scholar
- [45] (2019) Non-asymptotic gap-dependent regret bounds for tabular MDPs. Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Red Hook, NY), 1153–1162.Google Scholar
- [46] (2021) Optimism in reinforcement learning with generalized linear function approximation. Proc. Ninth Internat. Conf. Learn. Representations (ICLR, Appleton, WI), 1–17.Google Scholar
- [47] (2018) More adaptive algorithms for adversarial bandits. Bubeck S, Perchet V, Rigollet P, eds. Proc. 31st Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 75 (ML Research Press, New York), 1263–1291.Google Scholar
- [48] (2021) Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Belkin M, Kpotufe S, eds. Proc. 34th Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 134 (ML Research Press, New York), 4300–4354.Google Scholar
- [49] (2022) A model selection approach for corruption robust reinforcement learning. Dasgupta S, Haghtalab N, eds. Internat. Conf. Algorithmic Learn. Theory, Proceedings of Machine Learning Research, vol. 167 (ML Research Press, New York), 1043–1096.Google Scholar
- [50] (2020) Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (ML Research Press, New York), 10746–10756.Google Scholar
- [51] (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 97 (ML Research Press, New York), 7304–7312.Google Scholar
- [52] (2021) Robust policy gradient against strong data corruption. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (ML Research Press, New York), 12391–12401.Google Scholar
- [53] (2013) Online learning in episodic Markovian decision processes by relative entropy policy search. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Proc. 26th Internat. Conf. Neural Information Processing Systems (Curran Associates, Red Hook, NY), 1583–1591.Google Scholar
- [54] (2021) Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. J. Machine Learn. Res. 22(1):1310–1358.Google Scholar
- [55] (2019) Beating stochastic and adversarial semi-bandits optimally and simultaneously. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 97 (ML Research Press, New York), 7683–7692.Google Scholar

