Uncertainty Quantification and Exploration for Reinforcement Learning

Published Online: https://doi.org/10.1287/opre.2023.2436

References

  • Achiam J, Held D, Tamar A, Abbeel P (2017) Constrained policy optimization. Proc. 34th Internat. Conf. Machine Learn. (JMLR), 70:22–31.
  • Altman E (1999) Constrained Markov Decision Processes, vol. 7 (CRC Press, Boca Raton, FL).
  • Amini A, Gilitschenski I, Phillips J, Moseyko J, Banerjee R, Karaman S, Rus D (2020) Learning robust control policies for end-to-end autonomous driving from data-driven simulation. IEEE Robot. Autom. Lett. 5(2):1143–1150.
  • Audibert JY, Bubeck S (2010) Best arm identification in multi-armed bandits. 23rd Annual Conf. Learn. Theory (COLT 2010), 41–53.
  • Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Internat. Conf. Machine Learn. (PMLR), 263–272.
  • Barton RR (2012) Tutorial: Input uncertainty in output analysis. Laroque C, Himmelspach J, Pasupathy R, Rose O, Uhrmacher A, eds. Proc. 2012 Winter Simulation Conf. (IEEE, Piscataway, NJ), 1–12.
  • Barton RR, Nelson BL, Xie W (2013) Quantifying input uncertainty via simulation confidence intervals. INFORMS J. Comput. 26(1):74–87.
  • Barton RR, Schruben LW (2001) Resampling methods for input modeling. Peters BA, Smith JS, Medeiros DJ, Rohrer MW, eds. Proc. 2001 Winter Simulation Conf. (IEEE, Piscataway, NJ), 1:372–378.
  • Bayraksan G, Morton DP (2006) Assessing solution quality in stochastic programs. Math. Programming 108(2–3):495–514.
  • Bellemare MG, Dabney W, Munos R (2017) A distributional perspective on reinforcement learning. Internat. Conf. Machine Learn. (PMLR), 449–458.
  • Boutilier C, Lu T (2016) Budget allocation using weakly coupled, constrained Markov decision processes. Proc. Thirty-Second Conf. Uncertainty Artificial Intelligence (AUAI Press), 52–61.
  • Chen CH, He D, Fu M (2006) Efficient dynamic simulation allocation in ordinal optimization. IEEE Trans. Automat. Control. 51(12):2005–2009.
  • Chen CH, Lee LH (2011) Stochastic Simulation Optimization: An Optimal Computing Budget Allocation, vol. 1 (World Scientific, Singapore).
  • Chen S, Devraj A, Busic A, Meyn S (2020) Explicit mean-square error bounds for Monte-Carlo and linear stochastic approximation. Internat. Conf. Artificial Intelligence Statist. (PMLR), 4173–4183.
  • Chen W, Gao S, Chen CH, Shi L (2013) An optimal sample allocation strategy for partition-based random search. IEEE Trans. Autom. Sci. Engrg. 11(1):177–186.
  • Cheng RC, Holland W (1997) Sensitivity of computer simulation experiments to errors in input data. J. Statist. Comput. Simul. 57(1–4):219–241.
  • Cheng RC, Holland W (2004) Calculation of confidence intervals for simulation output. ACM Trans. Model. Comput. Simul. 14(4):344–362.
  • Chick SE (2001) Input distribution selection for simulation experiments: Accounting for input uncertainty. Oper. Res. 49(5):744–758.
  • Chow Y, Ghavamzadeh M, Janson L, Pavone M (2017) Risk-constrained reinforcement learning with percentile risk criteria. J. Machine Learn. Res. 18(1):6070–6120.
  • Corso A, Moss RJ, Koren M, Lee R, Kochenderfer MJ (2020) A survey of algorithms for black-box safety validation. Preprint, submitted May 6, https://doi.org/10.48550/arXiv.2005.02979.
  • Devraj AM, Meyn SP (2017) Zap Q-learning. Proc. 31st Internat. Conf. Neural Inform. Processing Systems, 2232–2241.
  • Dong J, Zhu Y (2016) Three asymptotic regimes for ranking and selection with general sample distributions. Proc. 2016 Winter Simulation Conf. (IEEE, Piscataway, NJ), 277–288.
  • Feinberg EA, Rothblum UG (2012) Splitting randomized stationary policies in total-reward Markov decision processes. Math. Oper. Res. 37(1):129–153.
  • Gao S, Chen W (2016) Efficient feasibility determination with multiple performance measure constraints. IEEE Trans. Automat. Control. 62(1):113–122.
  • Gao S, Xiao H, Zhou E, Chen W (2017) Robust ranking and selection with optimal computing budget allocation. Automatica J. IFAC. 81:30–36.
  • Glynn P, Juneja S (2004) A large deviations perspective on ordinal optimization. Proc. 36th Winter Simulation Conf., 577–585.
  • Gordon GJ (1995) Stable function approximation in dynamic programming. Machine Learn. Proc. (Elsevier), 261–268.
  • Higle JL, Sen S (2013) Stochastic Decomposition: A Statistical Method for Large Scale Stochastic Linear Programming, vol. 8 (Springer Science & Business Media).
  • Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(Apr):1563–1600.
  • Jia QS (2012) Efficient computing budget allocation for simulation-based policy improvement. IEEE Trans. Autom. Sci. Engrg. 9(2):342–352.
  • Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Adv. Neural Inf. Process. Syst. 31:4863–4873.
  • Kakade SM (2003) On the sample complexity of reinforcement learning. PhD thesis, University College London.
  • Kalashnikov D, Irpan A, Pastor P, Ibarz J, Herzog A, Jang E, Quillen D, et al. (2018) QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. Preprint, submitted June 27, https://doi.org/10.48550/arxiv.1806.10293.
  • Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best-arm identification in multi-armed bandit models. J. Machine Learn. Res. 17(1):1–42.
  • Kearns M, Singh S (1998) Finite-sample convergence rates for Q-learning and indirect algorithms. Proc. Conf. Adv. Neural Inform. Processing Systems II, 996–1002.
  • Kim SH, Nelson BL (2007) Recent advances in ranking and selection. 2007 Winter Simulation Conf. (IEEE, Piscataway, NJ), 162–172.
  • Kiran BR, Sobh I, Talpaert V, Mannion P, Sallab AAA, Yogamani S, Pérez P (2020) Deep reinforcement learning for autonomous driving: A survey. Preprint, submitted February 2, https://doi.org/10.48550/arxiv.2002.00444.
  • Laber EB, Lizotte DJ, Qian M, Pelham WE, Murphy SA (2014) Dynamic treatment regimes: Technical challenges and applications. Electron. J. Stat. 8(1):1225.
  • Lam H (2016) Advanced tutorial: Input uncertainty and robust analysis in stochastic simulation. 2016 Winter Simulation Conf. (WSC) (IEEE, Piscataway, NJ), 178–192.
  • Lee LH, Pujowidianto NA, Li LW, Chen CH, Yap CM (2012) Approximate simulation budget allocation for selecting the best design in the presence of stochastic constraints. IEEE Trans. Automat. Control. 57(11):2940–2945.
  • Mak WK, Morton DP, Wood RK (1999) Monte Carlo bounding techniques for determining solution quality in stochastic programs. Oper. Res. Lett. 24(1–2):47–56.
  • Mannor S, Simester D, Sun P, Tsitsiklis JN (2004) Bias and variance in value function estimation. Proc. 21st Internat. Conf. Machine Learn. (ACM), 72.
  • Mannor S, Simester D, Sun P, Tsitsiklis JN (2007) Bias and variance approximation in value function estimates. Management Sci. 53(2):308–322.
  • Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(May):815–857.
  • Osband I, Russo D, Van Roy B (2013) (More) efficient reinforcement learning via posterior sampling. Adv. Neural Inf. Process. Syst. 26:3003–3011.
  • Peng Y, Chen CH, Fu MC, Hu JQ (2017) Gradient-based myopic allocation policy: An efficient sampling procedure in a low-confidence scenario. IEEE Trans. Automat. Control. 63(9):3091–3097.
  • Peng Y, Chen CH, Fu MC, Hu JQ, Ryzhov IO (2020) Efficient sampling allocation procedures for optimal quantile selection. INFORMS J. Comput. 33(1):230–245.
  • Peng Y, Chong EK, Chen CH, Fu MC (2018) Ranking and selection as stochastic control. IEEE Trans. Automat. Control. 63(8):2359–2373.
  • Putta SR, Tulabandhula T (2017) Pure exploration in episodic fixed-horizon Markov decision processes. AAMAS, 1703–1704.
  • Russo D (2016) Simple Bayesian algorithms for best arm identification. Conf. Learn. Theory, 1417–1418.
  • Ryzhov IO (2018) The local time method for targeting and selection. Oper. Res. 66(5):1406–1422.
  • Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180–195.
  • Schöner H (2017) The role of simulation in development and testing of autonomous vehicles. Driving Simulation Conf., Stuttgart.
  • Serfling RJ (2009) Approximation Theorems of Mathematical Statistics, vol. 162 (John Wiley & Sons, Hoboken, NJ).
  • Shah D, Xie Q (2018) Q-learning with nearest neighbors. Adv. Neural Inf. Process. Syst. 31:3111–3121.
  • Shapiro A, Dentcheva D, Ruszczyński A (2014) Lectures on Stochastic Programming: Modeling and Theory (SIAM, Philadelphia).
  • Shen H, Hong LJ, Zhang X (2021) Ranking and selection with covariates for personalized decision making. INFORMS J. Comput. 33(4):1259–1684.
  • Shin D, Broadie M, Zeevi A (2016) Tractable sampling strategies for quantile-based ordinal optimization. 2016 Winter Simulation Conf. (WSC) (IEEE, Piscataway, NJ), 847–858.
  • Song E, Nelson BL (2015) Quickly assessing contributions to input uncertainty. IIE Trans. 47(9):893–909.
  • Song E, Nelson BL, Pegden CD (2014) Advanced tutorial: Input uncertainty quantification. Tolk A, Diallo S, Ryzhov I, Yilmaz L, Buckley S, Miller J, eds. Proc. 2014 Winter Simulation Conf. (IEEE, Piscataway, NJ), 162–176.
  • Strehl AL, Littman ML (2008) An analysis of model-based interval estimation for Markov decision processes. J. Comput. System Sci. 74(8):1309–1331.
  • Xie W, Nelson BL, Barton RR (2014) A Bayesian framework for quantifying uncertainty in stochastic simulation. Oper. Res. 62(6):1439–1452.
  • Yang L, Wang M (2020) Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. Internat. Conf. Machine Learn. (PMLR), 10746–10756.
  • Yi Y, Xie W (2017) An efficient budget allocation approach for quantifying the impact of input uncertainty in stochastic simulation. ACM Trans. Model. Comput. Simul. 27(4):25.
  • Zhu H, Liu T, Zhou E (2020) Risk quantification in stochastic simulation under input uncertainty. ACM Trans. Model. Comput. Simul. 30(1):1–24.
  • Zouaoui F, Wilson JR (2004) Accounting for input-model and input-parameter uncertainties in simulation. IIE Trans. 36(11):1135–1151.