Uncertainty Quantification and Exploration for Reinforcement Learning

Yi Zhu
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60208

Jing Dong (Corresponding Author)
https://orcid.org/0000-0001-6387-4088
Decision, Risk, and Operations Division, Columbia Business School, New York, New York 10027

Henry Lam
https://orcid.org/0000-0002-3193-563X
Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027

Published Online: 2 Mar 2023
https://doi.org/10.1287/opre.2023.2436

Volume 72, Issue 4

July-August 2024

Pages iii-vi, 1317-1750, C2-C3

Received: May 25, 2020
Accepted: November 23, 2022
Published Online: March 2, 2023

Cite as

Yi Zhu, Jing Dong, Henry Lam (2023) Uncertainty Quantification and Exploration for Reinforcement Learning. Operations Research 72(4):1689-1709.

https://doi.org/10.1287/opre.2023.2436
