Generalized Eluder Coefficient: A Unified Framework for Interactive Decision Making in Markov Decision Processes, Partially Observable Markov Decision Processes, and Beyond

Published Online:https://doi.org/10.1287/moor.2023.0195

References

  • [1] Abbasi-Yadkori Y , Pál D , Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J , Zemel R , Bartlett P , Pereira F , Weinberger KQ , eds . Advances in Neural Information Processing Systems , vol. 24 (Curran Associates, Red Hook, NY), 2312–2320.Google Scholar
  • [2] Agarwal A , Zhang T (2022) Model-based RL with optimistic posterior sampling: Structural conditions and sample complexity. Preprint, submitted June 15, https://arxiv.org/abs/2206.07659.Google Scholar
  • [3] Agarwal A , Zhang T (2022) Non-linear reinforcement learning in large action spaces: Structural conditions and sample-efficiency of posterior sampling. Preprint, submitted March 15, https://arxiv.org/abs/2203.08248.Google Scholar
  • [4] Agrawal S , Jia R (2017) Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. Guyon I , Von Luxburg U , Bengio S , Wallach H , Fergus R , Vishwanathan S , Garnett R , eds . Advances in Neural Information Processing Systems , vol. 30 (Curran Associates, Red Hook, NY), 1184–1194.Google Scholar
  • [5] Auer P (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3(2002):397–422.Google Scholar
  • [6] Auer P , Cesa-Bianchi N , Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2):235–256.CrossrefGoogle Scholar
  • [7] Auer P , Jaksch T , Ortner R (2008) Near-optimal regret bounds for reinforcement learning. Koller D , Schuurmans D , Bengio Y , Bottou L , eds. Advances in Neural Information Processing Systems , vol. 21 (Curran Associates, Red Hook, NY), 89–96.Google Scholar
  • [8] Ayoub A , Jia Z , Szepesvari C , Wang M , Yang L (2020) Model-based reinforcement learning with value-targeted regression. Daumé H , Singh A , eds. Proc 37th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 119 (JMLR.org), 463–474.Google Scholar
  • [9] Azar MG , Osband I , Munos R (2017) Minimax regret bounds for reinforcement learning. Precup D , The YW , eds. Proc. 34th Internat. Conf. Machine Learn. , Proceedings of Machine Learning Research, vol. 70 (JMLR.org), 263–272.Google Scholar
  • [10] Azizzadenesheli K , Lazaric A , Anandkumar A (2016) Reinforcement learning of POMDPs using spectral methods. Feldman V , Rakhlin A , Shamir O , eds. Proc. 29th Annual Conf. Learn. Theory , Proceedings of Machine Learning Research, vol. 49 (JMLR.org), 193–256.Google Scholar
  • [11] Azuma K (1967) Weighted sums of certain dependent random variables. Tohoku Math. J. 19(3):357–367.CrossrefGoogle Scholar
  • [12] Bhatt S , Mao W , Koppel A , Başar T (2021) Semiparametric information state embedding for policy search under imperfect information. Prandini M, ed. Proc. 60th IEEE Conf. Decision Control (Institute of Electrical and Electronics Engineers, Piscataway, NJ), 4501–4506.Google Scholar
  • [13] Boots B , Siddiqi SM , Gordon GJ (2011) Closing the learning-planning loop with predictive state representations. Internat. J. Robotics Res. 30(7):954–966.CrossrefGoogle Scholar
  • [14] Cai Q , Yang Z , Wang Z (2022) Reinforcement learning from partial observation: Linear function approximation with provable sample efficiency. Chaudhuri K , Jegelka S , Song L , Szepesvari C , Niu G , Sabato S , eds. Proc. 39th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 162 (JMLR.org), 2485–2522.Google Scholar
  • [15] Cai Q , Yang Z , Jin C , Wang Z (2020) Provably efficient exploration in policy optimization. Daumé H , Singh A , eds. Proc 37th Internat. Conf. Machine Learn . Proceedings of Machine Learning Research, vol. 119 (JMLR.org), 1283–1294.Google Scholar
  • [16] Chen F , Bai Y , Mei S (2022) Partially observable RL with B-stability: Unified structural condition and sharp sample-efficient algorithms. Preprint, submitted September 29, https://arxiv.org/abs/2209.14990.Google Scholar
  • [17] Chen F , Mei S , Bai Y (2022) Unified algorithms for RL with decision-estimation coefficients: No-regret, PAC, and reward-free learning. Preprint, submitted September 23, https://arxiv.org/abs/2209.11745.Google Scholar
  • [18] Chen Z , Li CJ , Yuan A , Gu Q , Jordan MI (2022) A general framework for sample-efficient function approximation in reinforcement learning. Preprint, submitted September 30, https://arxiv.org/abs/2209.15634.Google Scholar
  • [19] Chua K , Calandra R , McAllister R , Levine S (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Bengio S , Wallach H , Larochelle H , Grauman K , Cesa-Bianchi N , Garnett R , eds. Advances in Neural Information Processing Systems , vol. 31 (Curran Associates, Red Hook, NY), 4754–4765.Google Scholar
  • [20] Dann C , Mohri M , Zhang T , Zimmert J (2021) A provably efficient model-free posterior sampling method for episodic reinforcement learning. Ranzato M , Beygelzimer A , Dauphin Y , Liang PS , Wortman Vaughan J , eds. Advances in Neural Information Processing Systems , vol. 34 (Curran Associates, Red Hook, NY), 12040–12051.Google Scholar
  • [21] Du S , Krishnamurthy A , Jiang N , Agarwal A , Dudik M , Langford J (2019) Provably efficient RL with rich observations via latent state decoding. Chaudhuri K , Salakhutdinov R , eds. Proc. 36th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 97 (JMLR.org), 1665–1674.Google Scholar
  • [22] Du S , Kakade S , Lee J , Lovett S , Mahajan G , Sun W , Wang R (2021) Bilinear classes: A structural framework for provable generalization in RL. Marina M , Zhang T , eds. Proc. 38th Internat. Conf. Machine Learn. , Proceedings of Machine Learning Research, vol. 139 (JMLR.org), 2826–2836.Google Scholar
  • [23] Efroni Y , Jin C , Krishnamurthy A , Miryoosefi S (2022) Provable reinforcement learning with a short-term memory. Preprint, submitted February 8, https://arxiv.org/abs/2202.03983.Google Scholar
  • [24] Foster DJ , Kakade SM , Qian J , Rakhlin A (2021) The statistical complexity of interactive decision making. Preprint, submitted December 27, https://arxiv.org/abs/2112.13487.Google Scholar
  • [25] Foster DJ , Golowich N , Qian J , Rakhlin A , Sekhari A (2023) Model-free reinforcement learning with the decision-estimation coefficient. Oh A , Naumann T , Globerson A , Saenko K , Hardt M , Levine S , eds. Advances in Neural Information Processing Systems , vol. 36 (Curran Associates, Red Hook, NY), 20080–20117.CrossrefGoogle Scholar
  • [26] Golowich N , Moitra A , Rohatgi D (2022) Learning in observable POMDPs, without computationally intractable oracles. Preprint, submitted June 7, https://arxiv.org/abs/2206.03446.Google Scholar
  • [27] Golowich N , Moitra A , Rohatgi D (2022) Planning in observable POMDPs in quasipolynomial time. Preprint, submitted January 12, https://arxiv.org/abs/2201.04735.Google Scholar
  • [28] Guo ZD , Doroudi S , Brunskill E (2016) A PAC RL algorithm for episodic POMDPs. Gretton A , Robert CC , eds. Proc. 19th Internat. Conf. Artificial Intelligence Statist. , Proceedings of Machine Learning Research, vol. 51 (JMLR.org), 510–518.Google Scholar
  • [29] Hefny A , Downey C , Gordon GJ (2015) Supervised learning for dynamical system learning. Cortes C , Lawrence N , Lee D , Sugiyama M , Garnett R , eds. Advances in Neural Information Processing Systems , vol. 28 (Curran Associates, Red Hook, NY), 1963–1971.Google Scholar
  • [30] Jaeger H (2000) Observable operator models for discrete stochastic time series. Neural Comput. 12(6):1371–1398.CrossrefGoogle Scholar
  • [31] Jaksch T , Ortner R , Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine. Learn. Res. 11(51):1563–1600.Google Scholar
  • [32] Jiang N , Kulesza A , Singh S (2018) Completing state representations using spectral learning. Bengio S , Wallach H , Larochelle H , Grauman K , Cesa-Bianchi N , Garnett R , eds. Advances in Neural Information Processing Systems , vol. 31 (Curran Associates, Red Hook, NY), 4328–4337.Google Scholar
  • [33] Jiang N , Krishnamurthy A , Agarwal A , Langford J , Schapire RE (2017) Contextual decision processes with low Bellman rank are PAC-learnable. Precup D , The YW , eds. Proc. 34th Internat. Conf. Machine Learn. , Proceedings of Machine Learning Research, vol. 70 (JMLR.org), 1704–1713.Google Scholar
  • [34] Jiang N , Krishnamurthy A , Agarwal A , Langford J , Schapire RE (2017) Contextual decision processes with low Bellman rank are PAC-learnable. Precup D , The YW , eds. Proc. 34th Internat. Conf. Machine Learn. , Proceedings of Machine Learning Research, vol. 70 (JMLR.org), 1704–1713.Google Scholar
  • [35] Jin C , Liu Q , Miryoosefi S (2021) Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Ranzato M , Beygelzimer A , Dauphin Y , Liang PS , Wortman Vaughan J , eds. Advances in Neural Information Processing Systems , vol. 34 (Curran Associates, Red Hook, NY), 13406–13418.Google Scholar
  • [36] Jin C , Allen-Zhu Z , Bubeck S , Jordan MI (2018) Is Q-learning provably efficient? Bengio S , Wallach H , Larochelle H , Grauman K , Cesa-Bianchi N , Garnett R , eds . Advances in Neural Information Processing Systems , vol. 31 (Curran Associates, Red Hook, NY), 4863–4873.Google Scholar
  • [37] Jin C , Kakade S , Krishnamurthy A , Liu Q (2020) Sample-efficient reinforcement learning of undercomplete POMDPs. Larochelle H , Ranzato M , Hadsell R , Balcan MF , Lin H , eds. Advances in Neural Information Processing Systems , vol. 33 (Curran Associates, Red Hook, NY), 18530–18539.Google Scholar
  • [38] Jin C , Yang Z , Wang Z , Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Abernethy J , Agarwal S , eds. Proc. 33rd Conf. Learn. Theory , Proceedings of Machine Learning Research, vol. 125 (JMLR.org), 2137–2143.Google Scholar
  • [39] Kakade S , Langford J (2002) Approximately optimal approximate reinforcement learning. Sammut C , Hoffmann AG , eds. Proc. 19th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers, San Francisco), 267–274.Google Scholar
  • [40] Kakade S , Krishnamurthy A , Lowrey K , Ohnishi M , Sun W (2020) Information theoretic regret bounds for online nonlinear control. Larochelle H , Ranzato M , Hadsell R , Balcan MF , Lin H , eds. Advances in Neural Information Processing Systems , vol. 33 (Curran Associates, Red Hook, NY), 15312–15325.Google Scholar
  • [41] Krishnamurthy A , Agarwal A , Langford J (2016) PAC reinforcement learning with rich observations. Lee D , Sugiyama M , Luxburg U , Guyon I , Garnett R , eds. Advances in Neural Information Processing Systems , vol. 29 (Curran Associates, Red Hook, NY), 1840–1848.Google Scholar
  • [42] Kwon J , Efroni Y , Caramanis C , Mannor S (2021) RL for latent MDPs: Regret guarantees and a lower bound. Ranzato M , Beygelzimer A , Dauphin Y , Liang PS , Wortman Vaughan J , eds. Advances in Neural Information Processing Systems , vol. 34 (Curran Associates, Red Hook, NY), 24523–24534.Google Scholar
  • [43] Littman M , Sutton RS (2001) Predictive representations of state. Dietterich T , Becker S , Ghahramani Z , eds. Advances in Neural Information Processing Systems , vol. 14 (Curran Associates, Red Hook, NY), 1555–1561.Google Scholar
  • [44] Liu Q , Chung A , Szepesvári C , Jin C (2022) When is partially observable reinforcement learning not scary? Preprint, submitted April 19, https://arxiv.org/abs/2204.08967.Google Scholar
  • [45] Liu Q , Netrapalli P , Szepesvari C , Jin C (2022) Optimistic MLE—A generic model-based algorithm for partially observable sequential decision making. Preprint, submitted September 29, https://arxiv.org/abs/2209.14997.Google Scholar
  • [46] Liu Z , Lu M , Xiong W , Zhong H , Hu H , Zhang S , Zheng S , Yang Z , Wang Z (2024) Maximize to explore: One objective function fusing estimation, planning, and exploration. Oh A , Naumann T , Globerson A , Saenko K , Hardt M , Levine S , eds . Advances in Neural Information Processing Systems , vol. 36 (Curran Associates, Red Hook, NY), 22151–22165.Google Scholar
  • [47] Lu X , Van Roy B (2017) Ensemble sampling. Guyon I , Von Luxburg U , Bengio S , Wallach H , Fergus R , Vishwanathan S , Garnett R , eds. Advances in Neural Information Processing Systems , vol. 30 (Curran Associates, Red Hook, NY), 3258–3266.CrossrefGoogle Scholar
  • [48] Modi A , Jiang N , Tewari A , Singh S (2020) Sample complexity of reinforcement learning using linearly combined model ensembles. Chiappa S , Calandra R , eds. Proc. 23rd Internat. Conf. Artificial Intelligence Statist. , Proceedings of Machine Learning Research, vol. 108 (JMLR.org), 2010–2020.Google Scholar
  • [49] Munos R (2003) Error bounds for approximate policy iteration. Fawcett T, Mishra N, eds. Internat. Conf. Machine Learn. (Washington, DC), vol. 3, 560–567.Google Scholar
  • [50] Nagabandi A , Konolige K , Levine S , Kumar V (2020) Deep dynamics models for learning dexterous manipulation. Kaelbling LP , Kragic D , Sugiura K , eds. Proc. Conf. Robot Learn ., Proceedings of Machine Learning Research, vol. 100 (JMLR.org), 1101–1112.Google Scholar
  • [51] Osband I , Van Roy B (2014) Model-based reinforcement learning and the eluder dimension. Ghahramani Z , Welling M , Cortes C , Lawrence N , Weinberger KQ , eds . Advances in Neural Information Processing Systems , vol. 27 (Curran Associates, Red Hook, NY), 1466–1474.Google Scholar
  • [52] Osband I , Van Roy B , Wen Z (2016) Generalization and exploration via randomized value functions. Balcan MF , Weinberger KQ , eds. Proc. 33rd Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 48 (JMLR.org), 2377–2386.Google Scholar
  • [53] Osband I , Blundell C , Pritzel A , Van Roy B (2016) Deep exploration via bootstrapped DQN. Lee D , Sugiyama M , Luxburg U , Guyon I , Garnett R , eds . Advances in Neural Information Processing Systems , vol. 29 (Curran Associates, Red Hook, NY), 4026–4034.Google Scholar
  • [54] Papadimitriou CH , Tsitsiklis JN (1987) The complexity of Markov decision processes. Math. Oper. Res. 12(3):441–450.LinkGoogle Scholar
  • [55] Roberts GO , Tweedie RL (1996) Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4):341–363.CrossrefGoogle Scholar
  • [56] Russo D (2019) Worst-case regret bounds for exploration via randomized value functions. Wallach H , Larochelle H , Beygelzimer A , d’Alché-Buc F , Fox E , Garnett R , eds . Advances in Neural Information Processing Systems , vol. 32 (Curran Associates, Red Hook, NY), 14433–14443.Google Scholar
  • [57] Russo D , Van Roy B (2013) Eluder dimension and the sample complexity of optimistic exploration. Burges CJ , Bottou L , Welling M , Ghahramani Z , Weinberger KQ , eds . Advances in Neural Information Processing Systems , vol. 26 (Curran Associates, Red Hook, NY), 2256–2264.Google Scholar
  • [58] Russo D , Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.LinkGoogle Scholar
  • [59] Singh S , James M , Rudary M (2012) Predictive state representations: A new theory for modeling dynamical systems. Preprint, submitted July 11, https://arxiv.org/abs/1207.4167.Google Scholar
  • [60] Strens M (2000) A Bayesian framework for reinforcement learning. Langley P , ed. Proc. 17th Internat. Conf. Machine Learn . (Morgan Kaufmann Publishers, San Francisco), 943–950.Google Scholar
  • [61] Subramanian J , Sinha A , Seraj R , Mahajan A (2022) Approximate information state for approximate planning and reinforcement learning in partially observed systems. J. Machine Learn. Res. 23(12):1–83.Google Scholar
  • [62] Sun W , Jiang N , Krishnamurthy A , Agarwal A , Langford J (2019) Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. Beygelzimer A , Hsu D , eds. Proc. 32nd Conf. Learn. Theory , Proceedings of Machine Learning Research, vol. 99 (JMLR.org), 2898–2933.Google Scholar
  • [63] Sutton RS , Barto AG (2018) Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning Series, vol. 1, no. 2, 2nd ed. (MIT Press, Cambridge, MA), 25.Google Scholar
  • [64] Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3–4):285–294.CrossrefGoogle Scholar
  • [65] Uehara M , Sekhari A , Lee JD , Kallus N , Sun W (2022) Computationally efficient PAC RL in POMDPs with latent determinism and conditional embeddings. Preprint, submitted June 24, https://arxiv.org/abs/2206.12081.Google Scholar
  • [66] Uehara M , Sekhari A , Lee JD , Kallus N , Sun W (2022) Provably efficient reinforcement learning in partially observable dynamical systems. Preprint, submitted June 24, https://arxiv.org/abs/2206.12020.Google Scholar
  • [67] Van Handel R (2014) Probability in high dimension. Technical report, Princeton University, Princeton, NJ.Google Scholar
  • [68] Wang R , Salakhutdinov RR , Yang L (2020) Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Larochelle H , Ranzato M , Hadsell R , Balcan MF , Lin H , eds. Advances in Neural Information Processing Systems , vol. 33 (Curran Associates, Red Hook, NY), 6123–6135.Google Scholar
  • [69] Wang L , Cai Q , Yang Z , Wang Z (2022) Embed to control partially observed systems: Representation learning with provable sample efficiency. Preprint, submitted May 26, https://arxiv.org/abs/2205.13476.Google Scholar
  • [70] Wang Y , Wang R , Du SS , Krishnamurthy A (2019) Optimism in reinforcement learning with generalized linear function approximation. Preprint, submitted December 9, https://arxiv.org/abs/1912.04136.Google Scholar
  • [71] Welling M , Teh YW (2011) Bayesian learning via stochastic gradient Langevin dynamics. Getoor L, Scheffer T, eds. Proc. 28th Internat. Conf. Machine Learn. (Bellevue, Washington) , 681–688.Google Scholar
  • [72] Xiong Y , Chen N , Gao X , Zhou X (2021) Sublinear regret for learning POMDPs. Preprint, submitted July 8, https://arxiv.org/abs/2107.03635.Google Scholar
  • [73] Yang L , Wang M (2019) Sample-optimal parametric Q-learning using linearly additive features. Chaudhuri K , Salakhutdinov R , eds. Proc. 36th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 97 (JMLR.org), 6995–7004.Google Scholar
  • [74] Zanette A , Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Chaudhuri K , Salakhutdinov R , eds. Proc. 36th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 97 (JMLR.org), 7304–7312.Google Scholar
  • [75] Zanette A , Lazaric A , Kochenderfer M , Brunskill E (2020) Learning near optimal policies with low inherent bellman error. Daumé H , Singh A , eds. Proc 37th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 119 (JMLR.org), 10978–10989.Google Scholar
  • [76] Zanette A , Brandfonbrener D , Brunskill E , Pirotta M , Lazaric A (2020) Frequentist regret bounds for randomized least-squares value iteration. Chiappa S , Calandra R , eds. Proc. 23rd Internat. Conf. Artificial Intelligence Statist. , Proceedings of Machine Learning Research, vol. 108 (JMLR.org), 1954–1964.Google Scholar
  • [77] Zhan W , Uehara M , Sun W , Lee JD (2022) PAC reinforcement learning for predictive state representations. Preprint, submitted July 12, https://arxiv.org/abs/2207.05738.Google Scholar
  • [78] Zhang T (2006) From ε -entropy to KL-entropy: Analysis of minimum information complexity density estimation. Ann. Statist. 34(5):2180–2210.CrossrefGoogle Scholar
  • [79] Zhang T (2022) Feel-good Thompson sampling for contextual bandits and reinforcement learning. SIAM J. Math. Data Sci. 4(2):834–857.CrossrefGoogle Scholar
  • [80] Zhang Z , Ji X , Du S (2021) Is reinforcement learning more difficult than bandits? A near-optimal algorithm escaping the curse of horizon. Belkin M , Kpotufe S , eds. Proc. 34th Conf. Learn. Theory , Proceedings of Machine Learning Research, vol. 134 (JMLR.org), 4528–4531.Google Scholar
  • [81] Zhang Z , Yang Z , Liu H , Tokekar P , Huang F (2021) Reinforcement learning under a multi-agent predictive state representation model: Method and theory. Finn C, Choi Y, Deisenroth M, eds. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
  • [82] Zhong H , Xiong W , Zheng S , Wang L , Wang Z , Yang Z , Zhang T (2022) GEC: A unified framework for interactive decision making in MDP, POMDP, and beyond. Preprint, submitted November 3, https://arxiv.org/abs/2211.01962.Google Scholar
  • [83] Zhou D , Gu Q , Szepesvari C (2021) Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. Belkin M , Kpotufe S , eds. Proc. 34th Conf. Learn. Theory , Proceedings of Machine Learning Research, vol. 134 (JMLR.org), 4532–4576.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.