Generalized Eluder Coefficient: A Unified Framework for Interactive Decision Making in Markov Decision Processes, Partially Observable Markov Decision Processes, and Beyond

Han Zhong
Han Zhong
[email protected]
https://orcid.org/0009-0009-5250-620X
Center for Data Science, Peking University, Beijing 100871, China
Search for more papers by this author
,
Wei Xiong
Wei Xiong
[email protected]
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, Illinois 61801
Search for more papers by this author
,
Sirui Zheng
Sirui Zheng
[email protected]
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60208
Search for more papers by this author
,
Liwei Wang
Liwei Wang
[email protected]
School of Intelligence Science and Technology, Peking University, Beijing 100871, China
Search for more papers by this author
,
Zhaoran Wang
Corresponding Author
Zhaoran Wang
[email protected]
https://orcid.org/0000-0002-1824-2580
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60208
Search for more papers by this author
,
Zhuoran Yang
Zhuoran Yang
[email protected]
https://orcid.org/0000-0001-5269-9958
Department of Statistics and Data Science, Yale University, New Haven, Connecticut 06511
Search for more papers by this author
,
Tong Zhang
Tong Zhang
[email protected]
https://orcid.org/0000-0002-5511-2558
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, Illinois 61801
Search for more papers by this author

Center for Data Science, Peking University, Beijing 100871, China

Search for more papers by this author

Wei Xiong

[email protected]

Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, Illinois 61801

Search for more papers by this author

Sirui Zheng

[email protected]

Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60208

Search for more papers by this author

Liwei Wang

[email protected]

School of Intelligence Science and Technology, Peking University, Beijing 100871, China

Search for more papers by this author

Zhaoran Wang

Corresponding Author

Zhaoran Wang

[email protected]

https://orcid.org/0000-0002-1824-2580

Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60208

Search for more papers by this author

Zhuoran Yang

[email protected]

https://orcid.org/0000-0001-5269-9958

Department of Statistics and Data Science, Yale University, New Haven, Connecticut 06511

Search for more papers by this author

Tong Zhang

[email protected]

https://orcid.org/0000-0002-5511-2558

Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, Illinois 61801

Search for more papers by this author

Published Online:15 Jun 2026https://doi.org/10.1287/moor.2023.0195

References

[1] Abbasi-Yadkori Y , Pál D , Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J , Zemel R , Bartlett P , Pereira F , Weinberger KQ , eds . Advances in Neural Information Processing Systems , vol. 24 (Curran Associates, Red Hook, NY), 2312–2320.Google Scholar
[2] Agarwal A , Zhang T (2022) Model-based RL with optimistic posterior sampling: Structural conditions and sample complexity. Preprint, submitted June 15, https://arxiv.org/abs/2206.07659.Google Scholar
[3] Agarwal A , Zhang T (2022) Non-linear reinforcement learning in large action spaces: Structural conditions and sample-efficiency of posterior sampling. Preprint, submitted March 15, https://arxiv.org/abs/2203.08248.Google Scholar
[4] Agrawal S , Jia R (2017) Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. Guyon I , Von Luxburg U , Bengio S , Wallach H , Fergus R , Vishwanathan S , Garnett R , eds . Advances in Neural Information Processing Systems , vol. 30 (Curran Associates, Red Hook, NY), 1184–1194.Google Scholar
[5] Auer P (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3(2002):397–422.Google Scholar
[6] Auer P , Cesa-Bianchi N , Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2):235–256.Crossref, Google Scholar
[7] Auer P , Jaksch T , Ortner R (2008) Near-optimal regret bounds for reinforcement learning. Koller D , Schuurmans D , Bengio Y , Bottou L , eds. Advances in Neural Information Processing Systems , vol. 21 (Curran Associates, Red Hook, NY), 89–96.Google Scholar
[8] Ayoub A , Jia Z , Szepesvari C , Wang M , Yang L (2020) Model-based reinforcement learning with value-targeted regression. Daumé H , Singh A , eds. Proc 37th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 119 (JMLR.org), 463–474.Google Scholar
[9] Azar MG , Osband I , Munos R (2017) Minimax regret bounds for reinforcement learning. Precup D , The YW , eds. Proc. 34th Internat. Conf. Machine Learn. , Proceedings of Machine Learning Research, vol. 70 (JMLR.org), 263–272.Google Scholar
[10] Azizzadenesheli K , Lazaric A , Anandkumar A (2016) Reinforcement learning of POMDPs using spectral methods. Feldman V , Rakhlin A , Shamir O , eds. Proc. 29th Annual Conf. Learn. Theory , Proceedings of Machine Learning Research, vol. 49 (JMLR.org), 193–256.Google Scholar
[11] Azuma K (1967) Weighted sums of certain dependent random variables. Tohoku Math. J. 19(3):357–367.Crossref, Google Scholar
[12] Bhatt S , Mao W , Koppel A , Başar T (2021) Semiparametric information state embedding for policy search under imperfect information. Prandini M, ed. Proc. 60th IEEE Conf. Decision Control (Institute of Electrical and Electronics Engineers, Piscataway, NJ), 4501–4506.Google Scholar
[13] Boots B , Siddiqi SM , Gordon GJ (2011) Closing the learning-planning loop with predictive state representations. Internat. J. Robotics Res. 30(7):954–966.Crossref, Google Scholar
[14] Cai Q , Yang Z , Wang Z (2022) Reinforcement learning from partial observation: Linear function approximation with provable sample efficiency. Chaudhuri K , Jegelka S , Song L , Szepesvari C , Niu G , Sabato S , eds. Proc. 39th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 162 (JMLR.org), 2485–2522.Google Scholar
[15] Cai Q , Yang Z , Jin C , Wang Z (2020) Provably efficient exploration in policy optimization. Daumé H , Singh A , eds. Proc 37th Internat. Conf. Machine Learn . Proceedings of Machine Learning Research, vol. 119 (JMLR.org), 1283–1294.Google Scholar
[16] Chen F , Bai Y , Mei S (2022) Partially observable RL with B-stability: Unified structural condition and sharp sample-efficient algorithms. Preprint, submitted September 29, https://arxiv.org/abs/2209.14990.Google Scholar
[17] Chen F , Mei S , Bai Y (2022) Unified algorithms for RL with decision-estimation coefficients: No-regret, PAC, and reward-free learning. Preprint, submitted September 23, https://arxiv.org/abs/2209.11745.Google Scholar
[18] Chen Z , Li CJ , Yuan A , Gu Q , Jordan MI (2022) A general framework for sample-efficient function approximation in reinforcement learning. Preprint, submitted September 30, https://arxiv.org/abs/2209.15634.Google Scholar
[19] Chua K , Calandra R , McAllister R , Levine S (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Bengio S , Wallach H , Larochelle H , Grauman K , Cesa-Bianchi N , Garnett R , eds. Advances in Neural Information Processing Systems , vol. 31 (Curran Associates, Red Hook, NY), 4754–4765.Google Scholar
[20] Dann C , Mohri M , Zhang T , Zimmert J (2021) A provably efficient model-free posterior sampling method for episodic reinforcement learning. Ranzato M , Beygelzimer A , Dauphin Y , Liang PS , Wortman Vaughan J , eds. Advances in Neural Information Processing Systems , vol. 34 (Curran Associates, Red Hook, NY), 12040–12051.Google Scholar
[21] Du S , Krishnamurthy A , Jiang N , Agarwal A , Dudik M , Langford J (2019) Provably efficient RL with rich observations via latent state decoding. Chaudhuri K , Salakhutdinov R , eds. Proc. 36th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 97 (JMLR.org), 1665–1674.Google Scholar
[22] Du S , Kakade S , Lee J , Lovett S , Mahajan G , Sun W , Wang R (2021) Bilinear classes: A structural framework for provable generalization in RL. Marina M , Zhang T , eds. Proc. 38th Internat. Conf. Machine Learn. , Proceedings of Machine Learning Research, vol. 139 (JMLR.org), 2826–2836.Google Scholar
[23] Efroni Y , Jin C , Krishnamurthy A , Miryoosefi S (2022) Provable reinforcement learning with a short-term memory. Preprint, submitted February 8, https://arxiv.org/abs/2202.03983.Google Scholar
[24] Foster DJ , Kakade SM , Qian J , Rakhlin A (2021) The statistical complexity of interactive decision making. Preprint, submitted December 27, https://arxiv.org/abs/2112.13487.Google Scholar
[25] Foster DJ , Golowich N , Qian J , Rakhlin A , Sekhari A (2023) Model-free reinforcement learning with the decision-estimation coefficient. Oh A , Naumann T , Globerson A , Saenko K , Hardt M , Levine S , eds. Advances in Neural Information Processing Systems , vol. 36 (Curran Associates, Red Hook, NY), 20080–20117.Crossref, Google Scholar
[26] Golowich N , Moitra A , Rohatgi D (2022) Learning in observable POMDPs, without computationally intractable oracles. Preprint, submitted June 7, https://arxiv.org/abs/2206.03446.Google Scholar
[27] Golowich N , Moitra A , Rohatgi D (2022) Planning in observable POMDPs in quasipolynomial time. Preprint, submitted January 12, https://arxiv.org/abs/2201.04735.Google Scholar
[28] Guo ZD , Doroudi S , Brunskill E (2016) A PAC RL algorithm for episodic POMDPs. Gretton A , Robert CC , eds. Proc. 19th Internat. Conf. Artificial Intelligence Statist. , Proceedings of Machine Learning Research, vol. 51 (JMLR.org), 510–518.Google Scholar
[29] Hefny A , Downey C , Gordon GJ (2015) Supervised learning for dynamical system learning. Cortes C , Lawrence N , Lee D , Sugiyama M , Garnett R , eds. Advances in Neural Information Processing Systems , vol. 28 (Curran Associates, Red Hook, NY), 1963–1971.Google Scholar
[30] Jaeger H (2000) Observable operator models for discrete stochastic time series. Neural Comput. 12(6):1371–1398.Crossref, Google Scholar
[31] Jaksch T , Ortner R , Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine. Learn. Res. 11(51):1563–1600.Google Scholar
[32] Jiang N , Kulesza A , Singh S (2018) Completing state representations using spectral learning. Bengio S , Wallach H , Larochelle H , Grauman K , Cesa-Bianchi N , Garnett R , eds. Advances in Neural Information Processing Systems , vol. 31 (Curran Associates, Red Hook, NY), 4328–4337.Google Scholar
[33] Jiang N , Krishnamurthy A , Agarwal A , Langford J , Schapire RE (2017) Contextual decision processes with low Bellman rank are PAC-learnable. Precup D , The YW , eds. Proc. 34th Internat. Conf. Machine Learn. , Proceedings of Machine Learning Research, vol. 70 (JMLR.org), 1704–1713.Google Scholar
[34] Jiang N , Krishnamurthy A , Agarwal A , Langford J , Schapire RE (2017) Contextual decision processes with low Bellman rank are PAC-learnable. Precup D , The YW , eds. Proc. 34th Internat. Conf. Machine Learn. , Proceedings of Machine Learning Research, vol. 70 (JMLR.org), 1704–1713.Google Scholar
[35] Jin C , Liu Q , Miryoosefi S (2021) Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Ranzato M , Beygelzimer A , Dauphin Y , Liang PS , Wortman Vaughan J , eds. Advances in Neural Information Processing Systems , vol. 34 (Curran Associates, Red Hook, NY), 13406–13418.Google Scholar
[36] Jin C , Allen-Zhu Z , Bubeck S , Jordan MI (2018) Is Q-learning provably efficient? Bengio S , Wallach H , Larochelle H , Grauman K , Cesa-Bianchi N , Garnett R , eds . Advances in Neural Information Processing Systems , vol. 31 (Curran Associates, Red Hook, NY), 4863–4873.Google Scholar
[37] Jin C , Kakade S , Krishnamurthy A , Liu Q (2020) Sample-efficient reinforcement learning of undercomplete POMDPs. Larochelle H , Ranzato M , Hadsell R , Balcan MF , Lin H , eds. Advances in Neural Information Processing Systems , vol. 33 (Curran Associates, Red Hook, NY), 18530–18539.Google Scholar
[38] Jin C , Yang Z , Wang Z , Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Abernethy J , Agarwal S , eds. Proc. 33rd Conf. Learn. Theory , Proceedings of Machine Learning Research, vol. 125 (JMLR.org), 2137–2143.Google Scholar
[39] Kakade S , Langford J (2002) Approximately optimal approximate reinforcement learning. Sammut C , Hoffmann AG , eds. Proc. 19th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers, San Francisco), 267–274.Google Scholar
[40] Kakade S , Krishnamurthy A , Lowrey K , Ohnishi M , Sun W (2020) Information theoretic regret bounds for online nonlinear control. Larochelle H , Ranzato M , Hadsell R , Balcan MF , Lin H , eds. Advances in Neural Information Processing Systems , vol. 33 (Curran Associates, Red Hook, NY), 15312–15325.Google Scholar
[41] Krishnamurthy A , Agarwal A , Langford J (2016) PAC reinforcement learning with rich observations. Lee D , Sugiyama M , Luxburg U , Guyon I , Garnett R , eds. Advances in Neural Information Processing Systems , vol. 29 (Curran Associates, Red Hook, NY), 1840–1848.Google Scholar
[42] Kwon J , Efroni Y , Caramanis C , Mannor S (2021) RL for latent MDPs: Regret guarantees and a lower bound. Ranzato M , Beygelzimer A , Dauphin Y , Liang PS , Wortman Vaughan J , eds. Advances in Neural Information Processing Systems , vol. 34 (Curran Associates, Red Hook, NY), 24523–24534.Google Scholar
[43] Littman M , Sutton RS (2001) Predictive representations of state. Dietterich T , Becker S , Ghahramani Z , eds. Advances in Neural Information Processing Systems , vol. 14 (Curran Associates, Red Hook, NY), 1555–1561.Google Scholar
[44] Liu Q , Chung A , Szepesvári C , Jin C (2022) When is partially observable reinforcement learning not scary? Preprint, submitted April 19, https://arxiv.org/abs/2204.08967.Google Scholar
[45] Liu Q , Netrapalli P , Szepesvari C , Jin C (2022) Optimistic MLE—A generic model-based algorithm for partially observable sequential decision making. Preprint, submitted September 29, https://arxiv.org/abs/2209.14997.Google Scholar
[46] Liu Z , Lu M , Xiong W , Zhong H , Hu H , Zhang S , Zheng S , Yang Z , Wang Z (2024) Maximize to explore: One objective function fusing estimation, planning, and exploration. Oh A , Naumann T , Globerson A , Saenko K , Hardt M , Levine S , eds . Advances in Neural Information Processing Systems , vol. 36 (Curran Associates, Red Hook, NY), 22151–22165.Google Scholar
[47] Lu X , Van Roy B (2017) Ensemble sampling. Guyon I , Von Luxburg U , Bengio S , Wallach H , Fergus R , Vishwanathan S , Garnett R , eds. Advances in Neural Information Processing Systems , vol. 30 (Curran Associates, Red Hook, NY), 3258–3266.Crossref, Google Scholar
[48] Modi A , Jiang N , Tewari A , Singh S (2020) Sample complexity of reinforcement learning using linearly combined model ensembles. Chiappa S , Calandra R , eds. Proc. 23rd Internat. Conf. Artificial Intelligence Statist. , Proceedings of Machine Learning Research, vol. 108 (JMLR.org), 2010–2020.Google Scholar
[49] Munos R (2003) Error bounds for approximate policy iteration. Fawcett T, Mishra N, eds. Internat. Conf. Machine Learn. (Washington, DC), vol. 3, 560–567.Google Scholar
[50] Nagabandi A , Konolige K , Levine S , Kumar V (2020) Deep dynamics models for learning dexterous manipulation. Kaelbling LP , Kragic D , Sugiura K , eds. Proc. Conf. Robot Learn ., Proceedings of Machine Learning Research, vol. 100 (JMLR.org), 1101–1112.Google Scholar
[51] Osband I , Van Roy B (2014) Model-based reinforcement learning and the eluder dimension. Ghahramani Z , Welling M , Cortes C , Lawrence N , Weinberger KQ , eds . Advances in Neural Information Processing Systems , vol. 27 (Curran Associates, Red Hook, NY), 1466–1474.Google Scholar
[52] Osband I , Van Roy B , Wen Z (2016) Generalization and exploration via randomized value functions. Balcan MF , Weinberger KQ , eds. Proc. 33rd Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 48 (JMLR.org), 2377–2386.Google Scholar
[53] Osband I , Blundell C , Pritzel A , Van Roy B (2016) Deep exploration via bootstrapped DQN. Lee D , Sugiyama M , Luxburg U , Guyon I , Garnett R , eds . Advances in Neural Information Processing Systems , vol. 29 (Curran Associates, Red Hook, NY), 4026–4034.Google Scholar
[54] Papadimitriou CH , Tsitsiklis JN (1987) The complexity of Markov decision processes. Math. Oper. Res. 12(3):441–450.Link, Google Scholar
[55] Roberts GO , Tweedie RL (1996) Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4):341–363.Crossref, Google Scholar
[56] Russo D (2019) Worst-case regret bounds for exploration via randomized value functions. Wallach H , Larochelle H , Beygelzimer A , d’Alché-Buc F , Fox E , Garnett R , eds . Advances in Neural Information Processing Systems , vol. 32 (Curran Associates, Red Hook, NY), 14433–14443.Google Scholar
[57] Russo D , Van Roy B (2013) Eluder dimension and the sample complexity of optimistic exploration. Burges CJ , Bottou L , Welling M , Ghahramani Z , Weinberger KQ , eds . Advances in Neural Information Processing Systems , vol. 26 (Curran Associates, Red Hook, NY), 2256–2264.Google Scholar
[58] Russo D , Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
[59] Singh S , James M , Rudary M (2012) Predictive state representations: A new theory for modeling dynamical systems. Preprint, submitted July 11, https://arxiv.org/abs/1207.4167.Google Scholar
[60] Strens M (2000) A Bayesian framework for reinforcement learning. Langley P , ed. Proc. 17th Internat. Conf. Machine Learn . (Morgan Kaufmann Publishers, San Francisco), 943–950.Google Scholar
[61] Subramanian J , Sinha A , Seraj R , Mahajan A (2022) Approximate information state for approximate planning and reinforcement learning in partially observed systems. J. Machine Learn. Res. 23(12):1–83.Google Scholar
[62] Sun W , Jiang N , Krishnamurthy A , Agarwal A , Langford J (2019) Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. Beygelzimer A , Hsu D , eds. Proc. 32nd Conf. Learn. Theory , Proceedings of Machine Learning Research, vol. 99 (JMLR.org), 2898–2933.Google Scholar
[63] Sutton RS , Barto AG (2018) Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning Series, vol. 1, no. 2, 2nd ed. (MIT Press, Cambridge, MA), 25.Google Scholar
[64] Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3–4):285–294.Crossref, Google Scholar
[65] Uehara M , Sekhari A , Lee JD , Kallus N , Sun W (2022) Computationally efficient PAC RL in POMDPs with latent determinism and conditional embeddings. Preprint, submitted June 24, https://arxiv.org/abs/2206.12081.Google Scholar
[66] Uehara M , Sekhari A , Lee JD , Kallus N , Sun W (2022) Provably efficient reinforcement learning in partially observable dynamical systems. Preprint, submitted June 24, https://arxiv.org/abs/2206.12020.Google Scholar
[67] Van Handel R (2014) Probability in high dimension. Technical report, Princeton University, Princeton, NJ.Google Scholar
[68] Wang R , Salakhutdinov RR , Yang L (2020) Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Larochelle H , Ranzato M , Hadsell R , Balcan MF , Lin H , eds. Advances in Neural Information Processing Systems , vol. 33 (Curran Associates, Red Hook, NY), 6123–6135.Google Scholar
[69] Wang L , Cai Q , Yang Z , Wang Z (2022) Embed to control partially observed systems: Representation learning with provable sample efficiency. Preprint, submitted May 26, https://arxiv.org/abs/2205.13476.Google Scholar
[70] Wang Y , Wang R , Du SS , Krishnamurthy A (2019) Optimism in reinforcement learning with generalized linear function approximation. Preprint, submitted December 9, https://arxiv.org/abs/1912.04136.Google Scholar
[71] Welling M , Teh YW (2011) Bayesian learning via stochastic gradient Langevin dynamics. Getoor L, Scheffer T, eds. Proc. 28th Internat. Conf. Machine Learn. (Bellevue, Washington) , 681–688.Google Scholar
[72] Xiong Y , Chen N , Gao X , Zhou X (2021) Sublinear regret for learning POMDPs. Preprint, submitted July 8, https://arxiv.org/abs/2107.03635.Google Scholar
[73] Yang L , Wang M (2019) Sample-optimal parametric Q-learning using linearly additive features. Chaudhuri K , Salakhutdinov R , eds. Proc. 36th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 97 (JMLR.org), 6995–7004.Google Scholar
[74] Zanette A , Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Chaudhuri K , Salakhutdinov R , eds. Proc. 36th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 97 (JMLR.org), 7304–7312.Google Scholar
[75] Zanette A , Lazaric A , Kochenderfer M , Brunskill E (2020) Learning near optimal policies with low inherent bellman error. Daumé H , Singh A , eds. Proc 37th Internat. Conf. Machine Learn ., Proceedings of Machine Learning Research, vol. 119 (JMLR.org), 10978–10989.Google Scholar
[76] Zanette A , Brandfonbrener D , Brunskill E , Pirotta M , Lazaric A (2020) Frequentist regret bounds for randomized least-squares value iteration. Chiappa S , Calandra R , eds. Proc. 23rd Internat. Conf. Artificial Intelligence Statist. , Proceedings of Machine Learning Research, vol. 108 (JMLR.org), 1954–1964.Google Scholar
[77] Zhan W , Uehara M , Sun W , Lee JD (2022) PAC reinforcement learning for predictive state representations. Preprint, submitted July 12, https://arxiv.org/abs/2207.05738.Google Scholar
[78] Zhang T (2006) From ε -entropy to KL-entropy: Analysis of minimum information complexity density estimation. Ann. Statist. 34(5):2180–2210.Crossref, Google Scholar
[79] Zhang T (2022) Feel-good Thompson sampling for contextual bandits and reinforcement learning. SIAM J. Math. Data Sci. 4(2):834–857.Crossref, Google Scholar
[80] Zhang Z , Ji X , Du S (2021) Is reinforcement learning more difficult than bandits? A near-optimal algorithm escaping the curse of horizon. Belkin M , Kpotufe S , eds. Proc. 34th Conf. Learn. Theory , Proceedings of Machine Learning Research, vol. 134 (JMLR.org), 4528–4531.Google Scholar
[81] Zhang Z , Yang Z , Liu H , Tokekar P , Huang F (2021) Reinforcement learning under a multi-agent predictive state representation model: Method and theory. Finn C, Choi Y, Deisenroth M, eds. Proc. Internat. Conf. Learn. Representations (OpenReview).Google Scholar
[82] Zhong H , Xiong W , Zheng S , Wang L , Wang Z , Yang Z , Zhang T (2022) GEC: A unified framework for interactive decision making in MDP, POMDP, and beyond. Preprint, submitted November 3, https://arxiv.org/abs/2211.01962.Google Scholar
[83] Zhou D , Gu Q , Szepesvari C (2021) Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. Belkin M , Kpotufe S , eds. Proc. 34th Conf. Learn. Theory , Proceedings of Machine Learning Research, vol. 134 (JMLR.org), 4532–4576.Google Scholar

cover image Mathematics of Operations Research

Articles In Advance

Article Information

Metrics

Information

Received:April 19, 2023
Accepted:May 17, 2025
Published Online:June 15, 2026

Cite as

Han Zhong , Wei Xiong , Sirui Zheng , Liwei Wang , Zhaoran Wang , Zhuoran Yang , Tong Zhang (2026) Generalized Eluder Coefficient: A Unified Framework for Interactive Decision Making in Markov Decision Processes, Partially Observable Markov Decision Processes, and Beyond. Mathematics of Operations Research 0(0).

https://doi.org/10.1287/moor.2023.0195

Keywords

Acknowledgments

The authors thank Yu Bai and Song Mei for pointing out a technical issue in the first version of this paper regarding the ℓ2 eluder technique.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Generalized Eluder Coefficient: A Unified Framework for Interactive Decision Making in Markov Decision Processes, Partially Observable Markov Decision Processes, and Beyond

References

Articles In Advance

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News