Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees

Siliang Zeng
[email protected]
https://orcid.org/0009-0006-0765-5028
Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, Minnesota 55455

Mingyi Hong (Corresponding Author)
[email protected]
https://orcid.org/0000-0003-1263-9365
Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, Minnesota 55455

Alfredo Garcia
[email protected]
https://orcid.org/0000-0002-2761-7479
Department of Industrial and Systems Engineering, Texas A&M University College of Engineering, College Station, Texas 77843


Published Online: 19 Sep 2024
https://doi.org/10.1287/opre.2022.0511


Volume 73, Issue 2

March-April 2025

Pages iii-viii, 583-1150, C2-C3


Information

Received: September 30, 2022
Accepted: July 4, 2024
Published Online: September 19, 2024

Cite as

Siliang Zeng, Mingyi Hong, Alfredo Garcia (2024) Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees. Operations Research 73(2):720-737.

https://doi.org/10.1287/opre.2022.0511
