Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees
Published Online: 19 Sep 2024. https://doi.org/10.1287/opre.2022.0511

