Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees

Published Online: https://doi.org/10.1287/opre.2022.0511

References

  • Adusumilli K, Eckardt D (2019) Temporal-difference estimation of dynamic discrete choice models. Preprint, submitted December 19, https://arxiv.org/abs/1912.09509.
  • Aguirregabiria V, Mira P (2002) Swapping the nested fixed point algorithm: A class of estimators for discrete Markov decision models. Econometrica 70(4):1519–1543.
  • Bajari P, Benkard CL, Levin J (2007) Estimating dynamic models of imperfect competition. Econometrica 75(5):1331–1370.
  • Bhandari J, Russo D, Singal R (2018) A finite time analysis of temporal difference learning with linear function approximation. Proc. Conf. Learn. Theory (PMLR, New York), 1691–1692.
  • Borkar VS (1997) Stochastic approximation with two time scales. Systems Control Lett. 29(5):291–294.
  • Cayci S, He N, Srikant R (2021) Linear convergence of entropy-regularized natural policy gradient with linear function approximation. Preprint, submitted June 8, https://arxiv.org/abs/2106.04096.
  • Cen S, Cheng C, Chen Y, Wei Y, Chi Y (2022) Fast global convergence of natural policy gradient methods with entropy regularization. Oper. Res. 70(4):2563–2578.
  • Chen T, Sun Y, Yin W (2021) Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Adv. Neural Inform. Processing Systems 34:25294–25307.
  • Chernozhukov V, Escanciano JC, Ichimura H, Newey WK, Robins JM (2022) Locally robust semiparametric estimation. Econometrica 90(4):1501–1535.
  • Du SS, Zhai X, Poczos B, Singh A (2019) Gradient descent provably optimizes over-parameterized neural networks. Proc. Internat. Conf. Learn. Representations (OpenReview.net).
  • Fu J, Luo K, Levine S (2017) Learning robust rewards with adversarial inverse reinforcement learning. Preprint, submitted October 30, https://arxiv.org/abs/1710.11248.
  • Gangwani T, Peng J (2020) State-only imitation with transition dynamics mismatch. Proc. Internat. Conf. Learn. Representations (OpenReview.net).
  • Garg D, Chakraborty S, Cundy C, Song J, Ermon S (2021) IQ-Learn: Inverse soft-Q learning for imitation. Adv. Neural Inform. Processing Systems 34:4028–4039.
  • Guan Z, Xu T, Liang Y (2021) When will generative adversarial imitation learning algorithms attain global convergence. Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1117–1125.
  • Haarnoja T, Tang H, Abbeel P, Levine S (2017) Reinforcement learning with deep energy-based policies. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 1352–1361.
  • Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 1861–1870.
  • Hansen LP, Miao J (2018) Aversion to ambiguity and model misspecification in dynamic stochastic environments. Proc. Natl. Acad. Sci. USA 115(37):9163–9168.
  • Ho J, Ermon S (2016) Generative adversarial imitation learning. Proc. 30th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 4572–4580.
  • Hong M, Wai HT, Wang Z, Yang Z (2020) A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. Preprint, submitted July 10, https://arxiv.org/abs/2007.05170.
  • Hotz VJ, Miller RA (1993) Conditional choice probabilities and the estimation of dynamic models. Rev. Econom. Stud. 60(3):497–529.
  • Hotz VJ, Miller RA, Sanders S, Smith J (1994) A simulation estimator for dynamic models of discrete choice. Rev. Econom. Stud. 61:265–289.
  • Jacot A, Gabriel F, Hongler C (2018) Neural tangent kernel: Convergence and generalization in neural networks. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 8580–8589.
  • Jin C, Netrapalli P, Jordan M (2020) What is local optimality in nonconvex-nonconcave minimax optimization? Proc. Internat. Conf. Machine Learn. (PMLR, New York), 4880–4889.
  • Khanduri P, Zeng S, Hong M, Wai HT, Wang Z, Yang Z (2021) A near-optimal algorithm for stochastic bilevel optimization via double-momentum. Adv. Neural Inform. Processing Systems 34:30271–30283.
  • Kiran BR, Sobh I, Talpaert V, Mannion P, Al Sallab AA, Yogamani S, Pérez P (2021) Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intelligent Transportation Systems 23(6):4909–4926.
  • Konda V, Tsitsiklis J (1999) Actor-critic algorithms. Solla S, Leen T, Müller K, eds. Advances in Neural Information Processing Systems, vol. 12 (MIT Press, Cambridge, MA).
  • Liu F, Ling Z, Mu T, Su H (2020) State alignment-based imitation learning. Proc. Internat. Conf. Learn. Representations (OpenReview.net).
  • Mai T, Jaillet P (2020) A relation analysis of Markov decision process frameworks. Preprint, submitted August 18, https://arxiv.org/abs/2008.07820.
  • Matějka F, McKay A (2015) Rational inattention to discrete choices: A new foundation for the multinomial logit model. Amer. Econom. Rev. 105(1):272–298.
  • Ni T, Sikchi H, Wang Y, Gupta T, Lee L, Eysenbach B (2020) f-IRL: Inverse reinforcement learning via state marginal matching. Preprint, submitted November 9, https://arxiv.org/abs/2011.04709.
  • Ortega PA, Braun DA (2013) Thermodynamics as a theory of decision-making with information-processing costs. Proc. A 469(2153):20120683.
  • Pomerleau DA (1988) ALVINN: An autonomous land vehicle in a neural network. Proc. 1st Internat. Conf. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 305–313.
  • Reich G (2018) Divide and conquer: Recursive likelihood function integration for hidden Markov models with continuous latent variables. Oper. Res. 66(6):1457–1470.
  • Rust J (1987) Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher. Econometrica 55(5):999–1033.
  • Rust J (1994) Structural estimation of Markov decision processes. Handbook of Econometrics, vol. 4 (Elsevier, Amsterdam), 3081–3143.
  • Sanghvi N, Usami S, Sharma M, Groeger J, Kitani K (2021) Inverse reinforcement learning with explicit policy estimates. Proc. Conf. AAAI Artificial Intelligence 35:9472–9480.
  • Su CL, Judd KL (2012) Constrained optimization approaches to estimation of structural models. Econometrica 80(5):2213–2230.
  • Tishby N, Polani D (2011) Information theory of decisions and actions. Perception-Action Cycle (Springer, Berlin), 601–636.
  • Todorov E, Erez T, Tassa Y (2012) MuJoCo: A physics engine for model-based control. Proc. IEEE/RSJ Internat. Conf. Intelligent Robots Systems (IEEE, Piscataway, NJ), 5026–5033.
  • Viano L, Huang YT, Kamalaruban P, Weller A, Cevher V (2021) Robust inverse reinforcement learning under transition dynamics mismatch. Advances in Neural Information Processing Systems, vol. 34 (Curran Associates Inc., Red Hook, NY), 25917–25931.
  • Wu YF, Zhang W, Xu P, Gu Q (2020) A finite-time analysis of two time-scale actor-critic methods. Adv. Neural Inform. Processing Systems 33:17617–17628.
  • Wulfmeier M, Ondruska P, Posner I (2015) Maximum entropy deep inverse reinforcement learning. Preprint, submitted July 17, https://arxiv.org/abs/1507.04888.
  • Xu T, Wang Z, Liang Y (2020) Improving sample complexity bounds for (natural) actor-critic algorithms. Adv. Neural Inform. Processing Systems 33:4358–4369.
  • Yu C, Liu J, Nemati S, Yin G (2021) Reinforcement learning in healthcare: A survey. ACM Comput. Surveys 55(1):1–36.
  • Zeng S, Li C, Garcia A, Hong M (2023) When demonstrations meet generative world models: A maximum likelihood framework for offline inverse reinforcement learning. Proc. 37th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 65531–65565.
  • Ziebart BD, Bagnell JA, Dey AK (2010) Modeling interaction via the principle of maximum causal entropy. Proc. Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 1255–1262.
  • Ziebart BD, Bagnell JA, Dey AK (2013) The principle of maximum causal entropy for estimating interacting processes. IEEE Trans. Inform. Theory 59(4):1966–1980.
  • Ziebart BD, Maas AL, Bagnell JA, Dey AK, et al. (2008) Maximum entropy inverse reinforcement learning. Proc. Conf. AAAI Artificial Intelligence 8:1433–1438.
  • Zou S, Xu T, Liang Y (2019) Finite-sample analysis for SARSA with linear function approximation. Adv. Neural Inform. Processing Systems, vol. 32 (Curran Associates Inc., Red Hook, NY).