Offline Reinforcement Learning for Human-Guided Human-Machine Interaction with Private Information

Published Online: https://doi.org/10.1287/mnsc.2022.04112

References

  • Angrist J, Imbens G (1995) Identification and estimation of local average treatment effects. Econometrica 62(2):467–475.
  • Angrist JD, Imbens GW, Rubin DB (1996) Identification of causal effects using instrumental variables. J. Amer. Statist. Assoc. 91(434):444–455.
  • Bennett A, Kallus N, Li L, Mousavi A (2021) Off-policy evaluation in infinite-horizon reinforcement learning with latent confounders. Proc. 24th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 130 (PMLR, New York), 1999–2007.
  • Cai Q, Yang Z, Jin C, Wang Z (2020) Provably efficient exploration in policy optimization. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (PMLR, New York), 1283–1294.
  • Chandak Y, Theocharous G, Kostas J, Jordan S, Thomas P (2019) Learning action representations for reinforcement learning. Proc. 36th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 97 (PMLR, New York), 941–950.
  • Chen X, Pouzo D (2012) Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica 80(1):277–321.
  • Chen X, Reiss M (2011) On rate optimality for ill-posed inverse problems in econometrics. Econom. Theory 27(3):497–521.
  • Chen S, Zhang B (2023) Estimating and improving dynamic treatment regimes with a time-varying instrumental variable. J. Roy. Statist. Soc. Ser. B Statist. Methodology 85(2):427–453.
  • Cheng C-A, Xie T, Jiang N, Agarwal A (2022) Adversarially trained actor critic for offline reinforcement learning. Chaudhuri K, Jegelka S, Song L, Szepesvari C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 162 (PMLR, New York), 3852–3878.
  • Farinelli A, Iocchi L, Nardi D (2004) Multirobot systems: A classification focused on coordination. IEEE Trans. Systems Man Cybernetics B Cybernetics 34(5):2015–2028.
  • Fu Z, Qi Z, Wang Z, Yang Z, Xu Y, Kosorok MR (2022) Offline reinforcement learning with instrumental variables in confounded Markov decision processes. Preprint, submitted September 18, https://arxiv.org/abs/2209.08666.
  • Gorecky D, Schmitt M, Loskyll M, Zühlke D (2014) Human-machine-interaction in the industry 4.0 era. Proc. 12th IEEE Internat. Conf. Industrial Inform. (IEEE, Piscataway, NJ), 289–294.
  • Guo H, Fu Z, Yang Z, Wang Z (2021) Decentralized single-timescale actor-critic on zero-sum two-player stochastic games. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (PMLR, New York), 3899–3909.
  • Hambly B, Xu R, Yang H (2023) Policy gradient methods find the Nash equilibrium in n-player general-sum linear-quadratic games. J. Machine Learn. Res. 24:1–56.
  • Hernán MA, Robins JM (2020) Causal Inference: What If (Chapman & Hall/CRC, Boca Raton, FL).
  • Hoc J-M (2000) From human–machine interaction to human–machine cooperation. Ergonomics 43(7):833–843.
  • Hong M, Qi Z, Xu Y (2024) A policy gradient method for confounded POMDPs. Twelfth Internat. Conf. Learn. Representations (OpenReview.net).
  • Huang B, Lee JD, Wang Z, Yang Z (2022) Towards general function approximation in zero-sum Markov games. Internat. Conf. Learn. Representations (OpenReview.net).
  • Jin Y, Yang Z, Wang Z (2021) Is pessimism provably efficient for offline RL? Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (PMLR, New York), 5084–5096.
  • Jin C, Yang Z, Wang Z, Jordan MI (2023) Provably efficient reinforcement learning with linear function approximation. Math. Oper. Res. 48(3):1496–1521.
  • Kallus N, Zhou A (2018) Confounding-robust policy improvement. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates Inc., Red Hook, NY).
  • Kallus N, Zhou A (2020) Confounding-robust policy evaluation in infinite-horizon reinforcement learning. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY), 22293–22304.
  • Kingma DP (2015) Adam: A method for stochastic optimization. Bengio Y, LeCun Y, eds. 3rd Internat. Conf. Learn. Representations, ICLR 2015 (San Diego).
  • Laffont J-J, Martimort D (2009) The Theory of Incentives: The Principal-Agent Model (Princeton University Press, Princeton, NJ).
  • Levine S, Kumar A, Tucker G, Fu J (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems. Preprint, submitted May 4, https://arxiv.org/abs/2005.01643.
  • Lewbel A (2012) Using heteroscedasticity to identify and estimate mismeasured and endogenous regressor models. J. Bus. Econom. Statist. 30(1):67–80.
  • Lewbel A (2018) Identification and estimation using heteroscedasticity without instruments: The binary endogenous regressor case. Econom. Lett. 165:10–12.
  • Lewis M, Yarats D, Dauphin YN, Parikh D, Batra D (2017) Deal or no deal? End-to-end learning for negotiation dialogues. Preprint, submitted June 16, https://arxiv.org/abs/1706.05125.
  • Liao P, Qi Z, Wan R, Klasnja P, Murphy SA (2022) Batch policy learning in average reward Markov decision processes. Ann. Statist. 50(6):3364.
  • Liao L, Fu Z, Yang Z, Wang Y, Ma D, Kolar M, Wang Z (2024) Instrumental variable value iteration for causal offline reinforcement learning. J. Machine Learn. Res. 25(303):1–56.
  • Littman ML (2001) Value-function reinforcement learning in Markov games. Cognitive Systems Res. 2(1):55–66.
  • Lu M, Min Y, Wang Z, Yang Z (2022) Pessimism in the face of confounders: Provably efficient offline reinforcement learning in partially observable Markov decision processes. Proc. 39th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 162 (PMLR, New York), 8016–8038.
  • Luo Q, Saigal R, Chen Z, Yin Y (2019) Accelerating the adoption of automated vehicles by subsidies: A dynamic games approach. Transportation Res. Part B Methodological 129:226–243.
  • Mertikopoulos P, Zhou Z (2019) Learning in games with continuous action sets and unknown payoff functions. Math. Programming 173(1):465–507.
  • Miao R, Qi Z, Zhang X (2022) Off-policy evaluation for episodic partially observable Markov decision processes under non-parametric models. Oh AH, Agarwal A, Belgrave D, Cho K, eds. Advances in Neural Information Processing Systems (OpenReview.net).
  • Namkoong H, Keramati R, Yadlowsky S, Brunskill E (2020) Off-policy policy evaluation for sequential decisions under unobserved confounding. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY), 18819–18831.
  • Pearl J (2009) Causality: Models, Reasoning and Inference, 2nd ed. (Cambridge University Press, Cambridge, MA).
  • Pérolat J, Piot B, Pietquin O (2018) Actor-critic fictitious play in simultaneous move multistage games. Storkey A, Perez-Cruz F, eds. Proc. Twenty-First Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 84 (PMLR, New York), 919–928.
  • Pérolat J, Piot B, Scherrer B, Pietquin O (2016b) On the use of non-stationary strategies for solving two-player zero-sum Markov games. Gretton A, Robert CC, eds. Proc. 19th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 51 (PMLR, New York), 893–901.
  • Pérolat J, Scherrer B, Piot B, Pietquin O (2015) Approximate dynamic programming for two-player zero-sum Markov games. Bach F, Blei D, eds. Proc. 32nd Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 37 (PMLR, New York), 1321–1329.
  • Pérolat J, Piot B, Geist M, Scherrer B, Pietquin O (2016a) Softened approximate policy iteration for Markov games. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 48 (PMLR, New York), 1860–1868.
  • Peters J, Janzing D, Schölkopf B (2017) Elements of Causal Inference: Foundations and Learning Algorithms (MIT Press, Cambridge, MA).
  • Sabater J, Sierra C (2005) Review on computational trust and reputation models. Artificial Intelligence Rev. 24(1):33–60.
  • Shi C, Uehara M, Jiang N (2022a) A minimax learning approach to off-policy evaluation in partially observable Markov decision processes. Chaudhuri K, Jegelka S, Song L, Szepesvari C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 162 (PMLR, New York), 20057–20094.
  • Shi C, Zhu J, Ye S, Luo S, Zhu H, Song R (2022b) Off-policy confidence interval estimation with confounded Markov decision process. J. Amer. Statist. Assoc. 119(545):273–284.
  • Srinivasan S, Lanctot M, Zambaldi V, Pérolat J, Tuyls K, Munos R, Bowling M (2018) Actor-critic policy optimization in partially observable multiagent environments. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates Inc., Red Hook, NY).
  • Sun R (2006) Cognition and Multi-Agent Interaction: From Cognitive Modeling to Social Simulation (Cambridge University Press, Cambridge, UK).
  • Sutton RS, McAllester D, Singh S, Mansour Y (1999) Policy gradient methods for reinforcement learning with function approximation. Solla S, Leen T, Müller K, eds. Advances in Neural Information Processing Systems, vol. 12 (MIT Press, Cambridge, MA).
  • Tchetgen ET, Sun B, Walter S (2021) The GENIUS approach to robust Mendelian randomization inference. Statist. Sci. 36(3):443–464.
  • Wang L, Tchetgen Tchetgen E (2018) Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. J. Roy. Statist. Soc. Ser. B Statist. Methodology 80(3):531–550.
  • Wang L, Yang Z, Wang Z (2021) Provably efficient causal reinforcement learning with confounded observational data. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Advances in Neural Information Processing Systems, vol. 34 (Curran Associates Inc., Red Hook, NY), 21164–21175.
  • Xie T, Cheng C-A, Jiang N, Mineiro P, Agarwal A (2021) Bellman-consistent pessimism for offline reinforcement learning. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Advances in Neural Information Processing Systems, vol. 34 (Curran Associates Inc., Red Hook, NY), 6683–6694.
  • Yan Z, Jouandeau N, Cherif AA (2023) A survey and analysis of multi-robot coordination. Internat. J. Adv. Robotic Systems 10(12):399.
  • Zhong H, Yang Z, Wang Z, Jordan MI (2023) Can reinforcement learning find Stackelberg-Nash equilibria in general-sum Markov games with myopic followers? J. Machine Learn. Res. 24(35):1–52.