Offline Reinforcement Learning for Human-Guided Human-Machine Interaction with Private Information

Zuyue Fu
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60208

Zhengling Qi (Corresponding Author)
https://orcid.org/0000-0003-0270-7969
Department of Decision Sciences, George Washington University, Washington, District of Columbia 20052

Zhuoran Yang
Department of Statistics and Data Science, Yale University, New Haven, Connecticut 06511

Zhaoran Wang
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60208

Lan Wang
https://orcid.org/0000-0002-3217-0202
Department of Management Science, University of Miami, Coral Gables, Florida 33146

Published Online: 6 Aug 2025
https://doi.org/10.1287/mnsc.2022.04112

Volume 72, Issue 1

January 2026

Pages 1-782, iv-vi

Article Information

Received: December 21, 2022
Accepted: December 27, 2024
Published Online: August 06, 2025

Cite as

Zuyue Fu, Zhengling Qi, Zhuoran Yang, Zhaoran Wang, Lan Wang (2025) Offline Reinforcement Learning for Human-Guided Human-Machine Interaction with Private Information. Management Science 72(1):646-666.

https://doi.org/10.1287/mnsc.2022.04112

Acknowledgments

The authors thank the department editor, associate editor, and three reviewers for helpful comments, constructive suggestions, and insightful feedback that significantly improved this manuscript.
