Offline Reinforcement Learning for Human-Guided Human-Machine Interaction with Private Information

