Global Algorithms for Mean-Variance Optimization in Markov Decision Processes
Abstract
Dynamic optimization of mean and variance in Markov decision processes (MDPs) is a long-standing challenge caused by the failure of dynamic programming. In this paper, we propose a new approach to finding the globally optimal policy for combined metrics of steady-state mean and variance in an infinite-horizon undiscounted MDP. By introducing the concepts of pseudo mean and pseudo variance, we convert the original problem to a bilevel MDP problem, where the inner one is a standard MDP optimizing pseudo mean-variance, and the outer one is a single-parameter selection problem optimizing pseudo mean. We use the sensitivity analysis of MDPs to derive the properties of this bilevel problem. By solving inner standard MDPs for pseudo mean-variance optimization, we can identify worse policy spaces dominated by optimal policies of the pseudo problems. We propose an optimization algorithm that can find the globally optimal policy by repeatedly removing worse policy spaces. The convergence and complexity of the algorithm are studied. Another policy dominance property is also proposed to further improve the algorithm efficiency. Numerical experiments demonstrate the performance and efficiency of our algorithms. To the best of our knowledge, our algorithm is the first that efficiently finds the globally optimal policy of mean-variance optimization in MDPs. Our results are also valid for solely minimizing the variance metrics and can shed light on solving other varied forms of mean-variance MDPs.
Funding: This research was supported in part by the National Key Research and Development Program of China [Grant 2022YFA1004600], the National Natural Science Foundation of China [Grants 72342006 and 72371253], the Guangdong Basic and Applied Basic Research Foundation [Grants 2023A1515012492 and 2023B1515040001], the Regional Joint Foundation of Guangdong [Grant 2022A1515110725], and the Guangdong Province Key Laboratory of Computational Science at the Sun Yat-sen University. The work described in this paper was also partially supported by the InnoHK Initiative, the Government of the HKSAR, and the Laboratory for AI-Powered Financial Technologies.

