Variance-Penalized Markov Decision Processes

Published Online:https://doi.org/10.1287/moor.14.1.147

We consider a Markov decision process with both the expected limiting average, and the discounted total return criteria, appropriately modified to include a penalty for the variability in the stream of rewards. In both cases we formulate appropriate nonlinear programs in the space of state-action frequencies (averaged, or discounted) whose optimal solutions are shown to be related to the optimal policies in the corresponding “variance-penalized MDP.” The analysis of one of the discounted cases is facilitated by the introduction of a “Cartesian product of two independent MDPs.”

INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.