Variability Sensitive Markov Decision Processes

Published Online:https://doi.org/10.1287/moor.17.3.558

Considered are time-average Markov Decision Processes (MDPs) with finite state and action spaces. Two definitions of variability are introduced, namely, the expected time-average variability and time-average expected variability. The two criteria are in general different, although they can both be employed to penalize for variance in the stream of rewards. For communicating MDPs, we construct a (randomized) stationary policy that is ε-optimal for both criteria; the policy is optimal and pure for a specific variability function. For general multichain MDPs, a state space decomposition leads to a similar result for the expected time-average variability. We also consider the problem of the decision maker choosing the initial state along with the policy.

INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.