Multiple Policy Improvements in Undiscounted Markov Renewal Programming
Abstract
This paper examines, for undiscounted unichain Markov renewal programming, both the Hastings policy-value iteration algorithm and the case of multiple policy improvements between each policy evaluation. The modified policy improvement procedure proposed by Hastings either increases the gain rate or maintains it, and has a larger value improvement in some transient state than in all recurrent states. This prevents cycling and ensures convergence of the policy-value iteration algorithm. Multiple policy improvements, using either the unmodified or modified policy-improvement procedure, are shown to settle ultimately upon higher-gain policies, if any exist. The iterated policy improvements, each time using the improved values, also lead to upper and lower bounds on the maximal gain rate.

