Likelihood Ratio Gradient Estimation for Steady-State Parameters

We consider a discrete-time Markov chain $\boldsymbol{\Phi}$ on a general state-space ${\sf X}$, whose transition probabilities are parameterized by a real-valued vector $\boldsymbol{\theta}$. Under the assumption that $\boldsymbol{\Phi}$ is geometrically ergodic with corresponding stationary distribution $\pi(\boldsymbol{\theta})$, we are interested in estimating the gradient $\nabla \alpha(\boldsymbol{\theta})$ of the steady-state expectation $$\alpha(\boldsymbol{\theta}) = \pi( \boldsymbol{\theta}) f.$$ To this end, we first give sufficient conditions for the differentiability of $\alpha(\boldsymbol{\theta})$ and for the calculation of its gradient via a sequence of finite horizon expectations. We then propose two different likelihood ratio estimators and analyze their limiting behavior.


Introduction
Consider a discrete-time Markov chain Φ = {Φ_k : k ≥ 0} on a general state space X whose transition kernel P(θ) = {P(θ, x, A) : x ∈ X, A ⊆ X} is parameterized by a vector θ ∈ Θ ⊆ R^d of continuous parameters. We assume that Φ has a unique invariant distribution π(θ) = {π(θ, A) : A ⊆ X} and that we are interested in computing the gradient of α(θ) = π(θ)f at a specific point θ_0 ∈ Θ, for some function f such that π(θ)|f| < ∞. We consider throughout the paper the geometrically ergodic case, where the conditions for the existence of the gradient ∇α(θ) can be stated more concisely and are easier to verify. We then focus on the analysis of two different likelihood ratio estimators, exhibiting desirable limiting behavior, that can be used to approximate the gradient. We refer the reader to [13] for a more thorough analysis of the existence of the gradient under more general conditions. We now give an informal description of the type of estimators that we study; the necessary assumptions will be made precise in the following section. From the strong law of large numbers we can expect that when Φ evolves according to P(θ), then $$\alpha_n := \frac{1}{n}\sum_{j=1}^n f(\Phi_j) \to \alpha(\theta) \quad \text{a.s.}$$ as n → ∞ for any initial distribution. Moreover, if we assume that there exists a family of densities {p(θ, x, y) : x, y ∈ X} such that the transition probabilities satisfy P(θ, x, dy) = p(θ, x, y)P(θ_0, x, dy), then we can construct the likelihood ratio $$L_n(\theta) = \prod_{j=1}^n p(\theta, \Phi_{j-1}, \Phi_j), \qquad n \geq 1,$$ and use it to compute the expectation of α_n via the identity $$E_\theta[\alpha_n] = E_{\theta_0}[\alpha_n L_n(\theta)]; \tag{1.1}$$ here E_θ[·] denotes the expectation with respect to the transition probabilities P(θ) when the chain is started according to some fixed distribution µ. Details regarding this identity can be found, for example, in [4], Theorem 1.
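As an illustration, identity (1.1) can be checked by Monte Carlo on a toy chain. The Gaussian AR(1) model below, together with all of its parameter choices, is an assumption of this sketch and not part of the paper's general setting:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy chain (sketch assumption): Phi_k = rho*Phi_{k-1} + theta + eps_k, eps_k ~ N(0,1).
# The density ratio is p(theta, x, y) = q_theta(y|x) / q_theta0(y|x), with
# q_theta(y|x) the N(rho*x + theta, 1) density, and we take f(x) = x.
rho, theta0, theta, n, reps = 0.5, 0.0, 0.2, 20, 100_000

def simulate(th):
    """Simulate `reps` independent paths of length n under parameter th."""
    eps = rng.standard_normal((reps, n))
    phi = np.empty((reps, n))
    x = np.zeros(reps)
    for k in range(n):
        x = rho * x + th + eps[:, k]
        phi[:, k] = x
    return phi, eps

# Direct estimate of E_theta[alpha_n]: simulate under theta.
phi_t, _ = simulate(theta)
direct = phi_t.mean()

# Likelihood ratio estimate: simulate under theta0 and reweight by L_n(theta).
phi_0, eps0 = simulate(theta0)
# log p(theta, x, y) = -(y - rho*x - theta)^2/2 + (y - rho*x - theta0)^2/2,
# and along a path simulated under theta0 we have y - rho*x - theta0 = eps0.
log_L = (-((eps0 - (theta - theta0)) ** 2) / 2 + eps0**2 / 2).sum(axis=1)
lr = (phi_0.mean(axis=1) * np.exp(log_L)).mean()

print(direct, lr)  # two estimates of the same quantity E_theta[alpha_n]
```

Both columns estimate E_θ[α_n], the first by sampling under P(θ) and the second by sampling under P(θ_0) and reweighting, so they should agree up to Monte Carlo error.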
Next, provided we have uniform integrability, we would have that E_θ[α_n] → α(θ) as n → ∞, and if we can further justify the exchange of derivative and expectation, then $$\nabla \alpha(\theta_0) = \lim_{n\to\infty} \nabla E_\theta[\alpha_n]\big|_{\theta=\theta_0} = \lim_{n\to\infty} E_{\theta_0}[\alpha_n \nabla L_n(\theta_0)]. \tag{1.2}$$ We point out that nE_{θ_0}[α_n] also represents the finite horizon total cost incurred by Φ, and therefore the calculation of its gradient is interesting in its own right, i.e., not only for its relation to ∇α(θ). For details on the estimation of gradients via likelihood ratios and other methods, as well as a variety of applications in finance, operations research and engineering, we refer the reader to [10, 3, 2].
The observations made above suggest that one could think of using α_n∇L_n(θ_0) as an estimator for ∇α(θ_0). Unfortunately, α_n∇L_n(θ_0) fails to converge as n → ∞; in fact, under some additional assumptions, n^{-1/2}α_n∇L_n(θ_0) converges in distribution to a multivariate normal random variable (see Proposition 3.2). The first of our two proposed estimators, described in detail in Section 3, uses ∇L_n(θ_0) as a control variate to reduce the variance of α_n∇L_n(θ_0). The resulting estimator, after choosing the optimal control variate coefficient, is given by $$(\alpha_n - \alpha(\theta_0))\,\nabla L_n(\theta_0), \tag{1.3}$$ and is shown to converge in Proposition 3.3. An estimator of this type has been shown in [6] to be very successful in practice, where it was used to compute the sensitivities in reaction networks.
Our second estimator, described in detail in Section 4, exploits the martingale structure of ∇L_n(θ_0) to obtain an alternative representation for E_{θ_0}[α_n∇L_n(θ_0)] as the expectation of a discrete stochastic integral, given in (1.4), where the {D_k} are the martingale differences in R^d satisfying ∇L_n(θ_0) = Σ_{k=1}^n D_k. As is the case with α_n∇L_n(θ_0), this estimator fails to converge on its own (see Proposition 4.1), but it can be dramatically improved by centering it with respect to α(θ_0); the optimized estimator takes the form given in (1.5). Moreover, the analysis of the asymptotic variance of (1.3) and (1.5), included in Section 5, shows that (1.5) is a better estimator than (1.3).
The first part of the paper establishes sufficient conditions on the Markov chain Φ and the function f under which ∇α(θ_0) exists and the following limit holds: $$E_{\theta_0}[\alpha_n \nabla L_n(\theta_0)] \to \nabla\alpha(\theta_0), \qquad n \to \infty. \tag{1.6}$$ Conditions under which the exchange of derivative and expectation in (1.2) is valid can be found in [11], so we will not focus on this point. Once the convergence in (1.6) is established in Section 2, we move on to the analysis of α_n∇L_n(θ_0) and the control variates estimator given in (1.3); the corresponding limit theorems are stated in Section 3. The limit theorems for the integral-type estimators given in (1.4) and (1.5) are included in Section 4. To conclude the expository part of the paper, we compute in Section 5 the asymptotic variance of our two proposed estimators. Finally, Section 6 contains the majority of the proofs.

The model
We consider throughout the paper a discrete-time Markov chain Φ = {Φ_k : k ≥ 0} on a general state space X equipped with a countably generated σ-field B(X), governed by the transition kernel P(θ) = {P(θ, x, A) : x ∈ X, A ⊆ X}, where the parameter θ takes values in a set Θ ⊆ R^d. Under the conditions given below, the Markov chain will possess a unique stationary distribution π(θ) = {π(θ, A) : A ⊆ X}, and we are interested in estimating the gradient of α(θ) = π(θ)f at some fixed point θ_0 ∈ Θ, for some function f : X → R such that π(θ)|f| < ∞.
In terms of notation, we use νf to denote the expectation of f with respect to the measure ν, that is, $$\nu f = \int_{\sf X} f(x)\, \nu(dx).$$ Similarly, for any Markov transition kernel P we use $$Pf(x) = \int_{\sf X} f(y)\, P(x, dy).$$ Whenever the context is clear we denote the gradient of a function g at the point θ_0 by ∇g(θ_0), and when confusion may arise we will use the more precise notation ∇g(θ)|_{θ=θ_0}. The convention is to think of vectors as column vectors and to use x′ to denote the transpose of x.
Before giving the main set of assumptions for the Markov chain Φ we include for completeness some basic norm definitions.
Definition 2.3. For a positive function V : X → [1, ∞) we define the V-operator norm distance between two Markov transition kernels P_1 and P_2 as $$|||P_1 - P_2|||_V = \sup_{x \in {\sf X}} \sup_{|g| \leq V} \frac{|P_1 g(x) - P_2 g(x)|}{V(x)}.$$ Note: it can be shown that the h-operator norm distance can be written in terms of the h-total variation norm as $$|||P_1 - P_2|||_h = \sup_{x \in {\sf X}} \frac{\|P_1(x, \cdot) - P_2(x, \cdot)\|_h}{h(x)}.$$
We can now state a set of sufficient conditions that will guarantee that ∇α(θ_0) exists and that (1.6) holds.
Assumption 2.4. For some ε > 0:
i) Φ is ψ-irreducible for all θ ∈ B_ε(θ_0).
ii) For all θ ∈ B_ε(θ_0) there exist densities p(θ, x, y), differentiable at θ_0, such that P(θ, x, dy) = p(θ, x, y)P(θ_0, x, dy).
iii) There exist a set K ∈ B(X), δ > 0, m ∈ N and a probability measure ν such that $$P^m(\theta, x, A) \geq \delta\,\nu(A), \qquad x \in K,\ A \in B({\sf X}),$$ for all θ ∈ B_ε(θ_0).
iv) For the set K above there exist a function V : X → [1, ∞) and constants 0 < λ < 1, b < ∞ such that $$P(\theta)V(x) \leq \lambda V(x) + b\,\mathbf{1}_K(x), \qquad x \in {\sf X},$$ for all θ ∈ B_ε(θ_0).
v) Let g_{ij}(x) = ∫_X (∂/∂θ_i)p(θ, x, y)|_{θ=θ_0} (∂/∂θ_j)p(θ, x, y)|_{θ=θ_0} P(θ_0, x, dy), and let e_i be the vector that has a 1 in the ith component and zeros elsewhere. Assume that for each 1 ≤ i ≤ d the corresponding difference quotients of p along e_i are suitably dominated, for all x ∈ X and all k ∈ N.
vi) |g_{ii}|_V < ∞ for each 1 ≤ i ≤ d.
Note: a set K satisfying Assumption 2.4(iii) is said to be a small set.
The conditions in Assumption 2.4, which essentially impose geometric ergodicity of the chain Φ (see e.g. [12]), are not necessary for the main convergence result of this section (Theorem 2.10), but have the advantage of allowing us to keep the arguments concise and focus on the estimators in the following sections. A similar set of conditions has been used in [8] (see Section 4.1). More general conditions ensuring the existence of the gradient outside of the geometric ergodicity setting can be found in [7] and, more recently, in [13].
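To make the drift condition iv) concrete, it can be verified numerically on a toy Gaussian AR(1) kernel; the model and the particular choices of V, λ, b and K below are assumptions of this sketch. For Φ_k = ρΦ_{k-1} + θ + ε_k with ε_k ~ N(0, 1) and V(x) = 1 + x², a direct computation gives P(θ)V(x) = 2 + (ρx + θ)², so the inequality P(θ)V ≤ λV + b·1_K can be checked on a grid:

```python
import numpy as np

# Sketch choices: rho = 0.5, theta ranging over [-0.1, 0.1], V(x) = 1 + x^2.
rho = 0.5
V = lambda x: 1 + x**2
# For the Gaussian AR(1) kernel, P(theta)V(x) = E[1 + (rho*x + theta + eps)^2]
#                                            = 2 + (rho*x + theta)^2.
PV = lambda x, theta: 2 + (rho * x + theta) ** 2

lam, b = 0.6, 5.0
K = 3.0  # the small set is taken to be [-K, K]

xs = np.linspace(-50, 50, 200_001)
ok = True
for theta in (-0.1, 0.0, 0.1):  # endpoints suffice since PV is monotone in |theta|
    lhs = PV(xs, theta)
    rhs = lam * V(xs) + b * (np.abs(xs) <= K)
    ok &= bool(np.all(lhs <= rhs))

print(ok)  # the drift inequality holds at every grid point
```

Outside [-K, K] the quadratic inequality 2 + (ρ|x| + 0.1)² ≤ 0.6(1 + x²) holds for all |x| ≥ 2.16, so K = 3 is comfortably large enough; the grid check is only an illustration, not a proof.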
We will now proceed to give some properties of Φ, for which we will need the following definition. Proofs not included immediately after the corresponding statement can be found in Section 6.
Definition 2.6. We say that the Markov chain Φ is h-ergodic if h : X → [1, ∞) and
i) {Φ_k : k ≥ 0} is positive Harris recurrent with invariant probability π;
ii) the expectation πh is finite;
iii) for every initial condition x ∈ X, lim_{k→∞} ||P^k(x, ·) − π||_h = 0.
The main idea behind the analysis of the gradient of the likelihood ratio L_n(θ) is that, under appropriate conditions, each of its components is a square integrable martingale with respect to the filtration generated by Φ. The next lemma makes this statement precise; its proof can be found in Section 6.
The analysis of α_n∇L_n(θ_0) and of its expectation is based on a second martingale, one constructed via a solution f̂ to Poisson's equation $$\hat f(x) - P(\theta_0)\hat f(x) = f(x) - \alpha(\theta_0).$$ Note that if this solution exists then the centered estimator α_n − α(θ_0) can be written as follows: $$\alpha_n - \alpha(\theta_0) = \frac{1}{n}\Big( Z_n + P(\theta_0)\hat f(\Phi_0) - P(\theta_0)\hat f(\Phi_n) \Big), \qquad Z_n = \sum_{j=1}^n \big( \hat f(\Phi_j) - P(\theta_0)\hat f(\Phi_{j-1}) \big),$$ where the increments f̂(Φ_j) − P(θ_0)f̂(Φ_{j-1}) can be shown to be martingale differences. It follows that, provided the boundary terms are negligible, E_{θ_0}[(α_n − α(θ_0))∇L_n(θ_0)] is essentially the expectation of a product of two martingales. The lemma below gives precise properties of this second martingale.
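For completeness, the telescoping computation behind this construction can be written out; assuming a solution f̂ of Poisson's equation f̂ − P(θ_0)f̂ = f − α(θ_0) exists, one has

```latex
\begin{align*}
\sum_{j=1}^n \big(f(\Phi_j) - \alpha(\theta_0)\big)
  &= \sum_{j=1}^n \big(\hat f(\Phi_j) - P(\theta_0)\hat f(\Phi_j)\big) \\
  &= \underbrace{\sum_{j=1}^n \big(\hat f(\Phi_j) - P(\theta_0)\hat f(\Phi_{j-1})\big)}_{=:\,Z_n}
     \;+\; \sum_{j=1}^n \big(P(\theta_0)\hat f(\Phi_{j-1}) - P(\theta_0)\hat f(\Phi_j)\big) \\
  &= Z_n + P(\theta_0)\hat f(\Phi_0) - P(\theta_0)\hat f(\Phi_n),
\end{align*}
```

and each increment of Z_n has zero conditional expectation given F_{j-1}, since E_{θ_0}[f̂(Φ_j) | F_{j-1}] = P(θ_0)f̂(Φ_{j-1}).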

A first likelihood-ratio estimator
In view of Theorem 2.10, the remainder of the paper is devoted to the analysis of potential estimators for E_{θ_0}[α_n∇L_n(θ_0)]. An obvious first choice would be to consider α_n∇L_n(θ_0) itself. Unfortunately, as mentioned in the introduction, α_n∇L_n(θ_0) does not converge to an a.s. finite random variable; in fact, under additional assumptions, n^{-1/2}α_n∇L_n(θ_0) converges in distribution to a multivariate normal random vector, which implies that α_n∇L_n(θ_0) diverges at rate n^{1/2}. This observation is a simple consequence of the following weak convergence result, which will also be helpful in the analysis of the estimators considered in Section 4.
Throughout the rest of the paper let D([0, 1], R^d) denote the space of right-continuous R^d-valued functions on [0, 1] with left limits, equipped with the standard Skorohod topology; we use ⇒ to denote weak convergence. From now on, the Markov chain Φ is always assumed to evolve according to P(θ_0).
In view of this theorem we have the following result for n^{-1/2}α_n∇L_n(θ_0).
Proposition 3.2. Suppose that Assumption 2.4 is satisfied. Then, as n → ∞, $$n^{-1/2}\,\alpha_n \nabla L_n(\theta_0) \Rightarrow \alpha(\theta_0) Z,$$ where Z is a d-dimensional multivariate normal random vector having mean zero and covariance matrix Σ = (σ_{ij}), σ_{ij} = π(θ_0)g_{ij}.
Proof. By Lemma 2.7, Φ is V-ergodic, and since |f|_{√V} < ∞ we have π(θ_0)|f| < ∞. Then, by Theorem 17.0.1 in [12], α_n → α(θ_0) a.s. as n → ∞. It follows by Slutsky's lemma that n^{-1/2}α_n∇L_n(θ_0) = α_n · n^{-1/2}∇L_n(θ_0) ⇒ α(θ_0)Z.
Since α_n∇L_n(θ_0) does not converge as n → ∞, we can define a new estimator with smaller variance by using ∇L_n(θ_0) as a control variate; that is, we seek an estimator of the form $$Y(C) = \alpha_n \nabla L_n(\theta_0) - C\,\nabla L_n(\theta_0).$$ Our goal is to minimize the so-called generalized variance of Y(C), defined as the determinant of its covariance matrix Σ_{Y(C)}. The optimal choice for C is the covariance between the target and the control multiplied by the inverse of the covariance matrix of the control (see [14]). In the notation of Proposition 3.2, it can be shown (following the same arguments used in the proof of Theorem 3.1) that the sample coefficients satisfy C*_n → α(θ_0)ΣΣ^{-1} = α(θ_0)I as n → ∞, where I is the identity matrix of R^{d×d}. We then have that α(θ_0)I is the asymptotically optimal choice for the control variate coefficient and our new suggested estimator is Y(C*_n) = (α_n − α(θ_0))∇L_n(θ_0). Using again Theorem 3.1 we obtain the following convergence result.
Proof. By Theorem 3.1 we have (n^{1/2}(α_n − α(θ_0)), n^{-1/2}∇L_n(θ_0)) ⇒ (Z_0, Ẑ) as n → ∞. Then, by the continuous mapping theorem, $$Y(C_n^*) = n^{1/2}(\alpha_n - \alpha(\theta_0)) \cdot n^{-1/2}\nabla L_n(\theta_0) \Rightarrow Z_0 \hat Z.$$ We conclude that Y(C*_n) has the desired convergence properties and is a suitable estimator for E_{θ_0}[α_n∇L_n(θ_0)]. In the next section we consider other alternatives.
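As a sanity check (not taken from the paper), the raw estimator α_n∇L_n(θ_0) and the centered estimator (α_n − α(θ_0))∇L_n(θ_0) can be compared on a toy Gaussian AR(1) chain, where everything is available in closed form; all model choices here (ρ, θ_0, f(x) = x) are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain (sketch assumption): Phi_k = rho*Phi_{k-1} + theta0 + eps_k, eps_k ~ N(0,1),
# with f(x) = x, so alpha(theta) = theta/(1-rho) and grad alpha = 1/(1-rho).
rho, theta0, n, reps = 0.5, 0.5, 400, 4000
alpha0 = theta0 / (1 - rho)    # alpha(theta_0) = 1
grad_alpha = 1.0 / (1 - rho)   # target gradient = 2

eps = rng.standard_normal((reps, n))
phi = np.empty((reps, n))
x = np.full(reps, alpha0)      # start at the stationary mean
for k in range(n):
    x = rho * x + theta0 + eps[:, k]
    phi[:, k] = x

alpha_n = phi.mean(axis=1)
# For Gaussian transitions the score is d/dtheta log p = eps_k,
# hence grad L_n(theta_0) = sum_k eps_k.
grad_L = eps.sum(axis=1)

raw = alpha_n * grad_L                  # diverges at rate sqrt(n)
centered = (alpha_n - alpha0) * grad_L  # control-variate estimator

print("mean of centered estimator:", centered.mean())  # approx grad_alpha = 2
print("std raw vs centered:", raw.std(), centered.std())
```

The centered estimator is approximately unbiased for ∇α(θ_0), while its sampling spread stays bounded in n instead of growing like n^{1/2}.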

An integral-type estimator
As mentioned in the introduction, our second proposed estimator is obtained by first deriving an alternative representation for E_{θ_0}[α_n∇L_n(θ_0)] in terms of a discrete stochastic integral. More precisely, we exploit the martingale properties of ∇L_n(θ_0) to obtain a representation of the form E_{θ_0}[α_n∇L_n(θ_0)] = E_{θ_0}[Y_n], where Y_n is a discrete stochastic integral with respect to the martingale differences {D_k}, and we then propose Y_n as an estimator for E_{θ_0}[α_n∇L_n(θ_0)].
Unfortunately, just like the estimator α_n∇L_n(θ_0), the estimator Y_n defined above fails to converge to an a.s. finite random vector. This is again a consequence of Theorem 3.1.
where B is a d-dimensional mean zero Brownian motion with covariance matrix Σ = (σ_{ij}), with σ_{ij} = π(θ_0)g_{ij} for 1 ≤ i, j ≤ d, and I is the identity matrix of R^{d×d}.
As before, we can try to solve the problem of the lack of convergence of Y_n by using a centered estimator Y*_n, obtained by replacing f with f − α(θ_0) in the discrete stochastic integral. This modification turns out to be the right one, and we obtain the following convergence result for this new estimator.
Proposition 4.2. Suppose that Assumption 2.4 is satisfied and define for 0 ≤ i, j ≤ d the functions g_{ij} according to Theorem 3.1. Then, $$Y_n^* \Rightarrow \int_0^1 \big(B_0(1) - B_0(s)\big)\, I\, dB(s), \qquad n \to \infty,$$ where B = (B_0, B_1, …, B_d)′ is the Brownian motion of Theorem 3.1 and I is the identity matrix of R^{d×d}.
Proof. Define S_n(t) = n^{-1/2} Σ_{j=1}^{⌊nt⌋} (f(Φ_j) − α(θ_0)) and Ŵ_n(t) = n^{-1/2} Σ_{j=⌊nt⌋+1}^{n} (f(Φ_j) − α(θ_0)), with the convention that Ŵ_n(1) ≡ 0. It follows that Ŵ_n(t) = S_n(1) − S_n(t), and Theorem 3.1 together with the continuous mapping theorem gives Ŵ_n ⇒ B_0(1) − B_0(·) in D([0, 1], R). Next, define the processes X_n(t) = Ŵ_n(t)I and Z_n(t) = n^{-1/2}∇L_{⌊nt⌋}(θ_0). The same steps used in the proof of Proposition 4.1 show that the conditions of Theorem 2.7 of [9] are satisfied, and we obtain that (X_n, Z_n, ∫X_n dZ_n) converges weakly to (X, Z, ∫X dZ), with X(t) = (B_0(1) − B_0(t))I and Z = B. It follows that Y*_n is a suitable estimator for E_{θ_0}[α_n∇L_n(θ_0)]. It remains to compare Y*_n to Y(C*_n) from Section 3.
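The centered integral-type estimator can also be sketched numerically. The concrete form used below, Y*_n = n^{-1} Σ_{k=1}^n (Σ_{j=k}^n (f(Φ_j) − α(θ_0))) D_k, is one natural reading of the discrete stochastic integral and is an assumption of this sketch, again demonstrated on a toy Gaussian AR(1) chain:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy chain (sketch assumption): Phi_k = rho*Phi_{k-1} + theta0 + eps_k, f(x) = x,
# alpha(theta0) = theta0/(1-rho), grad alpha(theta0) = 1/(1-rho).
rho, theta0, n, reps = 0.5, 0.5, 400, 4000
alpha0 = theta0 / (1 - rho)
grad_alpha = 1.0 / (1 - rho)

eps = rng.standard_normal((reps, n))   # eps_k is also the score D_k here
phi = np.empty((reps, n))
x = np.full(reps, alpha0)
for k in range(n):
    x = rho * x + theta0 + eps[:, k]
    phi[:, k] = x

centered_f = phi - alpha0
# tails[:, k] = sum_{j=k}^{n} (f(Phi_j) - alpha0), via a reversed cumulative sum
tails = np.cumsum(centered_f[:, ::-1], axis=1)[:, ::-1]
y_star = (tails * eps).sum(axis=1) / n          # integral-type estimator Y*_n

# control-variate estimator of Section 3, for comparison
y_cv = centered_f.mean(axis=1) * eps.sum(axis=1)

print("means:", y_star.mean(), y_cv.mean())     # both approx grad_alpha = 2
print("variances:", y_star.var(), y_cv.var())
```

Both sample means approximate ∇α(θ_0) = 2, and the empirical variance of Y*_n comes out markedly smaller than that of the control-variate estimator, in line with the comparison of Section 5.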

Computation of the asymptotic variance
The two previous sections provide details on two potential estimators for E_{θ_0}[α_n∇L_n(θ_0)], namely the control variate estimator Y(C*_n) = (α_n − α(θ_0))∇L_n(θ_0) of Section 3 and the centered integral-type estimator Y*_n of Section 4. Both of these estimators have the property, under Assumption 2.4, that their expectation converges to ∇α(θ_0), i.e., E_{θ_0}[Y(C*_n)] → ∇α(θ_0) and E_{θ_0}[Y*_n] → ∇α(θ_0) as n → ∞ (Theorem 2.10), and unlike the estimators given in (3.1) and (4.1), they converge to a proper limiting distribution (Propositions 3.3 and 4.2). For comparison purposes we compute in this section the variance of these limiting distributions.
First, by Proposition 3.3 we have Y(C*_n) ⇒ Z_0Ẑ, where Z = (Z_0, Z_1, …, Z_d)′ is a (d + 1)-dimensional multivariate normal with covariance matrix Σ and Ẑ = (Z_1, …, Z_d)′. Therefore, by Isserlis' theorem, the (i, j)th component, 1 ≤ i, j ≤ d, of the limiting distribution's covariance matrix A is given by $$A_{ij} = \mathrm{Cov}(Z_0 Z_i, Z_0 Z_j) = \sigma_{00}\sigma_{ij} + \sigma_{0i}\sigma_{0j}.$$ Since the calculation of the covariance of the limiting distribution of Y*_n is somewhat lengthier, we state the result in the following lemma and postpone the proof to Section 6.
We now compare the generalized variances of the two estimators, that is, the determinants of their covariance matrices A and B. Provided B is positive definite we obtain, writing v = (σ_{01}, …, σ_{0d})′, $$\det(A) = \det(2B + vv') = 2^d \det(B)\det\!\big(I + \tfrac{1}{2}B^{-1}vv'\big) = 2^d \det(B)\big(1 + \tfrac{1}{2}v'B^{-1}v\big),$$ where I is the R^{d×d} identity matrix. Since B is positive definite, so is B^{-1}, and therefore v′B^{-1}v ≥ 0. We conclude that det(A) ≥ 2^d det(B), which suggests that Y*_n is a better estimator for ∇α(θ_0) than Y(C*_n).
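The determinant identity and the resulting inequality det(A) ≥ 2^d det(B) can be checked numerically for a randomly generated positive definite B and an arbitrary vector v; the dimension and the sampling choices below are illustrative assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 4

# Random positive definite B and arbitrary vector v (sketch choices).
M = rng.standard_normal((d, d))
B = M @ M.T + d * np.eye(d)   # positive definite by construction
v = rng.standard_normal(d)

A = 2 * B + np.outer(v, v)    # covariance matrix of the control-variate limit

det_A = np.linalg.det(A)
det_B = np.linalg.det(B)
# Matrix determinant lemma: det(2B + vv') = 2^d det(B) (1 + v' B^{-1} v / 2)
rhs = 2**d * det_B * (1 + 0.5 * v @ np.linalg.solve(B, v))

print(det_A, rhs)                 # equal up to floating point
print(bool(det_A >= 2**d * det_B))
```

The identity is exact (it is the matrix determinant lemma applied to 2B + vv'), so the two printed determinants agree to floating-point precision for any positive definite B.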

Proofs
This last section of the paper contains all the proofs that were not given in the prior sections. The first one corresponds to the martingale properties of ∇L_n(θ_0).
Proof of Lemma 2.8. We start by noting that for any θ ∈ Θ and 1 ≤ i ≤ d, $$\frac{\partial}{\partial\theta_i} L_n(\theta) = L_n(\theta) \sum_{k=1}^n \frac{(\partial/\partial\theta_i)\, p(\theta, \Phi_{k-1}, \Phi_k)}{p(\theta, \Phi_{k-1}, \Phi_k)}.$$ Since L_n(θ_0) ≡ 1, it follows that $$\frac{\partial}{\partial\theta_i} L_n(\theta_0) = \sum_{k=1}^n \frac{\partial}{\partial\theta_i}\, p(\theta_0, \Phi_{k-1}, \Phi_k).$$ Next, note that for any fixed x ∈ X we have ∫_X p(θ, x, y)P(θ_0, x, dy) = P(θ, x, X) = 1. Therefore, $$\int_{\sf X} \frac{\partial}{\partial\theta_i} p(\theta_0, x, y)\, P(\theta_0, x, dy) = \frac{\partial}{\partial\theta_i} \int_{\sf X} p(\theta, x, y)\, P(\theta_0, x, dy)\Big|_{\theta=\theta_0} = 0$$ for all x ∈ X (since the integral is equal to one for all θ), which establishes that M_n := ∇L_n(θ_0) is a martingale. To see that it is square integrable let M_n = (M_n^1, …, M_n^d)′ and note that $$E_{\theta_0}\big[(M_n^i)^2\big] = \sum_{k=1}^n E_{\theta_0}\big[g_{ii}(\Phi_{k-1})\big],$$ which is finite since |g_{ii}|_V < ∞ by Assumption 2.4(vi) and E_{θ_0}[V(Φ_k)] < ∞ for all k. The next proof corresponds to the martingale constructed using the solution to Poisson's equation.
It remains to show that Z_n is a square-integrable martingale. Clearly, $$E_{\theta_0}\big[\hat f(\Phi_k) - P(\theta_0)\hat f(\Phi_{k-1})\,\big|\,\mathcal F_{k-1}\big] = 0,$$ so Z_n is a martingale. To see that it is square-integrable note that $$E_{\theta_0}[Z_n^2] = \sum_{k=1}^n E_{\theta_0}\Big[\big(\hat f(\Phi_k) - P(\theta_0)\hat f(\Phi_{k-1})\big)^2\Big] \leq 2\sum_{k=1}^n \Big( E_{\theta_0}\big[\hat f(\Phi_k)^2\big] + E_{\theta_0}\big[\big(P(\theta_0)\hat f(\Phi_{k-1})\big)^2\big] \Big).$$ Since |f|_{√V} < ∞ and both E_{θ_0}[V(Φ_k)] < ∞ and π(θ_0)V < ∞, the above expression is finite, which completes the proof.
Next, we give the proof of Theorem 2.10, which states that under Assumption 2.4 the expectation of α_n∇L_n(θ_0) converges to ∇α(θ_0).
Proof of Theorem 2.10. Define M_n^i = Σ_{j=1}^n D_j^i, 1 ≤ i ≤ d, as in Lemma 2.8, and Z_n as in Lemma 2.9. By those same lemmas we have that M_n^i and Z_n are square-integrable martingales.
Next, note that, by the decomposition based on Poisson's equation, n(α_n − α(θ_0)) = Z_n + P(θ_0)f̂(Φ_0) − P(θ_0)f̂(Φ_n), so that, since E_{θ_0}[M_n^i] = 0, the expectation E_{θ_0}[α_n M_n^i] reduces to (1/n)E_{θ_0}[Z_n M_n^i] plus a boundary term involving f̂(Φ_n). To show that (1/n)E_{θ_0}[f̂(Φ_n)M_n^i] → 0 as n → ∞, note that by the Cauchy-Schwarz inequality $$\frac{1}{n}\,\Big|E_{\theta_0}\big[\hat f(\Phi_n)M_n^i\big]\Big| \leq \frac{1}{n}\, E_{\theta_0}\big[\hat f(\Phi_n)^2\big]^{1/2}\, E_{\theta_0}\big[(M_n^i)^2\big]^{1/2}.$$ Also, by Lemma 2.9 we have f̂² ≤ c_1V, and by Lemma 2.7 the expectations E_{θ_0}[V(Φ_n)] are bounded uniformly in n.
For the other term we have by Assumption 2.4(vi) that |g_{ii}|_V < ∞, and therefore $$E_{\theta_0}\big[(M_n^i)^2\big] = \sum_{k=1}^n E_{\theta_0}\big[g_{ii}(\Phi_{k-1})\big] = O(n).$$ We conclude that (1/n)E_{θ_0}[f̂(Φ_n)M_n^i] → 0 as n → ∞.
To show the convergence of (1/n)E_{θ_0}[Z_n M_n^i], let h_i(x) = ∫_X f̂(y)(∂/∂θ_i)p(θ_0, x, y) P(θ_0, x, dy) and note that by the Cauchy-Schwarz inequality $$h_i(x)^2 \leq \int_{\sf X} \hat f(y)^2\, P(\theta_0, x, dy) \int_{\sf X} \Big(\frac{\partial}{\partial\theta_i} p(\theta_0, x, y)\Big)^2 P(\theta_0, x, dy) \leq c_1\, P(\theta_0)V(x)\, g_{ii}(x),$$ and therefore |h_i|_V < ∞. It follows from the same arguments used above that $$\frac{1}{n}\, E_{\theta_0}[Z_n M_n^i] = \frac{1}{n} \sum_{k=1}^n E_{\theta_0}\big[h_i(\Phi_{k-1})\big] \to \pi(\theta_0)h_i,$$ with the limit π(θ_0)h_i well-defined and finite. It only remains to show that π(θ_0)h_i = (∂/∂θ_i)α(θ_0). To do this, one expands the difference quotient (α(θ_0 + he_i) − α(θ_0))/h using, in the fifth and seventh steps of the computation, the identity π(θ)P(θ) = π(θ) for all θ ∈ B_ε(θ_0); the leading term converges to π(θ_0)h_i, and there remain two terms, (6.2) and (6.3), whose limits must be shown to be zero. To analyze (6.2), recall that |h_i|_V < ∞, from which it follows that (6.2) vanishes in the limit. To show that (6.3) is zero as well, note that its integrand is dominated by a multiple of V, which combined with π(θ_0)V < ∞ gives that (6.3) vanishes too. This completes the proof.
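As a numerical illustration of the identity π(θ_0)h_i = (∂/∂θ_i)α(θ_0), consider a toy Gaussian AR(1) model (an assumption of this sketch): Φ_k = ρΦ_{k-1} + θ + ε_k with f(x) = x. There α(θ) = θ/(1 − ρ), the Poisson solution is f̂(x) = (x − α(θ_0))/(1 − ρ), and the score is y − ρx − θ_0, so π(θ_0)h can be estimated directly by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy model (sketch assumption): Phi_k = rho*Phi_{k-1} + theta0 + eps_k, eps ~ N(0,1),
# f(x) = x.  Then alpha(theta) = theta/(1-rho), d alpha/d theta = 1/(1-rho),
# fhat(x) = (x - alpha)/(1-rho) solves Poisson's equation, and the score is
# d/dtheta log p(theta0; x, y) = y - rho*x - theta0.
rho, theta0, samples = 0.5, 0.5, 200_000
alpha0 = theta0 / (1 - rho)
grad_alpha = 1.0 / (1 - rho)

# Stationary law of the chain: N(alpha0, 1/(1 - rho^2)).
phi0 = alpha0 + rng.standard_normal(samples) / np.sqrt(1 - rho**2)
eps = rng.standard_normal(samples)
phi1 = rho * phi0 + theta0 + eps

fhat = (phi1 - alpha0) / (1 - rho)
# pi(theta0) h = E_pi[ fhat(Phi_1) * score ], estimated by Monte Carlo
pi_h = np.mean(fhat * eps)

print(pi_h)  # approx grad_alpha = 2
```

In this toy model E[f̂(Φ_1)ε] = 1/(1 − ρ) exactly, which matches ∂α/∂θ, so the Monte Carlo estimate should land near 2.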
The following is the proof of the main weak convergence theorem that is used to describe the behavior of all four estimators considered in Sections 3 and 4. It is essentially an application of the Functional Central Limit Theorem for multivariate martingales found in [15] (see also Theorems 1.4 and 1.2 in Chapter 7 of [1]).
where 0 is the zero vector in R^d. Since Φ is V-ergodic and |f²|_V < ∞, Theorem 17.3.3 in [12] gives the required convergence of the associated averages. It follows by Slutsky's lemma that it suffices to show that X_n ⇒ B in D([0, 1], R^{d+1}). We will do so by showing that X_n satisfies condition (ii) of Theorem 2.1 in [15].
Let ξ_k^n = X_n(k/n) − X_n((k − 1)/n) and consider the matrix A_n = A_n(t) ∈ R^{(d+1)×(d+1)} whose (i, j)th component is the sum of the conditional second moments of the increments ξ_k^n. Then X_n^i(t)X_n^j(t) − A_n^{ij}(t) is a martingale adapted to G_{n,t} for each 0 ≤ i, j ≤ d, and therefore the A_n^{ij} = ⟨X_n^i, X_n^j⟩ are the predictable quadratic-covariation processes of X_n. Since π(θ_0)|g_{ij}| ≤ π(θ_0)(g_{ii}g_{jj})^{1/2} ≤ (π(θ_0)g_{ii})^{1/2}(π(θ_0)g_{jj})^{1/2} < ∞ for each 0 ≤ i, j ≤ d, we have by Theorem 17.0.1 in [12] that the corresponding averages converge a.s. to their stationary means. To see that the last expression converges to zero, let h_m(x) = V(x)1(V(x) > m) and note that E_{θ_0}[h_m(Φ_k)] → π(θ_0)h_m < ∞ as k → ∞, while monotone convergence gives π(θ_0)h_m → 0 as m → ∞; therefore, given δ > 0, we can choose N ∈ N large enough so that π(θ_0)h_N < δ. It follows that the limit superior is bounded by a constant multiple of δ, and since δ > 0 was arbitrary, the limit is zero.

We have, a.s. P(θ_0) as n → ∞, the convergence of the averages appearing in W_n. Moreover, S_n(t) = Σ_{j=1}^{⌊nt⌋}(f(Φ_j) − α(θ_0)) satisfies n^{-1/2}S_n ⇒ B_0 in D([0, 1], R) by Theorem 3.1, with B_0 a mean zero Brownian motion; it follows that W_n(t) − W(t) ⇒ 0 in D([0, 1], R). Also, by Theorem 3.1 again, and since W is a non-random element of D([0, 1], R), the convergence holds jointly. Next, define the processes X_n(t) = W_n(t)I, X(t) = W(t)I, Z_n(t) = n^{-1/2}∇L_{⌊nt⌋}(θ_0), and Z(t) = B(t), where I is the identity matrix of R^{d×d}, and define G_{n,t} = F_{⌊nt⌋}. Clearly, X_n and Z_n are {G_{n,t}}-adapted and Z_n is a {G_{n,t}}-martingale. For each α > 0 let τ_n^α = 2α and note that P_{θ_0}(τ_n^α ≤ α) = 0 ≤ 1/α. Finally, by (6.4) we have (X_n, Z_n) ⇒ (X, Z) in D([0, 1], R^{d×d} × R^d). Therefore, the conditions of Theorem 2.7 of [9] are satisfied and we have $$\Big( X_n, Z_n, \int_0^{1-\frac{1}{n}} X_n\, dZ_n \Big) \Rightarrow \Big( X, Z, \int_0^1 X\, dZ \Big).$$ The last proof in the paper corresponds to the calculation of the variance of ∫_0^1 (B_0(1) − B_0(s)) I dB(s).