Moderate deviations for recursive stochastic algorithms

We prove a moderate deviation principle for the continuous time interpolation of discrete time recursive stochastic processes. The methods of proof are somewhat different from those used for the corresponding large deviation result, and in particular the proof of the upper bound is more complicated. The results can be applied to the design of accelerated Monte Carlo algorithms for certain problems, where schemes based on moderate deviations are easier to construct and in certain situations provide performance comparable to that of schemes based on large deviations.


Introduction
In this paper we consider $\mathbb{R}^d$-valued discrete time processes of the form
$$X^n_{i+1} = X^n_i + \frac{1}{n}\,b(X^n_i) + \frac{1}{n}\,v_i(X^n_i),$$
where the $\{v_i(\cdot)\}_{i\in\mathbb{N}_0}$ are zero mean, independent and identically distributed (iid) random vector fields, and focus on their continuous time piecewise linear interpolations $\{X^n(t)\}_{0\le t\le T}$ with $X^n(i/n) = X^n_i$ (see (2.5) for the precise definition). Under certain conditions there is a law of large numbers limit $X^0 \in C([0,T]:\mathbb{R}^d)$, and the large deviations of $X^n$ from this limit have been studied extensively (see, e.g., [1,8,10,13,15]). Here we introduce a scaling $a(n)$ satisfying $a(n)\to 0$ and $a(n)\sqrt{n}\to\infty$, and study the amplified difference between $X^n$ and its noiseless version $X^{n,0}$ (see Section 2 for the definition of $X^{n,0}$):
$$Y^n = a(n)\sqrt{n}\left(X^n - X^{n,0}\right).$$
Under Condition 2.1 introduced below, $\sup_{t\in[0,T]}\|X^0(t) - X^{n,0}(t)\|$ is $O(1/n)$, and hence $Y^n$ behaves asymptotically the same as $a(n)\sqrt{n}(X^n - X^0)$. We demonstrate, under weaker conditions on the noise $v_i(\cdot)$ than are necessary when considering $X^n$, that $Y^n$ satisfies the large deviation principle on $C([0,T]:\mathbb{R}^d)$ with a "Gaussian" type rate function. As is customary for this type of scaling, we refer to this as moderate deviations.
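As a purely illustrative sketch (not from the paper), the following Python snippet simulates a scalar instance of the recursion with hypothetical choices $b(x) = -x$, standard Gaussian noise, and $a(n) = n^{-1/4}$, and prints the amplified difference $Y^n$ at time 1 together with the scaling requirements $a(n)\to 0$ and $a(n)\sqrt{n}\to\infty$:

```python
import math
import random

def simulate_Yn(n, a_n, b=lambda x: -x, x0=1.0, seed=0):
    """Run X_{i+1} = X_i + b(X_i)/n + v_i/n alongside the noiseless
    recursion X^{n,0}, and return Y^n = a(n)*sqrt(n)*(X^n - X^{n,0})
    evaluated at time 1."""
    rng = random.Random(seed)
    x = x_noiseless = x0
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)                          # zero mean iid noise
        x = x + b(x) / n + v / n                         # noisy recursion
        x_noiseless = x_noiseless + b(x_noiseless) / n   # noiseless version
    return a_n * math.sqrt(n) * (x - x_noiseless)

# Moderate deviation regime: a(n) -> 0 while a(n)*sqrt(n) -> infinity.
for n in (100, 10_000, 100_000):
    a_n = n ** (-0.25)
    print(n, a_n, a_n * math.sqrt(n), simulate_Yn(n, a_n))
```

The choice $a(n) = n^{-1/4}$ is one of many admissible scalings; any sequence between constants and $1/\sqrt{n}$ in order of magnitude fits the moderate deviation regime.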
To demonstrate this result we prove the equivalent Laplace principle, which involves evaluating limits of quantities of the form
$$-a(n)^2 \log E\left[\exp\left\{-\frac{1}{a(n)^2}\,F(Y^n)\right\}\right]$$
when $F$ is bounded and continuous. This is done by representing each of these quantities in terms of a stochastic control problem, and then using weak convergence methods as in [10]. The key results needed in this approach are the tightness of the controls and controlled processes, and the identification of their limits.
While one might expect the proof of this moderate deviations result to be similar to that of the corresponding large deviations result, there are important differences. For example, the tightness proof is significantly more complicated for moderate deviations than for large deviations. For large deviations one can establish an a priori bound on certain relative entropy costs associated with any sequence of nearly minimizing controls, and given this boundedness of the relative entropy costs, the empirical measures of the controlled driving noises as well as the controlled processes are tight. However, owing to the scaling in moderate deviations, even with the information that the analogous relative entropy costs decay like $O(1/a(n)^2 n)$, tightness of the empirical measures of the noises does not hold. Instead, one must consider empirical measures of the conditional means of the noises, and additional effort is required for the law of large numbers type result which shows that the conditional means suffice to determine the limit. This extra difficulty arises for moderate deviations (even with vanishing relative entropy costs) because the noise itself is amplified by $a(n)\sqrt{n}$.
A second way in which the proofs for large and moderate deviations differ is in their treatment of degenerate noise, i.e., problems where the support of $v_i(\cdot)$ is not all of $\mathbb{R}^d$. This degeneracy leads to significant difficulties in the proof of the large deviation lower bound, where it requires a delicate and involved mollification argument. In contrast, the corresponding proof in the setting of moderate deviations, though more involved than in the nondegenerate case, is much more straightforward.
As a potential application of these results we mention their usefulness in the design and analysis of Monte Carlo schemes for events whose probability is small but not very small. For such problems the performance of standard Monte Carlo may not be adequate, especially if the quantity must be computed for many different parameter settings, as in say an optimization problem. Then accelerated Monte Carlo may be of interest, and as is well known such schemes (e.g., importance sampling and splitting) benefit by using information contained in the large deviation rate function as part of the algorithm design (e.g., [3,6,11,12]). In a situation where one considers events of small but not too small probability one may find the moderate deviation approximation both adequate and relatively easy to apply, since moderate deviations lead to situations where the objects needed to design an efficient scheme can be explicitly constructed in terms of solutions to the linear-quadratic regulator. These issues will be explored elsewhere.
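To illustrate the kind of accelerated Monte Carlo scheme alluded to above, the sketch below implements importance sampling with an exponential change of measure for the toy event that the mean of $n$ iid standard Gaussians exceeds a level $b$. The tilt parameter $\theta = b$ is the classical choice suggested by the deviation analysis for Gaussian increments; all names and parameter values here are illustrative assumptions, not constructs from the paper.

```python
import math
import random

def p_exact(n, b):
    """P(mean of n iid N(0,1) >= b) = P(Z >= sqrt(n)*b)."""
    return 0.5 * math.erfc(math.sqrt(n) * b / math.sqrt(2.0))

def p_importance_sampling(n, b, num_samples=20_000, seed=1):
    """Estimate the same probability by sampling under the tilted law
    N(theta, 1) with theta = b and reweighting by the likelihood ratio
    dP/dQ = exp(-theta*S + n*theta^2/2), where S is the sample sum."""
    rng = random.Random(seed)
    theta = b
    total = 0.0
    for _ in range(num_samples):
        s = sum(rng.gauss(theta, 1.0) for _ in range(n))
        if s / n >= b:
            total += math.exp(-theta * s + n * theta * theta / 2.0)
    return total / num_samples

n, b = 25, 0.5
print(p_exact(n, b), p_importance_sampling(n, b))
```

Under the tilted measure the rare event becomes typical, so the estimator attains a small relative error with far fewer samples than standard Monte Carlo would need.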
The existing literature on moderate deviations considers various settings. Baldi [2] considers the same scaling used here but with no state dependence. For the empirical measure of a Markov chain, de Acosta [5] and de Acosta and Chen [4] prove lower and upper bounds, respectively. Guillin [16] considers inhomogeneous functionals of a "fast" continuous time ergodic Markov chain, and in [17] this is extended to a small noise diffusion whose coefficients depend on the "fast" Markov chain. There are also results for martingale differences, such as those of Dembo [7], Gao [14], and Djellout [9]. For various reasons, the issues previously mentioned regarding the difficulties in the proof of the upper bound and the simplification in the lower bound for degenerate noise do not play a role in these papers.
The paper is organized as follows. Section 2 gives the statement of the problem and notation. Section 3 contains the proof of tightness and the characterization of limits, which account for most of the mathematical difficulties, and are also the main results needed to prove the Laplace principle. Sections 4 and 5 give the proofs of the upper and lower Laplace bounds. Although all proofs are given for the time interval [0, 1], they extend with only notational differences to [0, T ] for any T ∈ (0, ∞).

Background and Notation
We again consider processes of the form
$$X^n_{i+1} = X^n_i + \frac{1}{n}\,b(X^n_i) + \frac{1}{n}\,v_i(X^n_i),$$
where the $\{v_i(\cdot)\}_{i\in\mathbb{N}_0}$ are zero mean iid vector fields with distribution given by the stochastic kernel $\mu_x(dy)$, and define the centered log moment generating function
$$H_c(x,\alpha) \doteq \log \int_{\mathbb{R}^d} e^{\langle \alpha, y\rangle}\,\mu_x(dy).$$
The subscript $c$ reflects the fact that this log moment generating function uses the centered distribution $\mu_x$, rather than the usual $H(x,\alpha) = H_c(x,\alpha) + \langle \alpha, b(x)\rangle$. We will use the following.

Condition 2.1
• There exists $\lambda > 0$ such that
$$\sup_{x\in\mathbb{R}^d}\ \sup_{\|\alpha\|\le\lambda} H_c(x,\alpha) < \infty. \tag{2.1}$$
• The map $x \mapsto \mu_x$ is continuous with respect to the topology of weak convergence.
• $b(x)$ is continuously differentiable, and the norms of both $b(x)$ and its derivative are uniformly bounded by some constant $K_b < \infty$.
Throughout this paper we let $\|\alpha\|^2_A \doteq \langle \alpha, A\alpha\rangle$ for any $\alpha \in \mathbb{R}^d$ and symmetric, nonnegative definite matrix $A$. Define the covariance matrices
$$A(x) \doteq \int_{\mathbb{R}^d} y\, y^T \mu_x(dy),$$
and note that the weak continuity of $\mu_x$ with respect to $x$ and (2.1) ensure that $A(x)$ is continuous in $x$ and its norm is uniformly bounded by some constant $K_A$. Note that $A(x)$ is nonnegative definite and symmetric for each $x \in \mathbb{R}^d$. For $x \in \mathbb{R}^d$ we can therefore write
$$A(x) = Q(x)\Lambda(x)Q(x)^T, \tag{2.2}$$
where $Q(x)$ is an orthogonal matrix whose columns are the eigenvectors of $A(x)$ and $\Lambda(x)$ is the diagonal matrix consisting of the eigenvalues of $A(x)$ in descending order. In what follows we define $\Lambda^{-1}(x)$ to be the diagonal matrix with diagonal entries equal to the inverse of the corresponding eigenvalue for the positive eigenvalues, and equal to $\infty$ for the zero eigenvalues. Then when we write
$$\|\alpha\|^2_{A^{-1}(x)} \doteq \left\langle \alpha,\, Q(x)\Lambda^{-1}(x)Q(x)^T \alpha \right\rangle$$
we mean a value of $\infty$ for $\alpha \in \mathbb{R}^d$ not in the linear span of the eigenvectors corresponding to the positive eigenvalues, and the standard value for vectors $\alpha \in \mathbb{R}^d$ in that linear span. Assumption (2.1) implies there exist some $K_{DA} < \infty$ and $\lambda_{DA} \in (0,\lambda]$ (independent of $x$) such that
$$\left| H_c(x,\alpha) - \tfrac{1}{2}\|\alpha\|^2_{A(x)} \right| \le K_{DA}\|\alpha\|^3 \quad \text{for all } \|\alpha\| \le \lambda_{DA},$$
and consequently for all $\|\alpha\| \le \lambda_{DA}$ and all $x \in \mathbb{R}^d$
$$H_c(x,\alpha) \le \tfrac{1}{2}\|\alpha\|^2_{A(x)} + K_{DA}\|\alpha\|^3.$$
Define the continuous time linear interpolation of $X^n_i$ by $X^n(i/n) \doteq X^n_i$ for $i = 0,\ldots,n$ and
$$X^n(t) \doteq X^n_i + (nt - i)\left(X^n_{i+1} - X^n_i\right) \quad \text{for } t \in (i/n, (i+1)/n). \tag{2.5}$$
In addition, define $X^{n,0}_{i+1} \doteq X^{n,0}_i + \frac{1}{n}\,b(X^{n,0}_i)$ with $X^{n,0}_0 \doteq X^n_0$, and let $X^{n,0}(t)$ be the analogously defined continuous time linear interpolation. Clearly $X^n \to X^0$ in probability. One can estimate probabilities of events involving paths away from the law of large numbers limit $X^0$ by proving a large deviation principle and finding the corresponding rate function. Under significantly stronger assumptions, including the assumption that $H_c(x,\alpha) < \infty$ for all $\alpha \in \mathbb{R}^d$, it has been shown that $X^n$ satisfies the large deviation principle on $C([0,1]:\mathbb{R}^d)$ with sequence $r(n) = 1/n$ and rate function
$$I_L(\phi) \doteq \int_0^1 L\big(\phi(s), \dot\phi(s)\big)\,ds, \qquad L(x,\beta) \doteq \sup_{\alpha\in\mathbb{R}^d}\left[\langle\alpha,\beta\rangle - H(x,\alpha)\right]$$
[10,19,20,21,22]. Assume $a(n)$ satisfies $a(n) \to 0$ and $a(n)\sqrt{n} \to \infty$.
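The convention for $\|\alpha\|^2_{A^{-1}(x)}$ when $A(x)$ is degenerate can be made concrete numerically. The following sketch (an illustration, with the spectral decomposition supplied by hand rather than computed) returns the standard value when $\alpha$ lies in the span of the eigenvectors with positive eigenvalues, and $\infty$ otherwise:

```python
import math

def weighted_sq_norm(Q, lam, alpha, tol=1e-12):
    """||alpha||^2_{A^{-1}} with A = Q diag(lam) Q^T given by its spectral
    decomposition (Q: orthonormal eigenvectors as columns, lam: eigenvalues).
    Returns inf when alpha has a component outside the span of the
    eigenvectors with positive eigenvalue, the convention for degenerate A."""
    d = len(lam)
    # coordinates of alpha in the eigenbasis: c_k = <q_k, alpha>
    c = [sum(Q[i][k] * alpha[i] for i in range(d)) for k in range(d)]
    if any(lam[k] <= tol and abs(c[k]) > tol for k in range(d)):
        return math.inf
    return sum(c[k] ** 2 / lam[k] for k in range(d) if lam[k] > tol)

# A = diag(2, 0): degenerate in the second coordinate.
Q = [[1.0, 0.0], [0.0, 1.0]]   # eigenvector columns
lam = [2.0, 0.0]
print(weighted_sq_norm(Q, lam, [1.0, 0.0]))  # 0.5
print(weighted_sq_norm(Q, lam, [0.0, 1.0]))  # inf
```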
As noted in the introduction, the result stated below also holds with $X^{n,0}$ replaced by $X^0$ in the definition of $Y^n$. The moderate deviations rate function is
$$I_M(\phi) \doteq \inf\left\{ \frac{1}{2}\int_0^1 \|w(s)\|^2_{A^{-1}(X^0(s))}\,ds \;:\; \phi(t) = \int_0^t Db(X^0(s))\phi(s)\,ds + \int_0^t w(s)\,ds \right\}.$$
$I_M$ is essentially the same as what one would obtain by using a linear approximation of the dynamics around the law of large numbers limit $X^0$ and a quadratic approximation of the costs in $I_L$. To prove the LDP, it suffices to show the equivalent Laplace principle [10].
Given $\eta, \mu \in \mathcal{P}(\mathbb{R}^d)$, the relative entropy of $\eta$ with respect to $\mu$ is defined by
$$R(\eta\,\|\,\mu) \doteq \int_{\mathbb{R}^d} \log\frac{d\eta}{d\mu}\,d\eta$$
if $\eta$ is absolutely continuous with respect to $\mu$, and $R(\eta\,\|\,\mu) \doteq \infty$ otherwise. For general properties of relative entropy we refer to [10]. The controlled processes $\bar X^n_i$ and $\bar Y^n_i \doteq a(n)\sqrt{n}\,(\bar X^n_i - X^{n,0}_i)$ are obtained by replacing the noises $v_i$ in the recursion by controlled noises $\bar v_i$ whose conditional distribution given the past is $\eta_i$, and, similar to (2.5), $\bar X^n(t)$ and $\bar Y^n(t)$ are the continuous time linear interpolations of $\{\bar X^n_i\}_{i=0,\ldots,n}$ and $\{\bar Y^n_i\}_{i=0,\ldots,n}$. Note that $\eta_i$ depends on past values of the noise, but we suppress this dependence in the notation. We will prove (2.7) by proving the lower bound (2.12) and the upper bound (2.13). We will use a tightness and weak convergence result in the proofs of both of these bounds, but first establish notation used in the rest of the paper.
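For a distribution on a finite set the definition of relative entropy reduces to a sum, which the following illustrative helper computes, returning $\infty$ exactly when absolute continuity fails:

```python
import math

def relative_entropy(eta, mu):
    """R(eta || mu) for distributions on a finite set, given as dicts
    mapping points to probabilities; returns inf when eta is not
    absolutely continuous with respect to mu."""
    r = 0.0
    for x, p in eta.items():
        if p == 0.0:
            continue
        q = mu.get(x, 0.0)
        if q == 0.0:
            return math.inf        # eta not << mu
        r += p * math.log(p / q)
    return r

mu = {"a": 0.5, "b": 0.5}
print(relative_entropy(mu, mu))            # 0.0
print(relative_entropy({"a": 1.0}, mu))    # log 2
print(relative_entropy({"c": 1.0}, mu))    # inf
```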
Let $\bar X^n(t)$ and $\bar Y^n(t)$, defined for $t \in [i/n, (i+1)/n]$, $i = 0,\ldots,n-1$, be their continuous time linear interpolations. Define the conditional means of the noises
$$w^n_i \doteq \int_{\mathbb{R}^d} y\,\eta^n_i(dy),$$
the amplified conditional means $\hat w^n_i \doteq a(n)\sqrt{n}\,w^n_i$ (with $\hat w^n(t) \doteq \hat w^n_i$ for $t \in [i/n, (i+1)/n)$), and random measures on $\mathbb{R}^d \times [0,1]$
$$\hat\eta^n(dy \otimes dt) \doteq \delta_{\hat w^n(t)}(dy)\,dt.$$
We will refer to this construction when given $\eta^n$ to identify the associated $\bar X^n$, $\bar Y^n$, $\hat w^n$ and $\hat\eta^n$. Given $\nu \in \mathcal{P}(E_1 \times E_2)$, with each $E_i$, $i = 1,2$, a Polish space, let $\nu_2$ denote the second marginal of $\nu$, and let $\nu_{1|2}$ denote the conditional distribution on $E_1$ given a point in $E_2$.
Theorem 2.5 Let $\{\eta^n\}$ be a sequence of measures, each $\eta^n \in \mathcal{P}((\mathbb{R}^d)^n)$, and define the corresponding random variables as in Construction 2.4. Assume that for some $K < \infty$
$$a(n)^2 n\, E\left[\frac{1}{n}\sum_{i=0}^{n-1} R\big(\eta^n_i\,\|\,\mu_{\bar X^n_i}\big)\right] \le K. \tag{2.14}$$
Consider a subsequence (keeping the index $n$ for convenience) such that $\{(\hat\eta^n, \bar Y^n)\}$ converges weakly to $(\hat\eta, \hat Y)$. Then with probability 1 $\hat\eta_2(dt)$ is Lebesgue measure and
$$\hat Y(t) = \int_0^t Db(X^0(s))\hat Y(s)\,ds + \int_0^t \int_{\mathbb{R}^d} y\,\hat\eta_{1|2}(dy \mid s)\,ds. \tag{2.15}$$
In addition,
$$\liminf_{n\to\infty} a(n)^2 n\, E\left[\frac{1}{n}\sum_{i=0}^{n-1} R\big(\eta^n_i\,\|\,\mu_{\bar X^n_i}\big)\right] \ge E\left[\frac{1}{2}\int_0^1 \Big\|\int_{\mathbb{R}^d} y\,\hat\eta_{1|2}(dy \mid t)\Big\|^2_{A^{-1}(X^0(t))}\,dt\right]. \tag{2.16}$$

Proof of Theorem 2.5

Assume that the bound (2.14) holds. We will show tightness of the $\{\hat\eta^n\}$ measures using the following lemma.
Lemma 3.1 Let
$$L_c(x,\beta) \doteq \sup_{\alpha\in\mathbb{R}^d}\left[\langle\alpha,\beta\rangle - H_c(x,\alpha)\right]$$
be the Legendre transform of $H_c(x,\cdot)$. Then for any $x \in \mathbb{R}^d$ and $\eta \in \mathcal{P}(\mathbb{R}^d)$
$$L_c\left(x, \int_{\mathbb{R}^d} y\,\eta(dy)\right) \le R\big(\eta\,\|\,\mu_x\big).$$
Proof. While the result is likely known, we could not locate a proof, and so for completeness we provide the details.
and consequently for any Define the bounded, continuous function and note that (3.3) and dominated convergence give In addition, dominated convergence gives and monotone convergence gives By the Donsker-Varadhan variational formula [10, Lemma 1.4.3(a)] which completes the proof of the lemma.
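On a finite space the Donsker–Varadhan variational formula can be verified directly: the supremum of $\int g\,d\eta - R(\eta\|\mu)$ over $\eta$ is attained at the exponentially tilted measure and equals $\log\int e^g\,d\mu$. The sketch below checks this identity numerically (an illustration; the finite-space setting and the particular $g$, $\mu$ are assumptions made for simplicity):

```python
import math

def dv_both_sides(g, mu):
    """Left side: log E_mu[e^g]. Right side: int g d(eta) - R(eta || mu)
    evaluated at the tilted measure eta(x) = e^{g(x)} mu(x) / Z, which
    attains the Donsker-Varadhan supremum on a finite space."""
    Z = sum(math.exp(g[x]) * p for x, p in mu.items())
    lhs = math.log(Z)
    eta = {x: math.exp(g[x]) * p / Z for x, p in mu.items()}
    rel_ent = sum(p * math.log(p / mu[x]) for x, p in eta.items() if p > 0.0)
    rhs = sum(eta[x] * g[x] for x in eta) - rel_ent
    return lhs, rhs

mu = {0: 0.2, 1: 0.3, 2: 0.5}
g = {0: -1.0, 1: 0.5, 2: 2.0}
lhs, rhs = dv_both_sides(g, mu)
print(lhs, rhs)
```

The two sides agree exactly up to floating point error, since substituting the tilted measure into the right side collapses it algebraically to $\log Z$.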
The lemma implies the following theorem, which in turn will give tightness of {η n }. LetK Then with e i denoting the standard unit vectors where the first inequality follows from making a specific choice of α and the second uses (2.4). Therefore Using the bound on L c from Lemma 3.1 together with (2.14), For the uniform integrability, let C ∈ (1, ∞) be arbitrary and consider n large enough that Since λ DA ≥ 1/a(n) √ n (which corresponds to using the estimate above with Therefore which combined with the estimate (3.4) with G replaced by C and (3.5) gives We conclude that which is the claimed uniform integrability.
We continue with the proof of Theorem 2.5. Note that $g(y,t) = \|y\|$ is a tightness function on $\mathbb{R}^d \times [0,1]$, so by [10, Theorem A.3.17]
$$\nu \mapsto \int_{\mathbb{R}^d \times [0,1]} \|y\|\,\nu(dy \otimes dt)$$
is a tightness function on $\mathcal{P}(\mathbb{R}^d \times [0,1])$, and its expectation is a tightness function on $\mathcal{P}(\mathcal{P}(\mathbb{R}^d \times [0,1]))$. Since this expectation is uniformly bounded by Theorem 3.2, $\{\hat\eta^n\}$ is tight, and consequently there is a subsequence of $\{\hat\eta^n\}$ which converges weakly. To simplify notation we retain $n$ as the index of this convergent subsequence, and denote the weak limit of $\{\hat\eta^n\}$ by $\hat\eta$. Note that for all $n$ the second marginal of $\hat\eta^n(dy \otimes dt)$, which we denote by $\hat\eta^n_2(dt)$, is Lebesgue measure, and therefore $\hat\eta_2(dt)$ is Lebesgue measure with probability 1.
Our aim is to show that $\bar Y^n \to \hat Y$ weakly in $C([0,1]:\mathbb{R}^d)$, where $\hat Y$ is given by (2.15) in terms of the weak limit $\hat\eta$. To achieve this we introduce the following processes, which serve as intermediate steps.
Define $\tilde Y^n$, the discrete time process obtained by replacing the noises in the definition of $\bar Y^n$ by their conditional means, together with its continuous time linear interpolation defined for $t \in [i/n, (i+1)/n]$. Note that $\bar Y^n$ differs from $\tilde Y^n$ because $\bar Y^n$ is driven by the actual noises while $\tilde Y^n$ is driven by their conditional means. While the driving terms of $\hat Y^n$ and $\tilde Y^n$ are the same [recall that $a(n)\sqrt{n}\,w^n(t) = \hat w^n(t)$], they differ in that $\tilde Y^n$ is still a linear interpolation of a discrete time process whereas $\hat Y^n$ satisfies an ODE. The goal is to show that, along the subsequence where $\hat\eta^n \to \hat\eta$ weakly, $\bar Y^n \to \hat Y$ in distribution. With this lemma and the uniform integrability of $\{\hat\eta^n\}$ given in Theorem 3.2, tightness follows. Proof. It suffices to show that for any $\varepsilon > 0$ there is $\delta > 0$ such that the modulus of continuity of $\hat Y^n$ over intervals of length $\delta$ is suitably small with high probability. By Theorem 3.2, $T(C) \to 0$ as $C \to \infty$. Define also $K_\eta \doteq \sup_{n\in\mathbb{N}} E\int_0^1 \|\hat w^n(t)\|\,dt$, which is finite by Theorem 3.2. Let $\varepsilon > 0$ be arbitrary. Then for any $s < t$ satisfying $t - s \le \delta$ the previous lemma implies

Hence by Markov's inequality
Choose $C < \infty$ such that $T(C) < \varepsilon^2/2$ and then choose $\delta > 0$ so that $\delta(C + K_b e^{K_b} K_\eta) < \varepsilon^2/2$. This shows the tightness of $\{\hat Y^n\}$. The tightness of $\{\int_0^\cdot \hat w^n(s)\,ds\}$ is simpler, and follows from the same type of bound. We still need to show that $\hat Y^n$ converges to $\hat Y$. This also relies on the uniform integrability given by Theorem 3.2.
It remains to show $\bar Y^n - \tilde Y^n \to 0$ and $\tilde Y^n - \hat Y^n \to 0$. We begin with $\bar Y^n - \tilde Y^n \to 0$. Recall that the difference between $\bar Y^n$ and $\tilde Y^n$ is that the first is driven by the actual noises and the second is driven by their conditional means. The following theorem is a law of large numbers type result for the difference between the noises and their conditional means, and is the most complicated part of the analysis. Proof. According to (2.14) the relative entropies $R(\eta^n_i \,\|\, \mu_{\bar X^n_i})$ are almost surely finite. Because of this the (random) Radon-Nikodym derivatives $d\eta^n_i/d\mu_{\bar X^n_i}$ are well defined and can be selected in a measurable way. We will control the magnitude of the noise when the Radon-Nikodym derivative is large by bounding the corresponding contribution to the average over $i$ for large $r$.
From the bound on the moment generating function (2.1), let $\sigma = \min\{\lambda/2^{d+1}, 1\}$ and recall the definition of $\ell(b)$. In addition, Markov's inequality gives, for $r \ge e^{-1}$, a corresponding tail bound. Since by Jensen's inequality the conditional means satisfy an analogous estimate, we obtain the overall bound. Using this result we can complete the proof. Define $W^n_k$. For any $\delta > 0$, we bound $P(\max_{k=0,\ldots,n-1} \cdots)$ by three terms. The second term is a submartingale, so Doob's submartingale inequality applies, and the finiteness is due to (2.1). We can use Jensen's inequality with the third term and get the same bound that was shown for the first. Combining the bounds for these three terms with (3.7) gives a bound on $P(\max_{k=0,\ldots,n-1} \cdots)$. Choosing $r = 1/a(n)$ and using $a(n) \to 0$, $a(n)\sqrt{n} \to \infty$ gives $P(\max_{k=0,\ldots,n-1} \cdots) \to 0$ as $n \to \infty$, which completes the proof.
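Doob's submartingale inequality used above can be checked exactly on a small example by exhaustive enumeration. The sketch below does so for the nonnegative submartingale $|S_k|$ built from a $\pm 1$ random walk (a toy illustration, not a construction from the proof):

```python
from itertools import product

def doob_check(n, lam):
    """Exhaustively verify P(max_k |S_k| >= lam) <= E|S_n| / lam for the
    nonnegative submartingale |S_k| built from a +-1 random walk."""
    paths = list(product((-1, 1), repeat=n))
    hits, abs_end = 0, 0
    for path in paths:
        s, running_max = 0, 0
        for step in path:
            s += step
            running_max = max(running_max, abs(s))
        if running_max >= lam:
            hits += 1
        abs_end += abs(s)
    prob = hits / len(paths)
    bound = (abs_end / len(paths)) / lam
    return prob, bound

prob, bound = doob_check(n=10, lam=4)
print(prob, bound)
```

Because all $2^{10}$ paths are enumerated, the comparison is exact rather than a Monte Carlo approximation.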
This theorem, combined with the following discrete version of Gronwall's inequality, will allow us to prove $\bar Y^n - \tilde Y^n \to 0$. Proof. Recall the recursion satisfied by $\bar Y^n - \tilde Y^n$, so that with $W^n_k$ defined as in Theorem 3.6, using Lemma 3.7 gives the desired bound. Since $\max_{i\in\{1,\ldots,n\}} \|W^n_i\| \to 0$ in probability, $\max_{i\in\{1,\ldots,n\}} \|\bar Y^n_i - \tilde Y^n_i\| \to 0$ in probability as well. To complete the proof of the convergence we need to show $\tilde Y^n - \hat Y^n \to 0$. Recall that these two processes have the same driving terms but different drifts, in that $\hat Y^n$ satisfies the ODE
$$\hat Y^n(t) = \int_0^t Db(X^0(s))\hat Y^n(s)\,ds + \int_0^t \hat w^n(s)\,ds,$$
while $\tilde Y^n$ is the linear interpolation of a discrete time process. However, essentially the same arguments as those used in Lemma 3.4 to show tightness of $\{\hat Y^n\}$ can be used to prove tightness of $\{\tilde Y^n\}$, and then it easily follows as in Lemma 3.5 that any limit will satisfy the same ODE (2.15) as the limit of $\{\hat Y^n\}$, and therefore $\tilde Y^n - \hat Y^n \to 0$ follows. Combining $\bar Y^n - \tilde Y^n \to 0$, $\tilde Y^n - \hat Y^n \to 0$, and $\hat Y^n \to \hat Y$ demonstrates that along the subsequence where $\hat\eta^n \to \hat\eta$ weakly, $\bar Y^n \to \hat Y$ in distribution, which implies that along this subsequence $(\hat\eta^n, \bar Y^n) \to (\hat\eta, \hat Y)$ weakly. We have already shown that with probability 1 $\hat\eta_2(dt)$ is Lebesgue measure, and so the proof of convergence (i.e., the first part of Theorem 2.5) is complete.
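The discrete Gronwall estimate invoked here has the familiar form: if $z_{k+1} \le (1+a)z_k + b_k$ with $a, b_k, z_0 \ge 0$, then $z_n \le (z_0 + \sum_k b_k)\,e^{an}$. The following sketch checks this bound numerically on an illustrative recursion (the constants are arbitrary assumptions):

```python
import math

def gronwall_check(z0, a, b_seq):
    """Iterate z_{k+1} = (1 + a) * z_k + b_k and compare with the discrete
    Gronwall bound z_n <= (z0 + sum(b_k)) * exp(a * n)."""
    z = z0
    for b_k in b_seq:
        z = (1.0 + a) * z + b_k
    bound = (z0 + sum(b_seq)) * math.exp(a * len(b_seq))
    return z, bound

z, bound = gronwall_check(z0=1.0, a=0.1, b_seq=[0.5] * 10)
print(z, bound)
```

The bound follows by unrolling the recursion and using $(1+a)^n \le e^{an}$.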
Note that the tightness of $\{\gamma^n\}$ follows easily from (3.8) and from the tightness of $\{\hat\eta^n\}$. Thus given any subsequence we can choose a further subsequence (again retaining $n$ as the index for simplicity) along which $\{\gamma^n\}$ converges weakly to some limit $\gamma$, where $\gamma_{2,3}$ denotes the second and third marginal of $\gamma$. If we establish (2.16) for this subsequence, it follows for the original one by a standard argument by contradiction. For $\sigma > 0$ let $\{x : \|x - X^0(t)\| \le \sigma\}$ be the closed sets centered around $X^0(t)$ in the $x$ variable, and note that by (3.8) and weak convergence, for all $\sigma > 0$

January 24, 2014
Thus with probability 1 $\gamma$ puts all its mass on $\{(x,y,t) : x = X^0(t)\}$. Therefore with probability 1, for a.e. $(y,t)$ under $\gamma_{2,3}(dy \otimes dt)$, the first coordinate equals $X^0(t)$. Combined with the fact that the second marginal of $\hat\eta(dy \otimes dt)$ is Lebesgue measure, this gives the decomposition (3.9). Then, uniformly in $x$ and over compact subsets of $\beta$, and as $K \to \infty$, the truncated Legendre transforms increase to $L_c$ for all $(x,\beta) \in \mathbb{R}^{2d}$. Combining this with Lemma 3.1 and using Fatou's lemma for weak convergence gives the corresponding lower bound for all $K$. Then using the monotone convergence theorem, the decomposition (3.9), and Jensen's inequality, in that order, shows (2.16).

Laplace Upper Bound
The goal of this section is to prove (2.12), which due to the minus sign corresponds to the Laplace upper bound. Suppose for each $n$ that $\eta^n$ comes within $\varepsilon$ of achieving the infimum in (2.9), so that the associated relative entropy costs satisfy (2.14) for some $K < \infty$. Consequently we can choose a subsequence of $\{\eta^n\}$ (we retain $n$ as the index for convenience) along which the conclusions of Theorem 2.5 hold. Combining this with (4.1) gives the desired inequality, with $\phi_u$ defined as in (2.8). Since $\varepsilon > 0$ is arbitrary, we have the lower bound (2.12).

Laplace Lower Bound
The goal of this section is to prove (2.13). Note that for $u, v \in L^2([0,1]:\mathbb{R}^d)$ the corresponding controlled trajectories can be compared; thus by Gronwall's inequality the proof of the Laplace lower bound is reduced to showing (5.2) for an arbitrary $u \in C([0,1]:\mathbb{R}^d)$. The main difficulty is to deal with the possible degeneracy of the noise. Recall the orthogonal decomposition (2.2) of $A^{-1}(x)$. Define $u_K$, the truncation of $u$ set equal to $0$ where $\|u(s)\| > K$.
, and note that $\phi_{u,K}$ solves the linear equation
$$\phi_{u,K}(t) = \int_0^t Db(X^0(s))\phi_{u,K}(s)\,ds + \int_0^t A_K^{1/2}\big(X^0(s)\big)u_K(s)\,ds.$$
To simplify notation we define $s^n_i \doteq i/n$ and $s^n(t) \doteq \lfloor nt \rfloor / n$, where $\lfloor a \rfloor$ is the integer part of $a$. Note that $s^n(t) - t \to 0$ uniformly for $t \in [0,1]$ as $n \to \infty$. For $n$ sufficiently large
$$\max_{0 \le i \le n-1} \frac{1}{a(n)\sqrt{n}}\left\| A_K^{-1/2}\big(X^0(s^n_i)\big)\,u_K(s^n_i) \right\| \le \frac{K^2}{a(n)\sqrt{n}} \le \lambda_{DA},$$
and we can define the sequence $\{(\bar X^{n,u,K}, \bar Y^{n,u,K}, \eta^{n,u,K}, \hat\eta^{n,u,K})\}$ as in Construction 2.4 with
$$\eta^{n,u,K}_i(dy) \doteq \exp\left\{ \left\langle y,\ \tfrac{1}{a(n)\sqrt{n}} A_K^{-1/2}\big(X^0(s^n_i)\big) u_K(s^n_i) \right\rangle - H_c\Big( \bar X^{n,u,K}_i,\ \tfrac{1}{a(n)\sqrt{n}} A_K^{-1/2}\big(X^0(s^n_i)\big) u_K(s^n_i) \Big) \right\} \mu_{\bar X^{n,u,K}_i}(dy).$$
The next result identifies the limit in probability of the controlled processes and an asymptotic bound for the relative entropies.
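The change of measure $\eta^{n,u,K}_i$ is an exponential tilt of the base kernel, normalized by the log moment generating function. In the scalar, finite-support case the construction reduces to the following sketch (an illustrative toy, with $\mu$ a hypothetical centered two-point distribution):

```python
import math

def tilt(mu, theta):
    """Exponentially tilt a scalar finite-support distribution mu
    (dict: value -> probability): eta(y) = exp(theta*y - H(theta)) mu(y),
    where H(theta) = log sum_y exp(theta*y) mu(y) is the log mgf."""
    H = math.log(sum(math.exp(theta * y) * p for y, p in mu.items()))
    return {y: math.exp(theta * y - H) * p for y, p in mu.items()}

mu = {-1.0: 0.5, 1.0: 0.5}      # hypothetical centered two-point noise
eta = tilt(mu, 1.0)
mean_eta = sum(y * p for y, p in eta.items())
print(eta, mean_eta)            # the tilted mean equals tanh(1)
```

Subtracting the log moment generating function in the exponent is exactly what makes the tilted measure a probability distribution, and the tilt shifts the mean in the direction of the control.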
Using (2.9) and the fact that any given control is suboptimal,
$$\limsup_{n\to\infty}\left( -a(n)^2 \log E\left[ e^{-F(Y^n)/a(n)^2} \right] \right) \le \frac{1}{2}\int_0^1 \|u_K(s)\|^2\,ds + F(\phi_{u,K}).$$
Sending $K \to \infty$ and using Theorem 5.2 gives (5.2), and hence completes the proof of the lower bound (2.13).