Stochastic Gradient Descent in Continuous Time: A Central Limit Theorem

Stochastic gradient descent in continuous time (SGDCT) provides a computationally efficient method for the statistical learning of continuous-time models, which are widely used in science, engineering, and finance. The SGDCT algorithm follows a (noisy) descent direction along a continuous stream of data. The parameter updates occur in continuous time and satisfy a stochastic differential equation. This paper analyzes the asymptotic convergence rate of the SGDCT algorithm by proving a central limit theorem (CLT) for strongly convex objective functions and, under slightly stronger conditions, for non-convex objective functions as well. An L$^p$ convergence rate is also proven for the algorithm in the strongly convex case. The mathematical analysis lies at the intersection of stochastic analysis and statistical learning.


Introduction
"Stochastic gradient descent in continuous time" (SGDCT) is a statistical learning algorithm for continuous-time models, which are common in science, engineering, and finance. SGDCT is the continuous-time analog of the well-known stochastic gradient descent algorithm. Given a continuous stream of data, SGDCT can estimate unknown parameters or functions in stochastic differential equation (SDE) models. [3] analyzes the numerical performance of SGDCT for a number of applications in finance and engineering.
Batch optimization for the statistical estimation of continuous-time models may be impractical for large datasets where observations occur over a long period of time. Batch optimization takes a sequence of descent steps, each computed from the model error over the entire observed data path. Because every descent step requires a pass over the full path, batch optimization is slow (sometimes impractically slow) for long observation periods or for models which are computationally costly to evaluate (e.g., partial differential equations).
SGDCT provides a computationally efficient method for statistical learning over long time periods and for complex models. SGDCT continuously follows a (noisy) descent direction along the path of the observation; this results in much more rapid convergence. Parameters are updated online in continuous time, with the parameter updates $\theta_t$ satisfying a stochastic differential equation.
Consider a diffusion $X_t \in \mathcal{X} = \mathbb{R}^m$:
$$dX_t = f^*(X_t)\,dt + \sigma\,dW_t. \qquad (1.1)$$
The goal is to statistically estimate a model $f(x, \theta)$ for $f^*(x)$, where $\theta \in \mathbb{R}^k$. The function $f^*(x)$ is unknown. $W_t \in \mathbb{R}^m$ is a standard Brownian motion, and the diffusion term $\sigma\,dW_t$ represents any random behavior of the system or environment. The functions $f(x, \theta)$ and $f^*(x)$ may be non-convex. The stochastic gradient descent update in continuous time follows the SDE:
$$d\theta_t = \alpha_t \nabla_\theta f(X_t; \theta_t)\left(dX_t - f(X_t; \theta_t)\,dt\right), \qquad (1.2)$$
where $\nabla_\theta f(X_t; \theta_t)$ is matrix-valued and $\alpha_t$ is the learning rate; for example, $\alpha_t$ could equal $\frac{C_\alpha}{C_0 + t}$. The parameter update (1.2) can be used both for statistical estimation given previously observed data and for online learning (i.e., statistical estimation in real time as data becomes available).
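To make the parameter update (1.2) concrete, the following is a minimal simulation sketch (illustrative, not from the paper): a one-dimensional Ornstein–Uhlenbeck process with true drift $f^*(x) = -\theta^* x$, model $f(x, \theta) = -\theta x$, and learning rate $\alpha_t = C_\alpha/(C_0 + t)$. All numerical values are assumptions chosen for illustration.

```python
import numpy as np

def sgdct_ou(theta_star=0.5, sigma=1.0, C_alpha=2.0, C0=5.0,
             T=200.0, dt=0.01, seed=0):
    """Euler-Maruyama sketch of the SGDCT update for f(x, theta) = -theta * x.

    True dynamics: dX_t = -theta_star * X_t dt + sigma dW_t.
    Update: d(theta_t) = alpha_t * grad_theta f(X_t; theta_t)
                         * (dX_t - f(X_t; theta_t) dt),
    with grad_theta f(x, theta) = -x.
    """
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    x, theta = 0.0, 0.0
    for k in range(n):
        t = 1.0 + k * dt
        alpha = C_alpha / (C0 + t)
        dW = np.sqrt(dt) * rng.standard_normal()
        dx = -theta_star * x * dt + sigma * dW            # observed increment dX_t
        theta += alpha * (-x) * (dx - (-theta * x) * dt)  # SGDCT update
        x += dx
    return theta
```

Since the averaged objective is strongly convex here ($\Delta\bar g = \mathbb{E}_\pi[x^2] > 0$), the estimate $\theta_t$ should drift toward $\theta^*$ as the observation horizon grows.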
We assume that $X_t$ is sufficiently ergodic (to be made precise later in the paper) and that it has some well-behaved $\pi(dx)$ as its unique invariant measure. As a general notation, if $h(x, \theta)$ is a generic $L^1(\pi)$ function, then we define its average over $\pi(dx)$ to be $\bar h(\theta) = \int_{\mathcal{X}} h(x, \theta)\,\pi(dx)$.

Let us set
$$g(x, \theta) = \tfrac{1}{2}\,\|f(x, \theta) - f^*(x)\|^2.$$
The gradient $\nabla_\theta g(X_t, \theta)$ cannot be evaluated, since $f^*(x)$ is unknown. However, $dX_t = f^*(X_t)\,dt + \sigma\,dW_t$ is a noisy estimate of $f^*(x)\,dt$, which leads to the algorithm (1.2). SGDCT follows a noisy descent direction along a continuous stream of data produced by $X_t$.
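The substitution behind (1.2) can be written out explicitly (a schematic derivation, assuming the standard squared-error objective $g(x,\theta) = \frac{1}{2}\|f(x,\theta) - f^*(x)\|^2$):
$$\nabla_\theta g(x,\theta) = \nabla_\theta f(x,\theta)\left(f(x,\theta) - f^*(x)\right),$$
so that exact continuous-time gradient descent would read $d\theta_t = -\alpha_t \nabla_\theta f(X_t;\theta_t)\left(f(X_t;\theta_t) - f^*(X_t)\right)dt$. Replacing the unobservable $f^*(X_t)\,dt$ by the observable increment $dX_t = f^*(X_t)\,dt + \sigma\,dW_t$ gives
$$d\theta_t = \alpha_t \nabla_\theta f(X_t;\theta_t)\left(dX_t - f(X_t;\theta_t)\,dt\right) = -\alpha_t \nabla_\theta g(X_t,\theta_t)\,dt + \alpha_t \nabla_\theta f(X_t;\theta_t)\,\sigma\,dW_t,$$
i.e., the exact descent direction plus a mean-zero martingale perturbation.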
$\bar g$ is a natural objective function for the estimation of the continuous-time dynamics. Heuristically, it is expected that $\theta_t$ will tend towards the minimum of the function $\bar g(\theta) = \int_{\mathcal{X}} g(x, \theta)\,\pi(dx)$. The data $X_t$ will be correlated over time, which complicates the mathematical analysis. This differs from the standard discrete-time version of stochastic gradient descent, where the data is usually assumed to be i.i.d. at every step.
Under appropriate conditions, [3] proves that $\nabla\bar g(\theta_t) \to 0$ almost surely as $t \to \infty$. In this paper, we prove several new results regarding the convergence rate of the algorithm $\theta_t$.
The first main result of the paper is an $L^p$ convergence rate and a central limit theorem in the case where $\bar g(\theta)$ is strongly convex. Let $\theta^*$ be the global minimum of $\bar g(\theta)$. We prove that $\mathbb{E}[\|\theta_t - \theta^*\|^p] \le \frac{K}{(C_0 + t)^{p/2}}$ for $p \ge 1$ and $\sqrt{t}(\theta_t - \theta^*) \xrightarrow{d} N(0, \Sigma)$ when $\bar g(\theta)$ is strongly convex and the learning rate is $\alpha_t = \frac{C_\alpha}{C_0 + t}$. The $L^p$ convergence rate is proven in Theorem 2.7 and the central limit theorem in Theorem 2.8. We prove these results for models $f(x, \theta)$ with up to quadratic growth in $\theta$ and polynomial growth in $x$. Theorems 2.7 and 2.8 do not make use of the results in [3].
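The $O(1/t)$ mean-square rate can be illustrated numerically (an illustrative experiment, not from the paper): simulate many independent SGDCT paths for the linear model $f(x,\theta) = -\theta x$ and compare the empirical $\mathbb{E}[(\theta_t - \theta^*)^2]$ at two times. With $CC_\alpha > 1$, the mean squared errors should be roughly in the inverse ratio of the times. All parameter values below are assumptions.

```python
import numpy as np

def mse_at_times(n_paths=300, theta_star=0.5, sigma=1.0,
                 C_alpha=2.0, C0=5.0, T=80.0, dt=0.02, seed=1):
    """Monte Carlo estimate of E[(theta_t - theta*)^2] at t ~ T/4 and t ~ T."""
    rng = np.random.default_rng(seed)
    n = int(round(T / dt))
    x = np.zeros(n_paths)
    theta = np.zeros(n_paths)
    mse = {}
    for k in range(n):
        t = 1.0 + k * dt
        alpha = C_alpha / (C0 + t)
        dW = np.sqrt(dt) * rng.standard_normal(n_paths)
        dx = -theta_star * x * dt + sigma * dW
        theta += alpha * (-x) * (dx + theta * x * dt)   # SGDCT update
        x += dx
        if k == n // 4 - 1 or k == n - 1:
            mse[round(t)] = np.mean((theta - theta_star) ** 2)
    t_early, t_late = sorted(mse)
    return mse[t_early], mse[t_late]
```

Empirically, the late-time mean squared error should be roughly a quarter of the early-time one, consistent with a $K/t$ decay.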
The second main result is a central limit theorem when the objective function $\bar g(\theta)$ is non-convex. We prove that $\sqrt{t}(\theta_t - \theta^*) \xrightarrow{d} N(0, \Sigma)$ when $\bar g(\theta)$ is non-convex with a single critical point; see Theorem 2.12. We prove this result for models $f(x, \theta)$ with up to linear growth in $\theta$ and polynomial growth in $x$. As part of the proof, we also strengthen the convergence result of [3], which did not allow $f(x, \theta)$ to grow in $\theta$; see Theorem 2.10.
Analogous results to Theorems 2.7, 2.8, and 2.12 hold, of course, for a general class of learning rates $\alpha_t$; see Proposition 2.13. The precise statements of the mathematical results and the required technical assumptions are presented in Section 2.
These mathematical results are important for two reasons. First, they establish theoretical guarantees for the rate of convergence of the algorithm. Second, they can be used to analyze the effects of different features such as the learning rate $\alpha_t$, the level of noise $\sigma$, and the shape of the objective function $\bar g(\theta)$. We are able to precisely characterize the regime in which the optimal convergence rate is attained, as well as the limiting covariance $\Sigma$. The regime depends entirely upon the choice of the learning rate.
The proofs in this paper require addressing several challenges. First, fluctuation terms of the form $\int_0^t \alpha_s\left(h(X_s, \theta_s) - \bar h(\theta_s)\right)ds$ must be analyzed. We evaluate, and control with rate $\alpha_t^2$, these fluctuations using a Poisson partial differential equation. Second, the model $f(x, \theta)$ is allowed to grow with $\theta$. This means that the fluctuation terms, as well as other terms, can grow with $\theta$; therefore, we must prove an a priori stability estimate for $\theta_t$. Proving a central limit theorem for non-convex $\bar g(\theta)$ in Theorem 2.12 is challenging since the convergence speed of $\theta_t$ can become arbitrarily slow in certain regions, and the gradient can even point away from the global minimum $\theta^*$. We prove the central limit theorem for the non-convex case by analyzing two regimes, $[0, \tau_\delta]$ and $[\tau_\delta, \infty)$, where $\tau_\delta$ is defined such that $\|\theta_t - \theta^*\| < \delta$ for all $t \ge \tau_\delta$. The proof also requires the analysis of stochastic integrals with anticipative integrands, which is challenging since standard approaches (such as the Itô isometry) cannot be directly applied.

Literature Review
The vast majority of the statistical learning, machine learning, and stochastic gradient descent literature addresses discrete-time algorithms. In contrast, this paper analyzes a statistical learning algorithm in continuous time. Below we review the existing literature that is most relevant to our work. We also comment on the importance of developing and analyzing continuous-time algorithms for addressing continuous-time models.
Many discrete-time papers study algorithms for $\theta_n$ without the $X$-dynamics (for example, stochastic gradient descent with i.i.d. noise at each step). The inclusion of the $X$-dynamics makes the analysis significantly more challenging. An $L^2$ convergence rate and a central limit theorem for discrete-time stochastic gradient descent are presented in [8]. Our setup and assumptions are different from those of [8]. Our proof leverages the continuous-time nature of our setting, which is the formulation of interest in many engineering and financial problems (see [3]).
[4] studies continuous-time stochastic mirror descent in a setting different from ours. In the framework of [4], the objective function is known. In this paper, we consider the statistical estimation of the unknown dynamics of a random process (i.e., the $X$ process satisfying (1.1)).
Statisticians and financial engineers have actively studied parameter estimation for SDEs, although typically not with statistical learning or machine learning approaches. The likelihood function is usually calculated from the entire observed path of $X$ (i.e., batch optimization) and then maximized to find the maximum likelihood estimator (MLE). Unlike in this paper, the actual optimization procedure used to maximize the likelihood function is often not analyzed. Some relevant publications in the financial statistics literature include [1], [2], [7], and [10]. [7] derives the likelihood function for continuously observed $X$; the MLE can then be calculated via batch optimization.
[1] and [2] consider the case where $X$ is discretely observed and calculate MLEs via a batch optimization approach. [10] estimates parameters by a Bayesian approach. Readers are referred to [9, 15, 11] for thorough reviews of classical statistical inference methods for stochastic differential equations.
Continuous-time models are common in engineering and finance. There are often coefficients or functions in these models which are uncertain or unknown; stochastic gradient descent can be used to learn these model parameters from data.
It is natural to ask why one should use SGDCT instead of the straightforward approach which (1) discretizes the continuous-time dynamics and then (2) applies traditional stochastic gradient descent. We elaborated on this issue in detail in [3], where specific examples are provided to showcase the differences. For completeness, let us briefly discuss the issues that arise.
SGDCT allows the numerical scheme of choice to be applied to the theoretically correct statistical learning equation for continuous-time models. This can lead to more accurate and more computationally efficient parameter updates. Numerical schemes are always applied to continuous-time dynamics, and different numerical schemes may have different properties for different continuous-time models. Discretizing the system dynamics a priori and then applying a traditional discrete-time stochastic gradient descent scheme can result in a loss of accuracy, or may even fail to converge; see [3]. For example, there is no guarantee that (1) using a higher-order accurate scheme to discretize the system dynamics and then (2) applying traditional stochastic gradient descent will produce a statistical learning scheme which is higher-order accurate in time. Hence, it makes sense to first develop the continuous-time statistical learning equation, and then apply the higher-order accurate numerical scheme.
In addition to model estimation, SGDCT can be used to solve continuous-time optimization problems, such as American options. In [3], SGDCT was combined with a deep neural network to solve American options in up to 100 dimensions. Alternatively, one could discretize the dynamics and then use the Q-learning algorithm (traditional stochastic gradient descent applied to an approximation of the discrete HJB equation). However, as we showed in [3], Q-learning is biased while SGDCT is unbiased. Furthermore, in SDE models with Brownian motions, the Q-learning algorithm can blow up as the time step size $\Delta$ becomes small; see [3] for details.

Organization of Paper
In Section 2 we state our assumptions and the main results of this paper. The proof of the $L^p$ convergence rate for $p \ge 1$ is in Section 3. The proof of the CLT in the strongly convex case is in Section 4. Section 5 proves the central limit theorem for a class of non-convex models. In Section 6, the convergence rate results are used to analyze the behavior and dynamics of the SGDCT algorithm. Some technical results required for the proofs are presented in Appendix A. Appendix B contains the proof of Theorem 2.10, which strengthens the convergence result of [3]. In particular, Appendix B provides the necessary adjustments to the proofs of [3] in order to guarantee convergence in the case where the model $f(x, \theta)$ is allowed to grow with respect to $\theta$.

Main Results
We prove three main results. Theorem 2.7 is an $L^p$ convergence rate for the strongly convex case. Theorem 2.8 is a central limit theorem for the strongly convex case. Theorem 2.12 is a central limit theorem for the non-convex case with a single critical point.
We say that "the function $h(\theta)$ is strongly convex with constant $C$" if there exists a $C > 0$ such that $z^\top \Delta_\theta h(\theta) z \ge C z^\top z$ for any non-zero $z \in \mathbb{R}^k$, where $\Delta_\theta$ denotes the Hessian. Conditions 2.6 and 2.11 require that $CC_\alpha > 1$, where $C_\alpha$ is the magnitude of the learning rate and $C$ is the strong convexity constant of the objective function $\bar g(\theta)$ at the point $\theta = \theta^*$. This is an important conclusion of the convergence analysis in this paper: the learning rate needs to be sufficiently large in order to achieve the optimal rate of convergence. This dependence of the convergence rate on the learning rate is not specific to the SGDCT algorithm studied in this paper, but applies to other algorithms as well (even deterministic gradient descent algorithms). We discuss this in more detail in Section 6.
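The role of the product $CC_\alpha$ can already be seen in the noiseless linearized dynamics $\dot y_t = -\alpha_t C y_t$ with $\alpha_t = C_\alpha / t$, whose exact solution is $y_t = y_1\, t^{-CC_\alpha}$: the bias decays polynomially with exponent $CC_\alpha$, so a small learning rate yields an arbitrarily slow rate. A small sketch (illustrative values, not from the paper) verifying the closed form against an Euler integration:

```python
def euler_decay(cc_alpha, t_end=100.0, dt=0.01):
    """Euler-integrate dy/dt = -(cc_alpha / t) * y from t = 1, y(1) = 1.

    The exact solution is y(t) = t ** (-cc_alpha).
    """
    y, t = 1.0, 1.0
    n = int((t_end - 1.0) / dt)
    for _ in range(n):
        y += -(cc_alpha / t) * y * dt
        t += dt
    return y

# Larger cc_alpha = C * C_alpha gives a much faster bias decay:
slow = euler_decay(0.5)   # close to 100 ** -0.5
fast = euler_decay(2.0)   # close to 100 ** -2.0
```

The full stochastic analysis is, of course, more delicate (the noise contributes the $1/\sqrt{t}$ fluctuation scale), but the deterministic part already shows why the exponent $CC_\alpha$ must be large enough.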
It should be emphasized that the assumptions for Theorems 2.7 and 2.8 (the strongly convex case) allow the model $f(x, \theta)$ to grow up to quadratically in $\theta$. On the other hand, Theorem 2.12 (the non-convex case) is proven under the assumption that $f(x, \theta)$ grows at most linearly in $\theta$. The growth of the model $f(x, \theta)$ is allowed to be polynomial in $x$ for Theorems 2.7, 2.8, and 2.12.
The proofs of Theorems 2.7, 2.8, and 2.12 are in Sections 3, 4, and 5, respectively. Some technical results required for these proofs are presented in Appendix A. Appendix B contains the changes necessary to generalize the proof of convergence of [3] from bounded $\bar g(\theta)$ to $\bar g(\theta)$ that can grow up to quadratically in $\theta$; see Theorem 2.10 for the corresponding rigorous statement.
Let us now list our conditions. Condition 2.1 guarantees the existence and uniqueness of an invariant measure for the $X$ process. Condition 2.1 and the second part of Condition 2.2 guarantee that equation (1.1) is well-posed.
Condition 2.1. We assume that $\sigma\sigma^\top$ is a non-degenerate, bounded diffusion matrix and that $\lim_{|x| \to \infty} f^*(x) \cdot x = -\infty$. In regards to the regularity of the functions involved, we impose the following Condition 2.2.
Then, there exists a function $\lambda(x)$, growing no faster than polynomially in $\|x\|$, such that for any $\theta_1, \theta_2 \in \mathbb{R}^k$,
$$\|\nabla_\theta f(x, \theta_1) - \nabla_\theta f(x, \theta_2)\| \le \lambda(x)\,\rho(\|\theta_1 - \theta_2\|),$$
where $\rho(\xi)$ is an increasing function on $[0, \infty)$ with $\rho(0) = 0$ and $\int_{\xi > 0} \rho^{-2}(\xi)\,d\xi = \infty$. Notice that Condition 2.4 simplifies considerably when $\theta$ is one-dimensional, and that it always holds when $\nabla_\theta f(x, \theta)$ is independent of $\theta$ (which happens, for example, when $f(x, \theta)$ is affine in $\theta$).
Conditions 2.3 and 2.4 are needed to prove that $\sup_{t \ge 0} \mathbb{E}[\|\theta_t\|^p] < K$ for $p \ge 2$. This uniform bound on moments, which we prove in Appendix A.1, is in turn required for the proofs of the $L^p$ convergence rate and the central limit theorem in Theorems 2.7 and 2.8. The function $g(x, \theta)$ is allowed to grow, which means that some a priori bounds on the moments of $\theta_t$ are necessary.
In regards to the learning rate, we assume Condition 2.5. The learning rate is $\alpha_t = \frac{C_\alpha}{C_0 + t}$, where $C_\alpha > 0$ is a constant. Condition 2.5 makes the presentation of the results easier; as we shall see in Proposition 2.13, this specific form is not necessary as long as $\alpha_t$ satisfies certain conditions. We chose to present the results under this specific form both for ease of presentation and because this is the usual form that the learning rate takes in practice; the result for a general learning rate is given in Proposition 2.13. Condition 2.6 requires that $\bar g(\theta)$ is strongly convex with constant $C$ and that $CC_\alpha > 1$.
As will be seen from the proofs of Theorems 2.7 and 2.8, this requirement on $CC_\alpha$ is what guarantees the optimal convergence rate.
In order to state the central limit theorem results, we need to introduce some notation. Let us denote by $v(x, \theta)$ the solution to the Poisson equation (A.5), in terms of which the limiting covariance matrix $\Sigma$ is defined. The main result of [3] states that if $\bar g(\theta)$ and its derivatives are bounded with respect to $\theta$, then under ergodicity assumptions on the $X$ process one has convergence of the algorithm, in the sense that $\lim_{t \to \infty} \nabla_\theta \bar g(\theta_t) = 0$. In this paper, we allow growth of $\bar g(\theta)$ with respect to $\theta$. In particular, as we shall state in Theorem 2.10 and prove in Appendix B, the results of [3] hold true without considerable extra work if one allows up to linear growth of $f(x, \theta)$ with respect to $\theta$, which translates into up to quadratic growth of $\bar g(\theta)$ and up to linear growth of $\nabla\bar g(\theta)$ with respect to $\theta$. Let us formalize the required assumptions in the form of Condition 2.9 below.

Condition 2.9. $\nabla\bar g(\theta)$ is globally Lipschitz.
Theorem 2.10. Assume that Conditions 2.1, 2.2, 2.3, 2.4, 2.5, and 2.9 hold. Then, $\lim_{t \to \infty} \|\nabla\bar g(\theta_t)\| = 0$ almost surely. Theorem 2.10 thus proves convergence for non-convex $\bar g(\theta)$ even when $f(x, \theta)$ grows at most linearly in $\theta$. Theorem 2.10 is proven in Appendix B; the proof is based on the uniform bound for the moments of $\theta$ established in Appendix A.
Theorem 2.10 is required for proving a central limit theorem for non-convex $\bar g(\theta)$. The central limit theorem for non-convex $\bar g(\theta)$ is proven in Theorem 2.12.
Theorem 2.12 states that $\sqrt{t}(\theta_t - \theta^*) \xrightarrow{d} N(0, \Sigma)$, where $\Sigma$ is defined as in Theorem 2.8.
If $f(x, \theta)$ and $\nabla_\theta f(x, \theta)$ are uniformly bounded in $\theta$ and polynomially bounded in $x$, then Theorem 2.12 holds without Conditions 2.3 and 2.4.
Proposition 2.13 shows that, under certain conditions on the learning rate $\alpha_t$, the specific form assumed in Condition 2.5 is not necessary. In particular, the convergence rate and central limit theorem results can be proven for a general learning rate $\alpha_t$. The proof of Proposition 2.13 follows exactly the same steps as the proofs of Theorems 2.7, 2.8, and 2.12, albeit with more tedious algebra, and is omitted.
In particular, the statements of Theorems 2.7, 2.8, and 2.12 then hold with $\Sigma = (\Sigma_{i,j})_{i,j=1}^k$ as in (4.9), but with the bracket term modified accordingly. It is easy to check that if we use $\alpha_s = \frac{C_\alpha}{C_0 + s}$ as the learning rate, then the conditions appearing in Proposition 2.13 all hold provided $CC_\alpha > 1$.
We conclude this section by mentioning that, in the bounds appearing in the subsequent sections, $0 < K < \infty$ will denote an unimportant fixed constant (not depending on $t$ or other important parameters). The constant $K$ may change from line to line, but it will always be denoted by the same symbol $K$. Without loss of generality, and in order to simplify the notation in the proofs, we will let $C_0 = 0$, consider $t \ge 1$ (i.e., the initial time is set to $t = 1$), and let $\sigma$ be the identity matrix.
Theorem 2.7: L$^p$ Convergence Rate in the Strongly Convex Case

The proofs in this paper will repeatedly make use of two important uniform moment bounds. First, as we prove in Appendix A.1, $\sup_{t \ge 0} \mathbb{E}[\|\theta_t\|^p] < \infty$ for $p \ge 2$. Second, it is known from [12] that, under the imposed conditions on the $X$ process, the moments of $X_t$ are bounded uniformly in time. To begin the proof of the $L^p$ convergence rate, we rewrite the algorithm (1.2) for $\theta_t$ in terms of $g(x, \theta)$ and $\bar g(\theta)$.
A Taylor expansion yields an expansion in which $\theta_t^1$ is an appropriately chosen point on the segment connecting $\theta_t$ and $\theta^*$. Substituting this Taylor expansion into equation (3.1) produces an equation for $Y_t$. By Itô's formula, we then have, for $p \ge 2$, an evolution equation for $\|Y_t\|^p$. Using the strong convexity of $\bar g$, we obtain a corresponding inequality. Let us now define the process $M_t$ and notice that $M_t$ solves an SDE. Next, if we set $\Gamma_t = \|Y_t\|^p - M_t$, we obtain an equation for $\Gamma_t$. We then define the function $\Psi_{t,s}^{(p)}$, and the comparison principle gives a bound for $\Gamma_t$. The next step is to rewrite the second term of (3.3), i.e., $\Gamma_t^2$. We construct the corresponding Poisson equation and use its solution to analyze $\Gamma_t^2$. Define $G(x, \theta) = \langle \theta - \theta^*, \nabla_\theta \bar g(\theta) - \nabla_\theta g(x, \theta) \rangle$ and let $v(x, \theta)$ be the solution to the PDE $\mathcal{L}_x v(x, \theta) = G(x, \theta)$, where $\mathcal{L}_x$ is the infinitesimal generator of the $X$ process. Due to Theorem A.1, the solution of the Poisson PDE satisfies the bounds (3.4) for appropriate, but unimportant for our purposes, constants $m_1, m_2, m_3, m_4$. By Itô's formula, defining $v_t \equiv v(X_t, \theta_t)$, we obtain a semimartingale representation for $v_t$. Using this result, $\Gamma_t^2$ can be rewritten. Let us first rewrite the first term, $\Gamma_t^{2,1}$, by applying Itô's formula. We then have the representation (3.7) for $\Gamma_t^{2,1}$, where $\tilde M_t$ is a mean-zero, square-integrable stochastic integral with respect to Brownian motion (square integrability follows from the uniform moment bounds on the $X$ and $\theta$ processes).
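The mechanism by which the Poisson equation controls the fluctuation terms can be sketched as follows (a schematic outline consistent with the steps above, not a verbatim excerpt). Since $\mathcal{L}_x v = G$ with $\int_{\mathcal{X}} G(x,\theta)\,\pi(dx) = 0$, Itô's formula applied to $\alpha_t v(X_t,\theta_t)$ gives, schematically,
$$\int_1^t \alpha_s G(X_s,\theta_s)\,ds = \alpha_t v(X_t,\theta_t) - \alpha_1 v(X_1,\theta_1) - \int_1^t \dot\alpha_s v(X_s,\theta_s)\,ds - \int_1^t \alpha_s \nabla_x v(X_s,\theta_s)\,\sigma\,dW_s - \int_1^t \alpha_s \nabla_\theta v(X_s,\theta_s)\,d\theta_s - \text{(cross-variation terms)}.$$
For $\alpha_s = C_\alpha/s$ one has $\dot\alpha_s = -\alpha_s^2/C_\alpha$, and $d\theta_s$ itself carries a factor $\alpha_s$, so every term on the right-hand side other than the martingale part is of order $\alpha^2$; this is the sense in which the fluctuations are "controlled with rate $\alpha_t^2$", as mentioned in the introduction.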
Recall now that we want to evaluate $\mathbb{E}\|Y_t\|^p$. Recalling the definition of $\Gamma_t$ and taking expectations in (3.8), we obtain a bound on $\mathbb{E}\|Y_t\|^p$. Recalling that $\Psi_{t,1}^{(p)} = t^{-pCC_\alpha}$ and that $CC_\alpha > 1$, we get that $\Psi_{t,1}^{(p)} \le t^{-p}$. Hence we have obtained that the resulting inequality holds for any $p \ge 2$. The next step is to proceed by induction. Using the uniform moment bounds for $X$ and $\theta$, together with the polynomial growth of $v(x, \theta)$ and $\zeta(x, \theta)$, we obtain the statement for $p = 2$, where the unimportant constant $K$ may change from line to line. Hence the desired statement is true for $p = 2$. Next, we assume that it is true for exponent $p - 1$ and prove that it is then true for exponent $p$. Using Hölder's inequality with exponents $r_1, r_2 > 1$ such that $1/r_1 + 1/r_2 = 1$, choosing them appropriately, and putting the resulting estimates together, (3.9) gives the statement of Theorem 2.7 for integer $p \ge 2$. The statement for any $p \ge 1$ then follows from Hölder's inequality. This concludes the proof of the theorem.

Proof of the Central Limit Theorem in the Strongly Convex Case
To prove the central limit theorem, we use a second-order Taylor expansion in which $\theta_t^1$ is an appropriately chosen point on the segment connecting $\theta_t$ and $\theta^*$; this yields an evolution equation for $\theta_t$. Let $\Phi_{t,s}^* \in \mathbb{R}^{k \times k}$ be the fundamental solution of the associated linear equation, with $\Phi_{s,s}^* = I$, where $I$ is the identity matrix. As in Section 3, we set without loss of generality $C_0 = 0$ and assume that the initial time is $t = 1$. Then, $Y_t$ can be written in terms of $\Phi_{t,s}^*$ for $\tau \ge s$. Note that the columns of the matrix solution $\Phi_{\tau,s}^*$ evolve independently, which makes the analysis much simpler. Define $\Phi_{\tau,s}^{*,j}$ as the $j$-th column of $\Phi_{\tau,s}^*$ and $\Delta\bar g(\theta^*)_i$ as the $i$-th row of the matrix $\Delta\bar g(\theta^*)$. Consider the associated one-dimensional differential equation, where we use the strong convexity assumption; this yields a convergence rate, and changing variables again yields the convergence rate in the original time coordinate $t$. We recall some important properties of $\Phi_{t,s}^*$ (for reference, see Proposition 2.14 in [14]): $\Phi_{t,s}^*$ is differentiable in $t$ and satisfies the semi-group property. Here $v(x, \theta)$ satisfies bounds similar to the ones in (3.4). Following similar steps as in Section 3, the terms producing the limiting Gaussian random variable are now analyzed. Recalling the reduction to $\sigma = I$, we have bounds of the form $\left\|\frac{\partial^3 v}{\partial x \partial \theta^2}(x, \theta)\right\| \le K(1 + \|x\|^q + \|\theta\|)$; the latter bounds on $v(x, \theta)$ and its derivatives are from equation (3.4). These bounds imply that the function $H(x, \theta)$ and its derivatives have polynomial growth in $\|x\|$ and $\|\theta\|$. Based on Theorem A.1, the solution $w(x, \theta)$ to the PDE $\mathcal{L}_x w(x, \theta) = H(x, \theta)$ and its derivatives will also have at most polynomial growth in $\|x\|$ and $\|\theta\|$.
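The fundamental solution and its semi-group property can be illustrated numerically (an illustrative sketch, consistent with the linearization above but not from the paper): for $d\Phi^*_{t,s}/dt = -\alpha_t\,A\,\Phi^*_{t,s}$ with $\alpha_t = C_\alpha/t$ and a constant Hessian $A = \Delta\bar g(\theta^*)$, an Euler product approximation reproduces the semi-group identity $\Phi^*_{t,s} = \Phi^*_{t,\tau}\Phi^*_{\tau,s}$ up to discretization error, and matches the closed form $\Phi^*_{t,s} = \exp\!\left(-A\int_s^t \alpha_u\,du\right) = U\,\mathrm{diag}\!\left((s/t)^{C_\alpha \lambda_i}\right)U^\top$ obtained from the eigendecomposition $A = U\Lambda U^\top$. The matrix $A$ below is an arbitrary illustrative choice.

```python
import numpy as np

def phi_euler(A, s, t, C_alpha=2.0, dt=1e-4):
    """Euler product approximation of the fundamental solution Phi*_{t,s}
    solving dPhi/dt = -(C_alpha / t) * A @ Phi, with Phi_{s,s} = I."""
    k = A.shape[0]
    phi = np.eye(k)
    u = s
    while u < t - 1e-12:
        h = min(dt, t - u)
        phi = (np.eye(k) - h * (C_alpha / u) * A) @ phi
        u += h
    return phi

A = np.array([[2.0, 0.5], [0.5, 1.0]])   # a symmetric positive definite "Hessian"
s, tau, t = 1.0, 2.0, 4.0

# Semi-group property: Phi_{t,s} ~ Phi_{t,tau} @ Phi_{tau,s}
lhs = phi_euler(A, s, t)
rhs = phi_euler(A, tau, t) @ phi_euler(A, s, tau)

# Closed form via eigendecomposition (valid here since A is constant):
lam, U = np.linalg.eigh(A)
closed = U @ np.diag((s / t) ** (2.0 * lam)) @ U.T
```

Note that the closed form is special to a constant coefficient matrix; in the proof, $\Phi^*_{t,s}$ is handled through its differentiability in $t$ and the semi-group property rather than an explicit formula.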
We will prove that $\sqrt{t}\,\Gamma_t^{3,2} + \sqrt{t}\,\Gamma_t^4 \xrightarrow{d} N(0, \Sigma)$ for the appropriate limiting variance-covariance matrix $\Sigma$. The proof relies upon the Poisson partial differential equation approach, using the at most polynomial growth of its solution and its derivatives, together with the uniform boundedness of the moments of the $X_t$ and $\theta_t$ processes, to analyze the rate of convergence.
The quadratic covariation matrix of (4.8) is denoted by $\Sigma_t$; it is necessary to show that $\Sigma_t \xrightarrow{p} \Sigma$ as $t \to \infty$. To begin, we show a simpler limit. Consider the process $\tilde\Sigma_t$; it will be proven now that $\tilde\Sigma_t$ converges to a limit $\Sigma$ as $t \to \infty$. Recall that $\Delta\bar g(\theta^*)$ is both symmetric and strictly positive definite. Therefore, by the eigenvalue decomposition, we can write $\Delta\bar g(\theta^*) = U \Lambda U^\top$. Hence, the $(i,j)$-th element of the matrix $\tilde\Sigma_t$ takes an explicit form, and the $(i,j)$-th element of the limiting quadratic covariation matrix is $\Sigma_{i,j} = \lim_{t \to \infty} \tilde\Sigma_{t,i,j}$. It remains to show that $\mathbb{E}\|\Sigma_t - \tilde\Sigma_t\|_1 \to 0$ as $t \to \infty$. If this is true, then the triangle inequality yields $\mathbb{E}\|\Sigma_t - \Sigma\|_1 \to 0$, which would imply that $\Sigma_t \xrightarrow{p} \Sigma$.
To prove that $\mathbb{E}\|\Sigma_t - \tilde\Sigma_t\|_1 \to 0$, we begin by defining an intermediate process $V_t$. By the triangle inequality, $\|\Sigma_t - \tilde\Sigma_t\|_1 \le \|\Sigma_t - V_t\|_1 + \|V_t - \tilde\Sigma_t\|_1$. We first address the second term. The $(i,j)$-th element of the matrix $V_t - \tilde\Sigma_t$ involves a point $\theta_s^1$ appropriately chosen on the segment connecting $\theta_s$ and $\theta^*$. Recall now that $v$ and its derivatives can grow at most polynomially in $\|x\|$ and $\|\theta\|$; as a result of these specific growth rates, the relevant integrands are controlled. In addition, $\mathbb{E}[\|\theta_s - \theta^*\|^2] \le \frac{K}{s}$ from the convergence rate in Section 3. Using these facts, the uniform-in-time moment bounds on $\theta_t$, and the Cauchy–Schwarz inequality, we conclude that $\|V_t - \tilde\Sigma_t\|_1 \to 0$ as $t \to \infty$. Now, let us address $\|\Sigma_t - V_t\|_1$ using the Poisson equation method.
The $(i,j)$-th element of the matrix $\Sigma_t - V_t$ is analyzed as follows. Using Itô's formula, the bounds (A.6) on the solution of the Poisson equation $\mathcal{L}_x w = H$, the moment bounds on $X_t$ and $\theta_t$, and the Itô isometry, it can be shown that $\mathbb{E}\|\Sigma_t - V_t\|_1 \to 0$. Combining these results and using the triangle inequality, we have the desired result $\Sigma_t \xrightarrow{p} \Sigma$ as $t \to \infty$. The convergence in probability of the quadratic variation $\Sigma_t$ for equation (4.8) implies that (4.8) converges in distribution to a mean-zero normal random variable with covariance $\Sigma$ (see Section 1.2.2 in [15]). Combining all of the results yields the central limit theorem, which is our desired result.
Proof of the Central Limit Theorem in the Non-Convex Case

From Theorem 2.10, we know that $\nabla\bar g(\theta_t) \to 0$ almost surely. Under the imposed conditions, this implies that either $\|\theta_t - \theta^*\| \to 0$ or $\|\theta_t\| \to \infty$. We must therefore first show that $\theta_t$ remains finite almost surely.
The parameter $\theta_t$ evolves according to equation (5.1). Recall the assumptions of Condition 2.11. If the $i$-th coordinate diverges, then either $\theta_{t,i}(\omega) \to +\infty$ or $\theta_{t,i}(\omega) \to -\infty$, since $\theta_t$ has continuous paths (i.e., an oscillating divergent sequence such as $(-2)^n$ cannot occur). Next, note that the second and third integrals in equation (5.1) converge to finite random variables almost surely. Suppose that $\theta_{t,i}(\omega) \to +\infty$. This implies that there exists a $T(\omega)$ such that $\theta_{t,i} > \theta_i^*$ for all $t \ge T(\omega)$. However, $\int_{T(\omega)}^t -\alpha_s \nabla_{\theta_i} \bar g(\theta_s)\,ds < 0$. This, combined with the fact that the second and third terms in (5.1) converge to finite random variables almost surely, proves that $\theta_{t,i} < +\infty$ with probability one. A similar argument shows that $\theta_{t,i} > -\infty$ with probability one. Therefore, $|\theta_{t,i} - \theta_i^*| \to 0$ almost surely, i.e., $\theta_t \xrightarrow{a.s.} \theta^*$.
We will also later make use of the fact that $\Phi_{t,s}$ satisfies the semi-group property $\Phi_{t,s} = \Phi_{t,\tau} \Phi_{\tau,s}$ (for reference, see Proposition 2.14 in [14]). Letting $Y_t = \theta_t - \theta^*$, we obtain a decomposition of $Y_t$; the first term, $\Gamma_t^1$, is analyzed below.
Proof. For $t \ge \tau_\delta$ we have a representation in which the solution $v(t, x)$ and its relevant partial derivatives grow at most polynomially in $\|x\|$ and linearly in $\|\theta\|$, due to the assumptions of Theorem 2.12. Itô's formula then yields the representation used below. In order to analyze the terms involved in $\Gamma_t^2$ and $\Gamma_t^3$, we need some intermediate results, which we state now; for presentation purposes, the proofs of these lemmas are deferred to the end of this section.

Lemma 5.2. Let $\zeta(x, \theta)$ be a (potentially matrix-valued) function that can grow at most polynomially in $\|x\|$ and $\|\theta\|$. Then, the quantities $I_t^1$ and $I_t^2$ defined in the lemma converge in probability to zero as $t \to \infty$.

To see this, it is enough to notice that all of the relevant terms take the form of the quantities $I_t^1$ and $I_t^2$ appearing in Lemma 5.2. The term $\sqrt{t}\,\Gamma_t^{2,1}$ also converges almost surely to zero: first rewrite it using Itô's formula, as was done for the corresponding term in Section 3 (refer to (3.7) with $p = 2$ and replace $\Psi_{t,s}^{(p)}$ by $\Phi_{t,s}$), and then use Lemma 5.2 again.
The limiting Gaussian random variable will be produced by $\Gamma_t^{2,2}$ and $\Gamma_t^3$. Therefore, it remains to analyze these terms. Using the results from [15], it is sufficient to prove the convergence in probability of the corresponding quadratic covariation to a deterministic quantity. We recall here the definition of $h(x, \theta)$. As before, let $\Phi_{t,s}^*$ be the solution of the linearized equation, and recall that $\Phi_{t,s}^*$ satisfies the bound stated earlier. Define $\tilde\Sigma_t$ and $\Sigma_t^*$ accordingly. Note that we already proved in the previous section that $\Sigma_t^*$ converges in probability to $\Sigma$ as $t \to \infty$. We would like to show that $\tilde\Sigma_t - \Sigma_t^* \to 0$ as $t \to \infty$, almost surely.
To prove that $\Sigma_t - \tilde\Sigma_t \to 0$, we begin by defining an intermediate process $V_t$. By the triangle inequality, $\|\Sigma_t - \tilde\Sigma_t\| \le \|\Sigma_t - V_t\| + \|V_t - \tilde\Sigma_t\|$. We then have the following lemmas.
By the triangle inequality, combining Lemmas 5.3, 5.4, and 5.5 gives $\|\Sigma_t - \Sigma\| \xrightarrow{p} 0$ as $t \to \infty$. Therefore, using the results from [15] and combining all of the above, we obtain the central limit theorem, which is our desired result.

Proofs of Lemmas 5.2-5.5
In this subsection we give the proofs of the lemmas that were used in the proof of the central limit theorem for the non-convex case. First, we need an intermediate result to properly handle the convergence to zero of multidimensional stochastic integrals. Such a result should be standard in the literature, but because we did not manage to locate an appropriate statement, we present Lemma 5.6.
Lemma 5.6. Let $Z_t = \int_1^t b(t, s, X_s, \theta_s)\,dW_s$. Let $p \in \mathbb{N}$ be a given integer, and consider a constant $c > \frac{p-1}{2}$ and a matrix $E$ with $EE^\top$ positive definite, such that the stated conditions hold.

Proof of Lemma 5.6. Let $\eta > 0$ be arbitrarily chosen and construct the corresponding random variable. From Section 1.2.2 in [15], we obtain the inequality (5.3). For each fixed $\eta > 0$, the RHS of the inequality (5.3) converges to a finite quantity as $t \to \infty$, due to the continuous mapping theorem and the convergence in distribution of $Z_{t,i} + \tilde Z_{t,i}$ and $\tilde Z_t$. Furthermore, the limit of the RHS can be made arbitrarily small by choosing a sufficiently small $\eta$. Therefore, for any $\delta > 0$, there exists an $\eta > 0$ such that the desired bound holds.

Proof of Lemma 5.2. Let us first prove the first statement of the lemma. Without loss of generality, let $t \ge \tau_\delta$. To begin, divide the time interval into the two regimes $[1, \tau_\delta \wedge t]$ and $[\tau_\delta \wedge t, \tau_\delta \vee t]$. Let us first study the second term.
Let us now define the quantity $L_t$, and notice that with probability one the relevant limit superior is controlled by $L_t$. For $\varepsilon > 0$, consider the event $A_{t,\varepsilon} = \{L_t \ge t^{\varepsilon - 1/2}\}$. Using the uniform-in-time bounds for the moments of $X_s$ and $\theta_s$, Markov's inequality, and the fact that $CC_\alpha > 1$, we obtain a bound which is summable along the times $t = 2^n$. Due to the Borel–Cantelli lemma, this guarantees the existence of a finite positive random variable $d(\omega)$ and of some $n_0 < \infty$ such that the bound holds for every $n \ge n_0$; the bound then extends to any $t \in [2^n, 2^{n+1}]$ with $n \ge n_0$. Therefore, $L_t \xrightarrow{a.s.} 0$. Next, we analyze the first term on the RHS of equation (5.4). If $t > \tau_\delta$, the semi-group property applies. The constant $C(\tau_\delta)$ is almost surely finite since $\mathbb{P}[\tau_\delta < \infty] = 1$ and because $X_s$ and $\theta_s$ are almost surely finite for $s \le \tau_\delta$. Therefore, using the constraint $CC_\alpha > 1$, we obtain the desired bound. Combining these results, the integral (5.4) converges to 0 almost surely as $t \to \infty$.
Next, we prove the second statement of the lemma, where we shall use Lemma 5.6. In the notation of Lemma 5.6:
Proceeding as in the first part of the lemma shows that, for each index $i, j$, the corresponding element of the matrix $I_t^2$ goes to zero. To complete the proof, we then need to use Lemma 5.6. Let us define $D_t$. By Lemma 5.6, if we show that, for appropriate choices of $p$, $c$, and $E$, $D_t$ goes to zero in probability as $t \to \infty$, then we will have shown that the second statement of the lemma holds, i.e., that $I_t^2 \to 0$ in probability as $t \to \infty$.
We pick $p = 2$, $c = 2CC_\alpha$, and $E = I$, the identity matrix. Let us first start with $D_t^2$. For the $(i,j)$ element of the matrix, the analysis is from this point on identical to the analysis of $I_t^{1,2}$. In particular, define the quantity $\hat L_t$ and, for $\varepsilon > 0$, consider the event $\hat A_{t,\varepsilon} = \{\hat L_t \ge t^{\varepsilon - 1}\}$. Using Markov's inequality and the uniform-in-time bounds for the moments of $X_s$ and $\theta_s$, we obtain the analogous estimate. From here on, the rest of the argument follows the Borel–Cantelli argument that was used for the proof of the first part of the lemma. This yields that $D_t^2 \xrightarrow{a.s.} 0$ as $t \to \infty$. Next, using the semi-group property and for $\tau_\delta < t$, $D_t^1$ can be rewritten; by similar logic as for the term $I_t^{1,1}$, we obtain a bound in which $C(\tau_\delta)$ is almost surely finite. Therefore, since $CC_\alpha > 1$, $D_t^1 \xrightarrow{a.s.} 0$ as $t \to \infty$. Consequently, as $t \to \infty$, we have indeed obtained $D_t \to 0$, which by Lemma 5.6 implies that the second statement of the lemma is true. This concludes the proof of the lemma.
Proof of Lemma 5.3. Φ_{t,s} can be expressed in terms of Φ*_{t,s}. To see this, first perform a Taylor expansion, in which the terms C_t are the third-order partial derivatives of ḡ(θ); under the imposed assumptions they are uniformly bounded. Recall that there exists an almost surely finite random time τ_δ such that ‖Y_t‖ < δ for all t ≥ τ_δ and any δ small enough. From the bounds on Φ_{t,s} and Φ*_{t,s}, we obtain an estimate valid for t > s > τ_δ with δ small enough; we have used the Cauchy-Schwarz inequality in the first inequality. Now consider the case where t > τ_δ but s < τ_δ.
Here C(τ_δ) is almost surely finite since τ_δ is almost surely finite, and the third term in (5.5) is handled similarly. Using (5.5) and the bound (5.6), we obtain, with probability one, a bound implying that lim_{t→∞} ‖Σ_t − Σ*_t‖ = 0, since δ can be taken as small as we want. This concludes the proof of the lemma.
Proof of Lemma 5.4.
Using the semigroup property, we write out the (i, j)-th element of the matrix V_t − Σ_t. By the Cauchy-Schwarz inequality and the bound ‖∇_θ h(θ)‖ ≤ K(1 + ‖θ‖), we obtain the estimate (5.7). We also have a companion estimate (5.8); since τ_δ < ∞ with probability 1, C(τ_δ) is also finite with probability 1. Combining the results from equations (5.7) and (5.8), we conclude that lim sup_{t→∞} ‖V_t − Σ_t‖ ≤ Kδ^2. Since δ is arbitrarily small, (V_t − Σ_t)_{i,j} → 0 almost surely, concluding the proof of the lemma.
Proof of Lemma 5.5. Notice that we have (5.9) and, consequently, an expression for the (i, j)-th element of the matrix Σ_t − V_t. The solution w_{n,k′}(t, θ) and its relevant partial derivatives grow at most polynomially with respect to θ and x due to the assumptions of Theorem 2.12. The next step is to rewrite the difference (V_t − Σ_t)_{i,j} using that Poisson equation and Itô's formula, and then to show that each term on the right-hand side of the resulting equation goes to zero.
For example, we obtain an expression in which (∗) is a collection of Riemann integrals resulting from the application of Itô's formula. Each of these terms can be shown to go to zero by an argument exactly parallel to that of Lemma 5.2. Due to the similarity of the argument, the details are omitted.

Convergence Analysis
The central limit theorem provides an important theoretical guarantee for the performance of the SGDCT algorithm developed in [3]. Theorem 2.12 is particularly significant since it shows that the asymptotic convergence rate of t^{−1/2} holds even for a certain class of non-convex models. This matters because many models of practical interest are non-convex.
In addition, the analysis yields insight into the behavior of the algorithm and provides guidance on selecting the learning rate for good numerical performance. The regime in which the central limit theorem holds with the optimal √t rescaling is C_α C > 1, where C_α is the magnitude of the learning rate; for example, take α_t = C_α/(C_0 + t). Therefore, the learning rate magnitude must be chosen sufficiently large in order to achieve the optimal rate of convergence. The larger the constant C, the steeper the function ḡ(θ) is around the global minimum θ*; the smaller the constant C, the flatter ḡ(θ) is around θ*. The flatter the region around the global minimum, the larger the learning rate magnitude must be; if that region is steep, the learning rate magnitude can be smaller.
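As a concrete illustration of the algorithm with the learning rate α_t = C_α/(C_0 + t), the following is a minimal Euler-Maruyama sketch (not from the paper) that applies the SGDCT update to estimate the drift parameter of a hypothetical Ornstein-Uhlenbeck process dX_t = −θ* X_t dt + σ dW_t, with model drift f(x, θ) = −θx. All numerical values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data-generating model: dX_t = -theta_star * X_t dt + sigma dW_t (illustrative).
theta_star, sigma = 2.0, 1.0
# Learning rate alpha_t = C_alpha / (C_0 + t), as in the text.
C_alpha, C_0 = 10.0, 1.0

dt, T = 0.05, 2000.0
n = int(T / dt)

X, theta = 0.0, 0.0
for i in range(n):
    t = i * dt
    dW = np.sqrt(dt) * rng.standard_normal()
    dX = -theta_star * X * dt + sigma * dW   # observed increment of the data stream
    f = -theta * X                           # model drift f(x, theta)
    grad_f = -X                              # d f / d theta
    alpha = C_alpha / (C_0 + t)
    theta += alpha * grad_f * (dX - f * dt)  # SGDCT (noisy) descent step
    X += dX

print(theta)  # should be close to theta_star = 2
```

The update θ_{t+dt} = θ_t + α_t ∇_θ f(X_t, θ_t)(dX_t − f(X_t, θ_t) dt) is the discretized form of the continuous-time descent direction; the long horizon T is needed because the decaying learning rate slows the later progress.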
The condition C_α C > 1 for ensuring the convergence rate t^{−1/2} is not specific to the SGDCT algorithm, but is in general a characteristic of continuous-time statistical learning algorithms: the convergence rate of any continuous-time gradient descent algorithm with a decaying learning rate will depend upon the learning rate magnitude C_α. Consider the deterministic gradient descent algorithm θ̇_t = −α_t ∇ḡ(θ_t). Let α_t = C_α/(C_0 + t) and assume ḡ(θ) is strongly convex with convexity constant C. Then ‖θ_t − θ*‖ ≤ K (C_0 + t)^{−C_α C}, and the convergence rate depends entirely upon the choice of the learning rate magnitude C_α. If C_α is very small, the deterministic gradient descent algorithm will converge at a rate much slower than t^{−1/2}. The rate t^{−1/2} is the fastest possible given that the noise in the system (1.1) is a Brownian motion, since the quadratic variation of a Brownian motion grows linearly in time. However, other types of noise whose variances grow sub-linearly in time could allow for a rate of convergence faster than t^{−1/2}; an example of a stochastic process whose variance grows sub-linearly in time is a fractional Brownian motion with an appropriately chosen Hurst parameter. Analyzing the convergence rate under more general types of noise would be a very interesting topic for future research.
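The dependence of the deterministic rate on C_α can be checked directly. The sketch below (an illustration, not taken from the paper) uses the closed-form solution of the gradient flow for the quadratic objective ḡ(θ) = (C/2)(θ − θ*)^2 and measures the empirical decay exponent of the error for a small and a large C_α.

```python
import numpy as np

# For g(theta) = (C/2)*(theta - theta_star)**2, the gradient flow
#   d theta/dt = -alpha_t * C * (theta - theta_star),  alpha_t = C_alpha/(C_0 + t),
# solves exactly to
#   |theta_t - theta_star| = |theta_0 - theta_star| * (C_0/(C_0 + t))**(C_alpha*C).

C, C_0, theta_0, theta_star = 1.0, 1.0, 2.0, 0.0

def error(t, C_alpha):
    return abs(theta_0 - theta_star) * (C_0 / (C_0 + t)) ** (C_alpha * C)

for C_alpha in (0.25, 2.0):
    e1, e2 = error(100.0, C_alpha), error(10000.0, C_alpha)
    # empirical decay exponent measured over two decades of t
    slope = (np.log(e2) - np.log(e1)) / (np.log(10000.0) - np.log(100.0))
    print(C_alpha, round(slope, 2))  # exponent is approximately -C_alpha*C
```

With C_α = 0.25 the error decays like t^{−0.25}, far slower than t^{−1/2}, while C_α = 2 gives t^{−2}: the rate is set entirely by C_α C, matching the discussion above.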
In the central limit theorem result, we are also able to precisely characterize the asymptotic covariance Σ = (Σ_{i,j}). The covariance depends upon the eigenvalues and eigenvectors of the matrix ∆ḡ(θ*), the Hessian of ḡ(θ) at the global minimum θ*. The larger the eigenvalues, the smaller the variance: the steeper the function ḡ(θ) is near θ*, the smaller the asymptotic variance, and the flatter ḡ(θ) is near the global minimum, the larger the asymptotic variance. If the function is very flat, the drift of θ_t towards θ* is dominated by the fluctuations from the noise W_t. The covariance also depends upon the learning rate magnitude C_α: the larger the learning rate magnitude, the larger the asymptotic variance Σ. Although a sufficiently large learning rate is required to achieve the optimal convergence rate t^{−1/2}, too large a learning rate will cause high variance.
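These qualitative effects can be seen in the classical one-dimensional Robbins-Monro variance formula Σ = C_α² σ² / (2 C C_α − 1), valid when C C_α > 1, where C plays the role of the (scalar) Hessian eigenvalue. This scalar formula is a standard specialization used here purely for illustration; the paper's Σ is matrix-valued.

```python
# One-dimensional illustration (standard scalar formula, not the paper's
# matrix-valued Sigma): Sigma = C_alpha**2 * sigma**2 / (2*C*C_alpha - 1).

sigma = 1.0

def asymptotic_variance(C, C_alpha):
    # The CLT with rate sqrt(t) requires C * C_alpha > 1.
    assert C * C_alpha > 1, "optimal rate requires C_alpha * C > 1"
    return C_alpha**2 * sigma**2 / (2 * C * C_alpha - 1)

# Steeper minimum (larger Hessian eigenvalue C) -> smaller variance:
print(asymptotic_variance(C=1.0, C_alpha=2.0))
print(asymptotic_variance(C=4.0, C_alpha=2.0))

# Larger learning-rate magnitude C_alpha -> larger variance:
print(asymptotic_variance(C=1.0, C_alpha=8.0))
```

Evaluating the formula confirms both monotonicities claimed in the text: the variance decreases in C and, within the valid regime C C_α > 1, increases in C_α.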

Conclusion
Stochastic gradient descent in continuous time (SGDCT) provides a computationally efficient method for the statistical learning of continuous-time models, which are widely used in science, engineering, and finance. The SGDCT algorithm follows a (noisy) descent direction along a continuous stream of data, and the parameter updates satisfy a stochastic differential equation. This paper analyzes the asymptotic convergence rate of the SGDCT algorithm by proving a central limit theorem. An L^p convergence rate is also proven for the algorithm.
In addition to a theoretical guarantee, the convergence rate analysis provides important insights into the behavior and dynamics of the algorithm. The asymptotic covariance is precisely characterized and shows the effects of different features such as the learning rate, the level of noise, and the shape of the objective function.
The proofs in this paper require addressing several challenges. First, fluctuations of the form ∫_0^t α_s (h(X_s, θ_s) − h̄(θ_s)) ds must be analyzed. We evaluate these fluctuations, and control them with rate α_t^2, using a Poisson partial differential equation. Second, the model f(x, θ) is allowed to grow with θ, which means that the fluctuations, as well as other terms, can grow with θ; therefore, we must prove an a priori stability estimate for θ_t. Proving a central limit theorem for non-convex ḡ(θ) in Theorem 2.12 is challenging since the convergence speed of θ_t can become arbitrarily slow in certain regions, and the gradient can even point away from the global minimum θ*. We prove the central limit theorem in the non-convex case by analyzing the two regimes [0, τ_δ] and [τ_δ, ∞), where τ_δ is defined such that ‖θ_t − θ*‖ < δ for all t ≥ τ_δ. The proof also requires the analysis of stochastic integrals with anticipative integrands, which is challenging since standard approaches (such as the Itô isometry) cannot be directly applied.

A Preliminary Estimates
This section presents two key bounds that are used throughout the paper. Section A.1 proves uniform-in-time moment bounds for θ_t; that is, we prove that E[‖θ_t‖^p] is bounded uniformly in time. Section A.2 presents a bound on the solutions of a class of Poisson partial differential equations. In the paper, we relate certain terms to the solution of such a Poisson equation and then apply this bound.
where κ(x) is from Condition 2.3. Then we similarly obtain the corresponding bound for ‖θ‖ > 0 and for θ̄_t. Next, we address the term √t Γ_t^3. We construct the corresponding Poisson equation and use its solution to analyze Γ_t^3. Define G(x, θ) = ∇_θ ḡ(θ) − ∇_θ g(x, θ) and let v(x, θ) be the solution of the PDE L_x v(x, θ) = G(x, θ). Proceeding in a fashion similar to equation (3.5) (but now with the different function v), we write
one needs to control terms of the form ∫_0^t α_s (∇ḡ(θ_s) − ∇g(X_s, θ_s)) ds. Due to the ergodicity of the X process, one expects such terms to be small in magnitude and to go to zero as t → ∞; however, the speed at which they go to zero is what matters here. We treat such terms by rewriting them equivalently using appropriate Poisson-type partial differential equations (PDEs). Conditions 2.1, 2.2 and 2.6 guarantee that these Poisson equations have unique solutions that grow no faster than polynomially in the x and θ variables (see Theorem A.1 in Appendix A).
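The Poisson-equation rewriting described above can be sketched schematically. The display below is a standard identity (with θ-derivative and cross-variation terms suppressed into (∗)), written here under the assumption that v solves L_x v = ∇ḡ − ∇g; it is an illustration of the technique, not a line from the paper's proof.

```latex
% If L_x v(x,\theta) = \nabla\bar g(\theta) - \nabla g(x,\theta), then applying
% Ito's formula to \alpha_t v(X_t,\theta_t) and integrating yields
\int_0^t \alpha_s \big( \nabla\bar g(\theta_s) - \nabla g(X_s,\theta_s) \big)\, ds
  = \alpha_t v(X_t,\theta_t) - \alpha_0 v(X_0,\theta_0)
  - \int_0^t \alpha_s' \, v(X_s,\theta_s)\, ds
  - \int_0^t \alpha_s \, \nabla_x v(X_s,\theta_s)\, \sigma(X_s)\, dW_s
  - (\ast).
% Each right-hand-side term is either O(\alpha_t), since v grows at most
% polynomially and the moments of X_t, \theta_t are bounded, or a martingale
% whose quadratic variation decays with \alpha_s^2.
```

This is why the polynomial growth of v (Theorem A.1) and the uniform moment bounds are exactly the ingredients needed to quantify the speed at which these fluctuation terms vanish.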
(3.7) Now we are ready to put things together. Equation (3.2) with p − 2 in place of p is used to evaluate the second-to-last term of (3.7), and similarly for the quadratic covariation term d⟨‖Y_s‖^{p−2}, v_s⟩ in the last term of (3.7). Plugging (3.7) into (3.6), and that into (3.3), we find that there is an unimportant constant K < ∞ large enough and a matrix-valued function ζ(x, θ), growing at most polynomially in x and θ, such that the desired estimate holds. We then notice that a time transformation as in (4.3) allows us to rewrite the bound in terms of τ(t).
Due to Conditions 2.3 and 2.4, and the continuity of the involved drift and diffusion coefficients for ‖θ‖, ‖θ̄‖ > R > 0, we may use the comparison theorem (see, for example, [5]) to obtain that
P(‖θ_t‖ ≤ ‖θ̄_t‖, t ≥ 0) = 1. (A.1)
It is easy to see that the proof of the comparison theorem, Theorem 1.1 of [5], goes through almost verbatim despite the presence of the term λ(x) in Condition 2.4; the reason is that |λ(x)| is assumed to have at most polynomial growth in x, and all moments of X_t are bounded uniformly in t. Now notice that θ̄_t can be written as the solution of the integral equation
θ̄_t = θ̄_0 e^{−∫_0^t α_s κ(X_s) ds} + ∫_0^t α_s e^{−∫_s^t α_r κ(X_r) dr} ∇_θ f(X_s, θ̄_s) dW_s.
From this representation, recalling that κ(x) is almost surely positive, we obtain for any p ≥ 1 that
E‖θ̄_t‖^{2p} ≤ E‖θ̄_0‖^{2p} + E |∫_0^t α_s e^{−∫_s^t α_r κ(X_r) dr} ∇_θ f(X_s, θ̄_s) dW_s|^{2p}
≤ E‖θ̄_0‖^{2p} + K E (∫_0^t α_s^2 e^{−2∫_s^t α_r κ(X_r) dr} ‖∇_θ f(X_s, θ̄_s)‖^2 ds)^p,
where the unimportant finite constant K < ∞ changes from line to line. Hence, Grönwall's lemma immediately gives that for any p ≥ 1 there exists a finite constant K < ∞ such that
sup_{t>0} E‖θ̄_t‖^{2p} ≤ K. (A.2)