A Concentration Bound for Stochastic Approximation via Alekseev's Formula

Given an ODE and its perturbation, Alekseev's formula expresses the solutions of the latter in terms of the solutions of the former. By exploiting this formula and a new concentration inequality for martingale differences, we develop a novel approach for analyzing nonlinear Stochastic Approximation (SA). This approach is useful for studying an SA method's behaviour close to a Locally Asymptotically Stable Equilibrium (LASE) of its limiting ODE; this LASE need not be the limiting ODE's only attractor. As an application, we obtain a new concentration bound for nonlinear SA. That is, given $\epsilon>0$ and that the current iterate is in a neighbourhood of a LASE, we provide an estimate for i.) the time required to hit the $\epsilon$-ball of this LASE, and ii.) the probability that after this time the iterates are indeed within this $\epsilon$-ball and stay there thereafter. The latter estimate can also be viewed as the `lock-in' probability. Compared to related results, our concentration bound is tighter and holds under significantly weaker assumptions. In particular, our bound applies even when the stepsizes are not square-summable. Despite the weaker hypothesis, we show that the celebrated Kushner-Clark lemma continues to hold.

1. Introduction. Stochastic Approximation (SA), first introduced in [32], refers to recursive methods that can be used to find optimal points or zeros of a function given only its noisy estimates. It is extremely popular in application areas such as adaptive signal processing, adaptive resource allocation, artificial intelligence, etc. Due to the stochastic nature of these methods, the analysis of their convergence and convergence rates is challenging. For generic noise settings, the most powerful analysis tool has been the Ordinary Differential Equation (ODE) approach. Its idea is to show that the noise effects average out, so that the asymptotic behavior of an SA method is determined by that of a suitable deterministic ODE, often referred to as the limiting ODE. For more details on the above, see [7, 6, 12, 15, 26, 5].
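To fix intuition, the averaging behaviour described above can be sketched numerically. The drift, stepsize exponent, and noise model below are illustrative choices only, not the ones analyzed in this paper.

```python
import numpy as np

# Illustrative sketch of an SA run x_{n+1} = x_n + a_n * (h(x_n) + M_{n+1}).
# The drift h, the stepsize exponent, and the noise model are all toy choices.
rng = np.random.default_rng(0)

def h(x):
    # Drift whose limiting ODE xdot = h(x) has x* = 1.0 as a stable equilibrium.
    return -(x - 1.0)

x = 5.0
for n in range(20000):
    a_n = 1.0 / (n + 1) ** 0.4        # mu = 0.4: a_n -> 0, but sum a_n^2 diverges
    M = rng.normal(scale=0.5)         # martingale-difference noise
    x += a_n * (h(x) + M)
```

With the decreasing stepsizes, the noise averages out and the iterate tracks the flow of the limiting ODE toward the equilibrium at 1.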
Here we analyze the behaviour of a nonlinear SA method close to a Locally Asymptotically Stable Equilibrium (LASE) of its limiting ODE.¹ In particular, we obtain a novel concentration bound for nonlinear SA methods. That is, given ε > 0 and that the current iterate is in a neighbourhood within the domain of attraction of a LASE, we provide estimates on i.) the time required to hit the ε-ball of this LASE, and ii.) the probability that after this time the iterates are indeed within this ε-ball and remain there thereafter. Since staying within the ε-ball of a LASE from some time on implies that the iterates will eventually converge to this equilibrium, the above probability estimate can also be viewed as an estimate on the so-called 'lock-in' probability [2], [7, Chapter 4]. Similar concentration bounds are already available in the literature [7, 21] in the context of generic attractors. Compared to these, our bound is stronger, albeit restricted to the important special case of a LASE. We achieve the tighter bound by using a finer analysis which strongly exploits the behaviour of ODE solutions near a LASE.
In case of multiple stable attractors, SA methods have a positive probability of convergence to any of them [3], [2,Chapter 3], [5,Proposition 7.5]. Thus one cannot expect willful convergence to a specific equilibrium except in some special cases; e.g., stochastic gradient schemes controlled by addition of slowly decreasing noise [19]. Hence an important first step is to estimate the probability of convergence to an attractor given that the iterate is currently in its domain of attraction. The idea is that in such a situation the SA method will converge to the said attractor with high probability because the mean dynamics, as captured by the limiting ODE, favors it. This in fact is the basis for Arthur's models of increasing returns in economics [2]. To make the above qualitative (or descriptive) observation useful (or prescriptive) by giving it some predictive power, it is essential that those probabilities, the so called trapping or 'lock-in' probabilities [2], be estimated. This is what this work, and also [7,21], attempt to do. This is also related in spirit to the extensively studied phenomenon of metastability in statistical physics wherein a statistical mechanical system spends a long time near a stable minimum of its governing energy function other than its global minimum or ground state [9]; this would be the case, e.g., if we worked with constant stepsize SA methods instead of decreasing stepsizes.
¹A LASE is an equilibrium that is Liapunov stable in the following sense: given an η > 0, there exists a δ > 0 such that any trajectory of the ODE initiated within δ distance from this equilibrium remains within η distance thereof; furthermore, there is an open neighbourhood such that any trajectory initiated therein converges to this equilibrium. This neighbourhood is called the domain of attraction.
In addition to our concentration bound, and more importantly, we provide
here a novel approach for the analysis of nonlinear SA. The main ingredient of our approach is Alekseev's formula [1]; an English account can be found in [10]. This formula extends the variation of constants formula [27] to nonlinear settings. That is, given two nonlinear ODEs where one can be treated as a perturbation of the other, Alekseev's formula gives an explicit expression for the difference between the solutions of these two ODEs. The other ingredient of our approach is a novel concentration inequality for a sum of martingale differences that we prove separately; see Theorem A.2 in the Appendix. This result is a generalization of [28, Theorem 1.1].
As remarked above, the concentration bounds in [7, Chapter 4, Corollary 14] and [21, Theorem 12] are for generic attractors. But by taking the generic attractor to be a LASE, these results can be put in a form comparable to our result. It can then be seen that our bound is tighter and holds under significantly weaker assumptions on the stepsize and the noise sequence, but under a stronger regularity requirement (twice continuous differentiability) on the drift; see Section 2. In particular, our bound holds for a larger choice of stepsizes, e.g., 1/(n+1)^µ, µ ∈ (0, 1], while the previous ones only apply for stepsizes that are square-summable. Despite the weaker hypothesis, we show that the celebrated Kushner-Clark lemma [25] continues to hold. All this happens mainly because the earlier two works use the weaker Gronwall inequality [4, Corollary 1.1], while here we use the tighter Alekseev's formula to compare the SA trajectory to a suitable solution of its limiting ODE. In particular, Alekseev's formula allows us to better compare the two in the neighbourhood of a LASE on an infinite time interval. Concentration bounds of a similar flavor to our work have also been obtained recently in [18, Theorem 2.2] and [16, Corollary 2.9]. But as shown in Section 2, compared to our work and also to [7, 21], these recent results apply only to a restrictive class of SA methods and hold only under strong assumptions. In particular, results from [18, 16] only apply to SA methods: i) whose form is a special case of the generic model that we handle; ii) whose limiting ODE has a unique, globally asymptotically stable equilibrium; and iii) that satisfy respectively the assumption labelled HL ([18]) and HLS_α ([16]), amongst others. Both HL and its weaker variant HLS_α are strong assumptions; e.g., they do not hold for the simple yet popular TD(0) method with linear function approximation from reinforcement learning [13].
Under these settings, their results give unconditional convergence rates. This is possible because of the unique equilibrium hypothesis; we shall discuss this issue further in Section 7.
We now formally describe our setup and our key result in this paper. We consider the SA method

(1.1) x_{n+1} = x_n + a_n [h(x_n) + M_{n+1}], n ≥ 0,

with drift h : R^d → R^d, stepsizes {a_n}, and noise {M_n}, and its limiting ODE

(1.2) ẋ(t) = h(x(t)).

Let x* be a LASE of (1.2) such that Dh(x*) is Hurwitz, and let B be a bounded set containing x* and contained in the domain of attraction of x*. Let ‖·‖ denote the usual Euclidean norm for vectors and the induced norm for matrices. Let x̄(t) denote the continuous-time version of (1.1) obtained via linear interpolation. That is, let t_0 = 0 and, for each n ≥ 0, set t_{n+1} = t_n + a_n and x̄(t_n) = x_n; for t ∈ (t_n, t_{n+1}), let x̄(t) be the linear interpolation of x̄(t_n) and x̄(t_{n+1}). Let n_0 ≥ 0. Then, given ε > 0 and that the event {x̄(t_{n_0}) ∈ B} holds, our aim here is to obtain: 1. an estimate on the time T, starting from t_{n_0}, that the SA method in (1.1) will take to hit the ε-ball around x*; and 2. a lower bound on the probability that the SA method at time t_{n_0} + T + 1 is indeed inside the ε-ball around x* and remains there thereafter; this is the probability in (1.4). We assume the following throughout this paper.
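The interpolation bookkeeping and the hitting time of the ε-ball can be computed directly from the stepsize grid. The following sketch uses a noiseless toy recursion purely to illustrate the construction; all constants are hypothetical.

```python
import numpy as np

# Build the time grid t_0 = 0, t_{n+1} = t_n + a_n and the piecewise-linear
# interpolation xbar(t) of the iterates; then read off the first time the
# trajectory enters the eps-ball around x*. All constants here are toy values.
N = 200
a = 0.5 / (np.arange(N) + 1) ** 0.6        # stepsizes a_n
t = np.concatenate(([0.0], np.cumsum(a)))  # grid t_0, t_1, ..., t_N

x_star, eps = 0.0, 0.1
x = np.empty(N + 1)
x[0] = 2.0
for n in range(N):
    x[n + 1] = x[n] + a[n] * (-(x[n] - x_star))   # noiseless toy recursion

def xbar(s):
    # piecewise-linear interpolation between the (t_n, x_n) pairs
    return np.interp(s, t, x)

hit = next(n for n in range(N + 1) if abs(x[n] - x_star) <= eps)
T_hit = t[hit] - t[0]    # elapsed time to hit the eps-ball, as in item 1 above
```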
The noise sequence {M_n} is an R^d-valued martingale-difference sequence with respect to the increasing family of σ-fields F_n := σ(x_0, M_1, ..., M_n), n ≥ 0. That is, E[M_{n+1} | F_n] = 0 a.s. for all n ≥ 0. Furthermore, there exist continuous functions c_1, c_2 : R^d → R_{++} (strictly positive) such that the conditional tail bound (1.8) holds for all u ≥ u_L, where u_L is some sufficiently large but fixed number.
There exist r, r_0, ε_0 > 0 so that r > r_0 and, for 0 < ε ≤ ε_0, the corresponding inclusions hold. Here V_r is defined similarly to V_{r_0} with r replacing r_0.
Remark 1.1. Unlike most existing SA works [7, 21, 18, 16], etc., we do not require that the stepsize sequence {a_n} satisfy the square-summability condition, i.e., ∑_n a_n² < ∞. Therefore, compared to these works, our analysis holds for larger choices of stepsizes, e.g., a_n = 1/(n+1)^µ with µ ∈ (0, 1/2]. As pointed out to us by an anonymous referee, similar slowly decaying stepsize sequences have appeared in [31]; but there they appear only as part of the analysis for linear SA methods.
Remark 1.2. We emphasize that the assumption that h is twice continuously differentiable globally is only for pedagogical convenience. Our results go through even if h is twice continuously differentiable in some local neighbourhood of x*. Assumption (1.6) is again for ease of notation. Our results, with minor modifications, can be obtained even without it.
²Recall that a continuously differentiable function V : dom(V) ⊆ R^d → R is said to be a Liapunov function with respect to x* if V(x*) = 0 and, for all x ≠ x*, V(x) > 0 and ∇V(x) · h(x) < 0. The existence of a Liapunov function near x* is guaranteed, due to its asymptotic stability, by the converse Liapunov theorem [23]. We may in fact choose V so that V(x) → ∞ as x approaches the boundary of dom(V) (see ibid.).
Theorem 1.1. Let ε ∈ (0, ε_0], where ε_0 is as in A4. Then there exist constants C_1, C_2 > 0 and functions g_1(ε) = O(log(1/ε)) and g_2(ε) = O(1/ε) so that, whenever T ≥ g_1(ε) and n_0 ≥ N, where N is such that 1/a_n ≥ g_2(ε) ∀n ≥ N, the SA iterates of (1.1) satisfy the desired lower bound on (1.4). Here the constants C_1, C_2, as also the hidden constants in g_1, g_2, depend only on λ, d, r, and u_L.
The next result obtains order estimates for our concentration bound for the common stepsize family a_n = 1/(n+1)^µ, µ ∈ (0, 1].
Theorem 1.2. Let a_n = 1/(n+1)^µ, µ ∈ (0, 1]. With notations as in Theorem 1.1, keeping everything else fixed and treating only n_0 as a variable, the concentration bound admits the stated order estimate for some constant C > 0. Here O denotes the standard Big O order notation.
³In the special case that Dh(x*) is symmetric and hence diagonalizable, K̄ and λ′ can be chosen to be 1 and λ_min(x*), respectively.
Proof. See Section B in Appendix.
Some notable aspects of Theorem 1.1 are as follows.
• It is a local result, i.e., it gives a bound on the probability of convergence to a LASE if the iterates land up in its domain of attraction eventually. This is the so-called lock-in probability [2]. In particular, {x*} need not be the only attractor of (1.2).
• Letting A(n) denote the complement of the event whose conditional probability appears in the statement, we have a bound of the stated form for a suitably defined c(n_0) satisfying ∑_n c(n) < ∞. Therefore, ∑_n Pr{A(n) | x_n ∈ B} I{x_n ∈ B} < ∞ a.s., where I denotes the indicator function. Consequently, by [11, Corollary 5.29, p. 96], we have ∑_n I{A(n), x_n ∈ B} < ∞ a.s.
In particular, this implies that x_n → x* a.s. on the set {x_n ∈ B i.o.}. Thus we recover the celebrated Kushner-Clark lemma [25] under the weaker hypothesis a_n → 0 replacing the usual condition ∑_n a_n² < ∞. There is also one key limitation to this result. The concentration bound is conditional on the event {x̄(t_{n_0}) ∈ B}. Thus, in order to drive the iterates to a prescribed equilibrium, one will need to separately ensure that the n_0-th iterate is indeed within the set B. A related issue is to estimate the unconditional probability of convergence to a prescribed equilibrium. This requires an estimate of the probability of reaching the domain of attraction of the prescribed equilibrium from a given starting point. We discuss this in Section 7. Three artificial ways to fix this are as follows. First, forcefully project the SA method back onto the set B whenever the method leaves it; see [14] for recent advances in this direction. Second, pick an initial point within B and scale down the entire stepsize sequence so that N given in Theorem 1.1 equals 0. Third, use additional additive, extraneous noise to 'explore' the space, along with an oracle that tells you when you are in B. All these fixes are non-trivial, as they require explicit a priori knowledge of the desired equilibrium (e.g., a global minimum of some function), whereas often (e.g., in engineering applications) that is precisely what the algorithm is expected to discover. Also, the set B is often unavailable or hard to estimate well even when the desired equilibrium is known. Separately, the second fix needs bounds on g_2(ε) which, as we shall see, depends on a priori unknown parameters such as the smallest eigenvalue of Dh(x*), amongst others. But whenever such estimates can be obtained, the following restatement of Theorem 1.1 may be useful.
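The first fix above (projection back onto B) can be sketched as follows; taking B to be a Euclidean ball, with a guessed center and radius, is an illustrative choice. In practice, neither would be known a priori.

```python
import numpy as np

# Projected SA sketch: whenever an iterate leaves the set B, project it back.
# Here B is a ball of radius R_B around a guessed equilibrium; both the guess
# and the radius are hypothetical toy values.
rng = np.random.default_rng(1)
x_guess, R_B = 1.0, 2.0

def project_to_B(x):
    # Euclidean projection onto B = {x : |x - x_guess| <= R_B}
    d = x - x_guess
    return x if abs(d) <= R_B else x_guess + R_B * np.sign(d)

x = 0.5
for n in range(5000):
    a_n = 1.0 / (n + 1) ** 0.7
    M = rng.normal()
    x = project_to_B(x + a_n * (-(x - 1.0) + M))
```

By construction the iterates never leave B, so the conditioning event in the theorem holds at every n.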
Corollary. Let β_n, T, and g_2(ε) be as in Theorem 1.1. Suppose that 1/a_n ≥ g_2(ε) ∀n ≥ 0 and that x̄(0) = x_0 lies in B. Then the stated relation holds, where C_1, C_2 > 0 are constants as in Theorem 1.1.
The rest of the paper is organized as follows. In the next section, we give a comparison of our main result with existing works. This section may be skipped at a first reading. In the following section, we do some preliminary computations and get an intermediate lower bound on (1.4) which will be easier to work with. We also give an overview of our proof technique for Theorem 1.1. In Section 4, we first give Alekseev's formula. Using this, we then derive an alternative but equivalent expression for x̄(t) and, in particular, for x̄(t_n). In Section 5, we use this alternative expression to obtain a bound on ‖x̄(t_{n+1}) − x*‖ in terms of the noise sequence {M_n}. In Section 6, we finally prove our main result, i.e., Theorem 1.1, via a series of lemmas. This section needs a generalization of a concentration result from [28], which we prove separately as Theorem A.2 in the Appendix. We conclude with a brief discussion in Section 7.
2. Comparison with existing results. We first compare our key result (Theorem 1.1) with [7, Chapter 4, Corollary 14] and [21, Theorem 12]. Those results give concentration bounds for nonlinear SA methods with respect to generic attractors. Replacing the generic attractor with a LASE, the results there are in a form directly comparable to our result. Let B and x* be as in Theorem 1.1. Those works first give an estimate on the additional time T′ required to hit a suitably defined ε-neighbourhood of x* starting from B. This estimate is the same in both those works. Then, an estimate is provided of the probability that, after a passage of time T′, the iterates are within the aforementioned ε-neighbourhood and remain there thereafter, conditional on {x_{n_0} ∈ B}. The estimate on T′ is very loose; unlike our bound, it does not exploit the exponentially fast convergence rate of an ODE solution near a LASE. We compare their concentration bounds separately below.
There are two parts to [7, Chapter 4, Corollary 14]. The first part assumes a condition on the noise involving some constant C_1 ≥ 0. Under these assumptions, as n_0 → ∞ with everything else fixed, a concentration bound is obtained in terms of b_{n_0} := ∑_{n ≥ n_0} a_n² and some constant δ depending on ε. For a_n = 1/(n+1)^µ, µ ∈ (1/2, 1], it can easily be seen how this concentration bound specializes. In the second part of [7, Chapter 4, Corollary 14], the assumptions on h and {a_n} are the same as above. The difference is in the assumption on {M_n}. It is assumed there that {M_n} is a martingale-difference sequence bounded as in (2.3) for some constant C_1 ≥ 0. Under these assumptions, a concentration bound is obtained with another constant C_2 > 0 and δ, b_{n_0} as above; it can likewise be specialized for a_n = 1/(n+1)^µ, µ ∈ (1/2, 1]. Clearly, the concentration bound in the second part is tighter. But the bounded-noise assumption of (2.3) is restrictive and does not hold true in general (this setting is, however, very useful for many reinforcement learning problems).
The result of [21, Theorem 12] significantly improves on this. In addition to (2.1) and (2.2), it is only assumed there that {M_n} is a martingale-difference sequence and that a tail condition with constants C_1, C_2 > 0 holds for all sufficiently large u. Under these assumptions, it is shown there that a concentration bound holds for some constant C_3 > 0 and δ, b_{n_0} as above. For a_n = 1/(n+1)^µ, µ ∈ (1/2, 1], it is easy to see that this bound translates to 1 − O(e^{−C_4 n_0^{µ/2 − 1/4}}) for some constant C_4 > 0. Compared to the second part of [7, Chapter 4, Corollary 14], this bound is weaker, but it also has a similar exponential behaviour in n_0.
Our result, i.e., Theorem 1.1 of this paper, significantly improves upon the above two results. First, we do not need the stepsize sequence to satisfy the square-summability condition ∑_{n=0}^∞ a_n² < ∞; instead, we only require that a_n → 0 (in addition to ∑_{n=0}^∞ a_n = ∞). Thus, our result holds even for stepsizes such as a_n = 1/(n+1)^µ with µ ∈ (0, 1/2], while the previous two do not. Second, we only require that the noise sequence {M_n} satisfy A3. This is weaker than the assumption on {M_n} made in [21, Theorem 12] (and hence than the one in the second part of [7, Chapter 4, Corollary 14]). Third, despite the weaker assumptions, a direct comparison of our concentration bound for a_n = 1/(n+1)^µ, µ ∈ (1/2, 1] (see Theorem 1.2), shows that our bound betters that in [21, Theorem 12] for all µ ∈ (1/2, 1] and the one in the second part of [7, Chapter 4, Corollary 14] for µ ∈ (1/2, 2/3). Lastly, by exploiting the exponential convergence of an ODE solution near its attractor, we obtain tighter estimates for the time T required to hit the ε-ball around x* starting from the neighbourhood B. We do, however, require a stronger regularity of the function h, viz., twice continuous differentiability, at least locally near the equilibrium. A brief summary of the above comparison is given in Table 2.1.
The main reason why we obtain a tighter concentration bound in comparison to [7, Chapter 4, Corollary 14] and [21, Theorem 12] is the following. In [7, 21], the analysis boils down to showing that ∑_{k=n_i}^{n} a_k M_{k+1} is small in magnitude with high probability for all appropriately large n_i and n. In contrast, in the proof of our result, we only need to show that a term similar to ∑_{k=n_0}^{n} e^{−λ[∑_{i=k+1}^{n} a_i]} a_k M_{k+1}, where λ is as in (1.11), is small for all large n with high probability. This happens mainly due to the use of Alekseev's formula [1], which allows us to exploit the local stability of the ODE near an attractor. Further, to show that the term similar to ∑_{k=n_0}^{n} e^{−λ[∑_{i=k+1}^{n} a_i]} a_k M_{k+1} is small, we make use of the concentration inequality given in Theorem A.2 in place of the Azuma-Hoeffding inequality as in [7, Corollary 14] and [21, Theorem 12].
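The gap between the two noise terms can be seen numerically: with illustrative parameters (λ, µ, and the horizon below are arbitrary choices), the exponentially discounted sum arising from Alekseev's formula is typically far smaller than the undiscounted sum that the Gronwall-based analyses must control.

```python
import numpy as np

# Compare |sum_k a_k M_{k+1}| with |sum_k exp(-lam*(t_n - t_k)) a_k M_{k+1}|
# over many independent Gaussian noise draws. lam, mu, N are toy values.
lam, mu, N = 1.0, 0.6, 2000
a = 1.0 / (np.arange(N) + 1) ** mu
t = np.cumsum(a)                           # t_k = a_0 + ... + a_k
w = np.exp(-lam * (t[-1] - t))             # discount weights at the final time

plain_vals, disc_vals = [], []
for seed in range(100):
    M = np.random.default_rng(seed).normal(size=N)   # martingale differences
    plain_vals.append(abs(np.sum(a * M)))            # undiscounted noise sum
    disc_vals.append(abs(np.sum(w * a * M)))         # discounted noise sum
```

On average the discounted sums are smaller by more than an order of magnitude here: only the last few noise terms survive the exponential discounting, which is what allows the weaker stepsize and noise assumptions.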
Concentration bounds related to our work are also given in [18, Theorem 2.2] and [16, Corollary 2.9]. However, we discuss these separately since, as mentioned before, these results only apply to a restrictive class of SA methods and hold only under strong assumptions.
Table 2.1: B1 and B2 are respectively the first and second parts of [7, Chapter 4, Corollary 14], K is [21, Theorem 12], and * is Theorem 1.1 from this paper. Each C_i, C denotes a positive constant, O is the Big O notation, while Õ is the Big O notation with polynomial terms hidden. The concentration bounds are obtained assuming a_n = 1/(n+1)^µ with µ ∈ (1/2, 1] for the first three bounds, while µ ∈ (0, 1] for the last one.
Specifically, the SA algorithms
considered there are of the form (2.4), where: i.) H : R^d × R^d → R^d is a deterministic map satisfying the assumption labelled HL in [18] and HLS_α in [16], amongst others; ii.) {a_n} is a real-valued stepsize sequence satisfying (2.2); and iii.) {Y_n} is an R^d-valued sequence of IID random variables satisfying the Gaussian concentration property, i.e., there exists some α > 0 so that the corresponding inequality holds for every 1-Lipschitz function f : R^d → R. By adding and subtracting E[H(x_n, Y_{n+1}) | F_n], where F_n is the σ-field σ(x_0, Y_1, ..., Y_n), it is easy to see that (2.4) can be rewritten as in (1.1), thereby showing that the above SA model is a special case of our SA model. Further, they assume that the limiting ODE has x* as its unique equilibrium; again a substantial simplification of the setup we consider. Both HL and HLS_α, relating to the growth of H with respect to the second parameter, are strong assumptions, in that they do not hold for the simple yet popular TD(0) method with linear function approximation [13]. As already discussed before, the square-summability assumption on the stepsize is again stronger than ours. Lastly, since the Gaussian concentration property for {Y_n} needs to hold true for every 1-Lipschitz function f, this requirement is also restrictive when compared with (1.8). Under the restrictive settings and assumptions mentioned above, [18, 16] obtain an upper bound, (2.5), on the estimation error. There is a separate bound on δ_n which must be combined with the above to get the overall error bound. Note that (2.5) is unconditional, which is possible because of the strong assumptions and since there is a unique globally asymptotically stable equilibrium. Overall, our concentration bound is of a similar flavor to the ones obtained in [18, 16].
3. Preliminary computations. Henceforth, for u_0 ∈ R^d and s ≥ 0, we shall use x(t, s, u_0), t ≥ s, to denote the solution of (1.2) satisfying x(s, s, u_0) = u_0. Suppose for the time being that x̄(t_{n_0}) ∈ B. Since, from A4, B ⊆ V_{r_0} and V is a Liapunov function, we have x(t, t_{n_0}, x̄(t_{n_0})) ∈ V_{r_0} for all t ≥ t_{n_0}. Further, if we wait long enough, then x(t, t_{n_0}, x̄(t_{n_0})) will reach a sufficiently small neighbourhood of x* and remain in it thereafter. Our idea to prove Theorem 1.1 is to show that, with very high probability, conditional on {x̄(t_{n_0}) ∈ B}, ‖x̄(t) − x(t, t_{n_0}, x̄(t_{n_0}))‖ is small for all t ≥ t_{n_0}. Note that x̄(t) and x(t, t_{n_0}, x̄(t_{n_0})) start from the same point x̄(t_{n_0}) at time t = t_{n_0}. We elaborate more on our idea at the end of this section. But we first introduce some notations and come up with an intermediate lower bound on (1.4) which will be much easier to work with.
Fix some sufficiently large n_0, T; we shall elaborate later on how large they ought to be. Pick n_1 ≡ n_1(n_0) accordingly; this can be done because (1.5) and (1.6) hold. Note that G_n is an event. The desired intermediate lower bound on (1.4) is given below.
In the remaining part of this proof, we obtain a superset of the event in the second term in (3.6). This will help us prove the desired result.
For any event E, let E^c denote its complement. Then, between any two events E_1 and E_2, the following relation is easy to see.
Using this, it follows that the corresponding inclusions hold. Recall that, on the event {x̄(t_{n_0}) ∈ B}, x(t, t_{n_0}, x̄(t_{n_0})) ∈ V_{r_0} for all t ≥ t_{n_0}.
Combining this with the assumption in A4, the last relation follows because ε_0 ≥ ε (see A4). Arguing similarly, using the assumption in A4 again, and putting the above discussions together, we obtain a bound which, in combination with (3.6), gives the desired result.
We now elaborate on our technique to prove Theorem 1.1 and the usefulness of (3.5) for the same. First note that, to obtain a lower bound on (1.4), it suffices to obtain an upper bound on the second term on the RHS of (3.5). Indeed, this is what we do. This is also easier because we now only need to obtain bounds on ρ_{n+1} and ρ*_{n+1} on the event G_n. This has been done in Lemmas 5.10 and 5.11 in Section 5, where S_n is an appropriate sum of martingale differences. To show that the terms on the RHS there are small, we use the concentration inequality in Theorem A.2 and the assumption in (1.8). In the next section, we describe Alekseev's formula and use it to give an alternative expression for x̄(t_n). This will be very useful for proving Lemmas 5.10 and 5.11.

4. Alekseev's formula and an alternative expression for x̄(t_n).
Alekseev's formula, given below, provides a recipe to compare two nonlinear systems of differential equations. It is a generalization of the variation of constants formula.
Theorem 4.1 (Alekseev's formula, [1]). Consider a differential equation

u̇(t) = f(t, u(t)), t ≥ 0,

and its perturbation

ṗ(t) = f(t, p(t)) + g(t, p(t)), t ≥ 0,

where f, g : R × R^d → R^d, f is continuously differentiable everywhere, and g is continuous everywhere. Let u(t, t_0, p_0) and p(t, t_0, p_0) denote respectively the solutions to the above nonlinear systems for t ≥ t_0 satisfying u(t_0, t_0, p_0) = p(t_0, t_0, p_0) = p_0. Then

p(t, t_0, p_0) = u(t, t_0, p_0) + ∫_{t_0}^{t} Φ(t, s, p(s)) g(s, p(s)) ds,

where Φ(t, s, u_0), for u_0 ∈ R^d, is the fundamental matrix of the linear system ẏ(t) = (∂f/∂u)(t, u(t, s, u_0)) y(t), t ≥ s, with Φ(s, s, u_0) = I_d.

See [10, Lemma 3] for an English version of the original proof of the above result. We now use this result to compare x̄(t) with x(t, t_{n_0}, x̄(t_{n_0})). Using (1.1), and since t_{k+1} − t_k = a_k, k ≥ 0, we have the corresponding integral representation for any n ≥ n_0. For k ≥ n_0 and s ∈ [t_k, t_{k+1}], define the perturbation terms ζ_1(s) and ζ_2(s) accordingly. Then it is easy to see that, for n ≥ n_0, x̄ satisfies the perturbed ODE (4.4). Think of (1.2) as the unperturbed ODE and (4.4) as its perturbation. The perturbation term at time t is of course ζ_1(t) + ζ_2(t), which is piecewise continuous in t. The same proof that was used to prove Theorem 4.1 also holds in this context. Hence, using Alekseev's formula, we get

(4.5) x̄(t) = x(t, t_{n_0}, x̄(t_{n_0})) + ∫_{t_{n_0}}^{t} Φ(t, s, x̄(s)) [ζ_1(s) + ζ_2(s)] ds,

where Φ(t, s, u_0), for any u_0 ∈ R^d, is the fundamental matrix of the non-autonomous linearized system

(4.6) ẏ(t) = Dh(x(t, s, u_0)) y(t), t ≥ s,

with Φ(s, s, u_0) = I_d. Here Dh(x(t, s, u_0)) is the Jacobian matrix of h along the solution trajectory x(t, s, u_0).
with Φ(t_n, s, x̄(s)) being the fundamental matrix of (4.6) with u_0 = x̄(s).
Remark 4.1. Note that {S_n} is a sum of martingale differences with respect to {F_n}, while S̄_n is not. We shall exploit this later while proving Theorem 1.1.
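Alekseev's identity of Theorem 4.1 can be checked numerically on a scalar example. In the sketch below, all functions and constants are illustrative choices: the perturbed solution at time T is compared with the unperturbed one plus the correction integral, using the fact that in the scalar case the fundamental matrix reduces to Φ(T, s, u_0) = exp(∫_s^T f′(u(τ, s, u_0)) dτ).

```python
import numpy as np

f = lambda u: -u - u ** 3           # unperturbed drift (continuously differentiable)
df = lambda u: -1.0 - 3.0 * u ** 2  # its derivative
g = lambda t, p: 0.1 * np.cos(t)    # continuous perturbation

def rk4(rhs, y0, ts):
    """Integrate dy/dt = rhs(t, y) along the grid ts with classical RK4."""
    y = np.empty(len(ts))
    y[0] = y0
    for i in range(len(ts) - 1):
        t0, h = ts[i], ts[i + 1] - ts[i]
        k1 = rhs(t0, y[i])
        k2 = rhs(t0 + h / 2, y[i] + h * k1 / 2)
        k3 = rhs(t0 + h / 2, y[i] + h * k2 / 2)
        k4 = rhs(t0 + h, y[i] + h * k3)
        y[i + 1] = y[i] + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return y

def trapezoid(y, x):
    # trapezoidal rule (returns 0.0 for a single-point grid)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

T, N, p0 = 2.0, 400, 0.5
ts = np.linspace(0.0, T, N + 1)
p = rk4(lambda t, y: f(y) + g(t, y), p0, ts)  # perturbed solution p(t, 0, p0)
u = rk4(lambda t, y: f(y), p0, ts)            # unperturbed solution u(t, 0, p0)

# Correction term: integral of Phi(T, s, p(s)) * g(s, p(s)) over s in [0, T].
integrand = np.empty(N + 1)
for i in range(N + 1):
    flow = rk4(lambda t, y: f(y), p[i], ts[i:])   # unperturbed flow from p(s)
    Phi = np.exp(trapezoid(df(flow), ts[i:]))     # scalar fundamental matrix
    integrand[i] = Phi * g(ts[i], p[i])
correction = trapezoid(integrand, ts)

err = abs(p[-1] - (u[-1] + correction))  # vanishes up to discretization error
```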

5. Bound on ρ_{n+1}, ρ*_{n+1} on G_n. Our aim here is to obtain a bound on ρ_{n+1} and ρ*_{n+1} on the event G_n. This is given in Lemmas 5.10 and 5.11. We shall use this in Section 6 to obtain a bound on the second term on the RHS of (3.5) and hence on (1.4). The proofs of the above-mentioned results require some supplementary lemmas, which we prove first. Across these lemmas, we shall repeatedly use the linear ODE

(5.1) ẏ(t) = Dh(x*) y(t), t ≥ 0.

This is the linearization of (1.2) near x*. We shall also use r as in A4 and the set V_r. Hence it follows that h and Dh are Lipschitz continuous over the compact set V_r. Let L_h and L_D, respectively, denote the associated Lipschitz constants.
Lemma 5.1. Let λ be as in (1.11). Let u_0, u_1 be arbitrary points in V_r and s be an arbitrary positive real number. Then, for t ≥ s, the stated exponential estimate holds.
Proof. We first prove the following claim.
Claim (i). There exists r′ satisfying 0 < r′ < r with the following property: for any arbitrary u_0, u_1 ∈ V_{r′} and any s ≥ 0, the corresponding exponential estimate holds.
Consider P := ∫_0^∞ e^{Dh(x*)^⊤ t} e^{Dh(x*) t} dt, where ⊤ denotes transpose. Since Dh(x*) is Hurwitz, P is well defined. It is also easy to check that P is symmetric and positive definite. From [22, Theorem 4.6, p. 136], we further have that P is the unique positive definite and symmetric matrix satisfying the Liapunov equation

Dh(x*)^⊤ P + P Dh(x*) = −I_d,

where, as mentioned before, I_d is the d-dimensional identity matrix. Let N(x*) be defined accordingly, where κ is as defined below (1.9). The existence of N(x*) is guaranteed since Z is continuous. The latter follows due to A1, which ensures that Dh is continuous.
Fix r′ such that 0 < r′ < r and V_{r′} ⊆ N(x*). Fix s, u_0, and u_1 as prescribed in Claim (i), with r′ as defined above. For notational convenience, let x_0(t) ≡ x(t, s, u_0) and x_1(t) ≡ x(t, s, u_1).
Also let V(t) := [x_0(t) − x_1(t)]^⊤ P [x_0(t) − x_1(t)]. Observe that, since P is positive definite, V(t) ≥ 0 for all t ≥ s. Differentiating with respect to t, and using the fact that ẋ_i(t) = h(x_i(t)), i = 0, 1, it is easy to obtain an expression for V̇(t). By the mean value theorem, h(x_0(t)) − h(x_1(t)) equals a matrix integral applied to x_0(t) − x_1(t). Hence V̇(t) can be written in terms of Z, where Z is as in (5.5).
Since V is a Liapunov function, the solutions stay in the relevant sets. Hence, by adding and subtracting I_d to the integrand in the relation concerning V̇(t) above, a bound on V̇(t) follows. By the definition of V(t) in (5.6), a further relation also holds. Combining the above two relations, we get a differential inequality for V(t). But, using (5.3) and (1.10), the extra term is controlled. Hence, using (1.11), we obtain an exponential-decay differential inequality for V(t) and consequently, by integrating from s to t, an exponential decay estimate for V(t). Since (5.6) relates V(t) to ‖x_0(t) − x_1(t)‖², it follows that the claimed estimate holds with K′_1 := ‖P‖/λ_min(P). This proves Claim (i), as desired. We now proceed to prove the actual lemma. Pick arbitrary u_0, u_1 ∈ V_r and s ≥ 0. Since u_i ∈ V_r and V is a Liapunov function, x(t, s, u_i) ∈ V_r for each t ≥ s. Hence, invoking the Lipschitz continuity of h over V_r, an integral inequality for ‖x(t, s, u_0) − x(t, s, u_1)‖ follows. Using the Gronwall inequality [4, Corollary 1.1] on this, we get ‖x(t, s, u_0) − x(t, s, u_1)‖ ≤ e^{L_h(t−s)} ‖u_0 − u_1‖ for any t ≥ s. Let r′ be as in Claim (i) and let T be defined accordingly. As V is a Liapunov function, inf_{x ∈ V_r \ V_{r′}} |∇V(x) · h(x)| > 0. Since V̇(x(t)) = ∇V(x(t)) · h(x(t)), T is an upper bound on the time taken for a solution of (1.2) starting from any point in V_r to reach V_{r′}. That is, x(s + T, s, u_i) ∈ V_{r′} whatever be the values of s ≥ 0 and u_i ∈ V_r. Combining this with Claim (i) above, it follows that the desired estimate holds for all t ≥ s + T.
Combining the above two, it follows that the stated estimate holds for t ≥ s + T. Hence, for suitable K_1 ≥ 0, the claimed bound holds for all t ≥ s. This proves the desired result.
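The construction of P in the proof of Claim (i) can be checked numerically. In the sketch below, the Hurwitz matrix A stands in for Dh(x*) and is an arbitrary example: the Liapunov-equation solution is cross-checked against the defining integral and its positive definiteness.

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_lyapunov

# For a Hurwitz A (a stand-in for Dh(x*)), P = int_0^inf exp(A^T t) exp(A t) dt
# is the unique symmetric positive-definite solution of A^T P + P A = -I.
A = np.array([[-1.0, 2.0],
              [0.0, -3.0]])
d = A.shape[0]

# Solve the Liapunov equation A^T P + P A = -I directly ...
P = solve_continuous_lyapunov(A.T, -np.eye(d))

# ... and cross-check against the defining integral, truncated at t = 15
# (the integrand decays like exp(-2t) here, so the tail is negligible).
ts = np.linspace(0.0, 15.0, 3001)
vals = np.array([expm(A.T * s) @ expm(A * s) for s in ts])
P_int = np.sum((vals[1:] + vals[:-1]) / 2.0 * np.diff(ts)[:, None, None], axis=0)

residual = np.abs(A.T @ P + P @ A + np.eye(d)).max()
```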
Lemma 5.2. Let u_0 ∈ V_r and s ≥ 0 be arbitrary. Then, for any t ≥ s, the stated bound holds, where K_2 ≥ 0 is some constant.
Proof. Recall that Dh is Lipschitz continuous over the compact set V_r with Lipschitz constant L_D. Separately, since V is a Liapunov function and u_0 ∈ V_r, we have x(τ, s, u_0) ∈ V_r for all τ ≥ s. Hence the corresponding chain of inequalities follows, where the second relation follows from Lemma 5.1 on substituting u_1 = x*, while the truth of the last one can be seen using (5.2). The desired result is now easy to see.
Lemma 5.3. Let u_0 ∈ V_r and s ≥ 0 be arbitrary. Let Φ(t, s, u_0), t ≥ s, be as defined above (4.6). Then, for t ≥ s, the stated bound holds, where K_3 ≥ 0 is some constant.
Proof. Observe that (4.6) can be written as a perturbation of (5.1). Hence, using the variation of constants formula, or equivalently Alekseev's formula (column by column), we get the representation (5.9), where λ is as in (1.11). Hence, by taking the spectral norm on both sides of (5.9), we obtain an integral inequality. Using the Gronwall inequality [4, Corollary 1.1] on this, and then Lemma 5.2, the desired result follows.
Similarly, define Ψ_1(t, s′) with respect to (5.11). Treating (5.11) as a perturbation of (5.10), it follows, by using the variation of constants formula or equivalently Alekseev's formula (column by column), that the corresponding representation holds. Since u_0, u_1 ∈ V_r, it follows by arguing as in Lemma 5.3 that the analogous bounds hold. Also recall that Dh is Lipschitz continuous on V_r with Lipschitz constant L_D. Hence a further bound follows. Putting all the above relations together, it follows that there exists some suitable constant for which the stated estimate holds. Using Lemma 5.1, the desired result is now easy to see.
Lemma 5.5. Let k, n with n_0 ≤ k < k + 1 ≤ n be arbitrary. Then there exists a constant K_5 ≥ 0 such that the stated bound holds on the event G_n.
Proof. The first chain of inequalities holds, where the last inequality is due to (1.1). On G_n, and since n_0 ≤ k ≤ n − 1, note that x̄(t_k) ∈ V_r. Combining this with the fact that h(x*) = 0 and h is Lipschitz over V_r, a further bound follows on G_n. Also note that the remaining term is controlled. Combining the above relations, the desired result is easy to see.
In the next two results, we respectively obtain bounds on $W_n$ and $\bar{S}_n - S_n$, where $W_n$, $\bar{S}_n$, and $S_n$ are as in (4.8), (4.9), and (4.10).
Lemma 5.6. Let $n \ge n_0$ be arbitrary. Then on $G_n$, the stated bound holds, where $K_6 \ge 0$ is some constant.

Proof. Recall that $h$ is Lipschitz over $V_r$ with Lipschitz constant $L_h$. In light of a preliminary observation, the following relations hold on the event $G_n$. First, $\bar{x}(s) \in V_r$ for each $s \in [t_{n_0}, t_n]$; hence a first estimate follows, and applying Lemma 5.5 to it gives the next one. From this, it follows that there exists some constant $K_6' \ge 0$ so that the claimed bound holds up to a remaining sum. But this sum can be estimated in turn, the last inequality being due to (1.6). The desired result now follows.
Lemma 5.7. Let $n \ge n_0$ be arbitrary. Then on $G_n$, the stated bound holds, where $K_7 \ge 0$ is some constant.
Proof. After a preliminary observation, the following statements hold on the event $G_n$. Clearly, $\bar{x}(s) \in V_r$ for each $s \in [t_{n_0}, t_n]$. Consequently, using Lemma 5.4, a first estimate follows; applying Lemma 5.5 to it gives the next. Arguing now as in Lemma 5.6, the desired result is easy to see.
Assuming the event $G_n$ occurs, we now obtain upper bounds on $\|\bar{x}(t_n) - x(t_n, t_{n_0}, \bar{x}(t_{n_0}))\|$ and $\|\bar{x}(t_{n+1}) - x(t_{n+1}, t_{n_0}, \bar{x}(t_{n_0}))\|$, and use these to obtain bounds on $\rho_{n+1}$ and $\rho^*_{n+1}$.
Lemma 5.8. Let $n \ge n_0$ be arbitrary. Then on $G_n$, the stated bound holds, where $K_8 \ge 0$ is some constant.
Proof. From Theorem 4.2, we have $\|\bar{x}(t_n) - x(t_n, t_{n_0}, \bar{x}(t_{n_0}))\| \le W_n + \|S_n\| + \|\bar{S}_n - S_n\|$. Using Lemmas 5.6 and 5.7, together with the fact that $\|x\| \le 1 + \|x\|^2$, the desired result is easy to see.
Lemma 5.9. Let $n \ge n_0$ be arbitrary. Then on $G_n$, the stated bound holds, where $K_9 \ge 0$ is some constant.
Proof. Using (1.1) and the relation $x(t_{n+1}, t_{n_0}, \bar{x}(t_{n_0})) = x(t_n, t_{n_0}, \bar{x}(t_{n_0})) + \int_{t_n}^{t_{n+1}} h(x(s, t_{n_0}, \bar{x}(t_{n_0})))\, ds$, the triangle inequality yields a first estimate. But $h$ is Lipschitz over $V_r$ with Lipschitz constant $L_h$; also, on $G_n$, both $\bar{x}(t_n)$ and $x(s, t_{n_0}, \bar{x}(t_{n_0}))$, $s \ge t_{n_0}$, lie in $V_r$. Hence it follows, using (5.2), that
$\int_{t_n}^{t_{n+1}} \|h(\bar{x}(t_n)) - h(x(s, t_{n_0}, \bar{x}(t_{n_0})))\|\, ds \le 2 L_h R\, a_n.$
Substituting this in the above relation and using Lemma 5.8, the desired result is easy to see.
Lemma 5.10. Let $n \ge n_0$ be arbitrary. Then on $G_n$, $\rho_{n+1} \le K_{10}\, \|S_n\| + \sup(\cdots)$, where $K_{10} \ge 0$ is some constant.
Arguing as in the proof of Lemma 5.9, it follows that a corresponding estimate holds on $G_n$. Substituting this in (5.12) and making use of Lemmas 5.8 and 5.9, the desired result is easy to see.
Using Lemma 5.10, the desired result is easy to see.
Let $K := \max\{K_{10}, K_{11}\}$. The following result is then straightforward.
Theorem 5.1. Let $n \ge n_0$ be arbitrary. Then, on $G_n$, the corresponding bound on $\rho_{n+1}$ holds, and $\rho^*_{n+1} \le K\, \|S_n\| + \sup(\cdots)$, where $K \ge 0$ is as defined above.
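The qualitative content of the bounds above — iterates that start in $V_r$ and track the limiting ODE into the $\epsilon$-ball of the LASE — can be illustrated with a toy simulation. The drift $h(x) = x - x^3$ (with LASE $x^* = 1$), the stepsizes $a_n = 1/(n+2)$, and the noise scale below are all illustrative choices of ours, not quantities from the paper:

```python
import random

random.seed(0)

def h(x):
    # Drift with two locally asymptotically stable equilibria, x = +1 and x = -1.
    return x - x**3

x = 0.5  # start inside the domain of attraction of x* = 1
for n in range(10000):
    a_n = 1.0 / (n + 2)
    M = 0.1 * random.gauss(0.0, 1.0)  # martingale-difference noise
    x += a_n * (h(x) + M)             # SA recursion: x_{n+1} = x_n + a_n (h(x_n) + M_{n+1})
```

With decreasing stepsizes, the iterate settles near $x^* = 1$ rather than the other attractor $-1$, mirroring the 'lock-in' behaviour quantified above.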
6. Proof of Theorem 1.1. Our first result here gives an upper bound for the probability expression on the RHS of (3.5) in terms of $\{\|S_n\|\}$ and $\{a_n \|M_{n+1}\|^2\}$.
Theorem 6.1. Let $\bar{x}(t)$ be as in (1.3), $K$ be as defined in Theorem 5.1, $n_1$ be as in (3.1), and $\epsilon$ be as in Theorem 1.1. Let $N$ be such that $a_n \le \epsilon/(4K)$ for all $n \ge N$, and let $T$ be such that $e^{-\lambda T} \le \epsilon/(4K)$. Then the stated bound holds for any $n_0 \ge N$.
Proof. From (3.1), it follows that $t_n \ge t_{n_0} + T$ for each $n \ge n_1 + 1$. Since $e^{-\lambda T} \le \epsilon/(4K)$, it follows that $e^{-\lambda(t_n - t_{n_0})} \le \epsilon/(4K)$ for each $n \ge n_1 + 1$. Combining this with the fact that $n_0 \ge N$, Theorem 5.1 yields one estimate for $n_0 \le n \le n_1$ and another for $n \ge n_1 + 1$. For $n_0 \le k \le n$, note that $G_n \subseteq G_k$; hence the corresponding relation between the associated probabilities holds for every $n \ge n_0$. Putting the above relations together, the desired result is now easy to see.
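For concrete stepsizes, the constants $N$ and $T$ of Theorem 6.1 are explicitly computable. A minimal sketch, assuming $a_n = 1/(n+1)$; the numerical values of $\epsilon$, $K$, and $\lambda$ below are placeholders, and `horizon_constants` is our own helper name:

```python
import math

def horizon_constants(eps, K, lam):
    """Smallest N with a_n = 1/(n+1) <= eps/(4K) for all n >= N,
    and a T with exp(-lam * T) <= eps/(4K)."""
    thr = eps / (4.0 * K)
    N = max(0, math.ceil(1.0 / thr) - 1)  # a_N = 1/(N+1) <= thr
    T = math.log(1.0 / thr) / lam         # e^{-lam * T} = thr
    return N, T

N, T = horizon_constants(eps=0.01, K=2.0, lam=0.5)
```

Note that $N$ grows like $4K/\epsilon$, while $T$ grows only logarithmically in $K/\epsilon$.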
We now sequentially derive bounds for the two expressions on the RHS of (6.1). Let $K_{12} := \sup_{x \in V_r} c_1(x)$ and $K_{13} := \inf_{x \in V_r} c_2(x)/(2\sqrt{K})$. Since $V_r$ is a compact set, it follows that $K_{12}, K_{13} \in (0, \infty)$.
Theorem 6.2. Let $\bar{x}(t)$ be as in (1.3), $K$ be as in Theorem 5.1, $\epsilon$ be as in Theorem 1.1, and $N$ be as in Theorem 6.1. Then the stated bound holds for $n_0 \ge N$.
Proof. Observe the chain of relations in which the last inequality follows due to (1.8) and the fact that $\bar{x}(t_n) \in V_r$ on the event $G_n$. This proves the desired result.
Theorem 6.3. Let $\bar{x}(t)$ be as in (1.3), $K$ be as in Theorem 5.1, $\epsilon$ and $\beta_n$ be as in Theorem 1.1, $N$ be as in Theorem 6.1, and $S_n$ be as in (4.10). Then, for some constants $K_{14} \ge 0$ and $K_{15} > 0$, the stated relation holds.
Proof. Let
(6.2) $\alpha_{k+1,n} := \int_{t_k}^{t_{k+1}} \Phi(t_n, s, \bar{x}(t_k))\, ds.$
Then $S_n = \sum_{k=n_0}^{n-1} \alpha_{k+1,n} M_{k+1}$. Since $G_{n_0} \supseteq \cdots \supseteq G_{n-1} \supseteq G_n$, the stated chain of equalities holds; the last-but-one equality follows since $\mathbb{1}_{G_{n_0}} = \cdots = \mathbb{1}_{G_{n-1}} = 1$ on $G_{n-1}$. To prove the desired result, it thus suffices to show that there exist constants $K_{14} \ge 0$ and $K_{15} > 0$ so that the stated relation holds. The sum in question is a sum of martingale differences. Hence the above two relations follow directly from a conditional variant of Theorem A.2 and the discussion in Remark A.1, provided there exist constants $\delta, C, \gamma_1, \gamma_2 > 0$ so that (6.3) holds a.s. for $k \ge n_0 + 1$, and the sum and max bounds (6.4) and (6.5) hold. In the remainder of this proof, we establish (6.3), (6.4), and (6.5). Pick an arbitrary $F_{k-1} \in \mathcal{F}_{k-1}$ and observe the first estimate. Also note that when $u \ge e^{\delta u_L}$, we have $(\log u)/\delta \ge u_L$. Therefore the next estimate holds, where its last inequality follows from $A3$ and the fact that $x_{k-1} \in V_r$ on the event $G_{k-1}$.
If we pick $\delta = K_{13}/2$, it follows from the above two inequalities that the required moment bound holds. Substituting this in (6.6), it follows that the corresponding estimate holds with $C = \exp[K_{13} u_L/2] + K_{12} \exp[K_{13} u_L/2] + 1$ and $\delta = K_{13}/2$. Since $F_{k-1} \in \mathcal{F}_{k-1}$ was arbitrary, this establishes (6.3). Next, note from Lemma 5.3 that the exponential bound on $\Phi$ holds on $G_k$. Hence, from (6.2) and arguing as in the proof of Lemma 5.6, the bounds desired in (6.4) and (6.5) follow. This completes the proof.
We end this section with a brief comment on how one may estimate the constant $K_1'$ defined in the proof of Lemma 5.1. This is a key constant, since all other constants defined throughout Sections 5 and 6 essentially depend on it. First, $\tilde{K}$ and $\lambda'$ defined in (1.10) depend on prior knowledge of $x^*$, which is usually unavailable. One can, though, use a loose estimate based on knowledge of $Dh$ in a neighborhood of $x^*$, if available. Having chosen $\tilde{K}$ and $\lambda'$, an estimate of $P$ can then easily be found via (5.7). When the matrix $Dh(x^*)$ is symmetric, one can be a bit more explicit: in that case, $P = -[Dh(x^*)]^{-1}/2$, and $\tilde{K}$ and $\lambda'$ can be chosen as in Footnote 3; consequently, $K_1'$ is precisely the square root of the condition number of $Dh(x^*)$. It may be noted that, even in the absence of explicit constants, our concentration bound provides useful information in the form of 'order' estimates, in the spirit of sample complexity in machine learning [34].
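When $Dh(x^*)$ is symmetric, the recipe above is fully explicit and computable. A minimal sketch for the $2 \times 2$ case (the Hurwitz matrix below is a made-up example, and `k1_symmetric_2x2` is our own helper name):

```python
import math

def k1_symmetric_2x2(a, b, d):
    """Square root of the condition number of the symmetric 2x2 matrix
    [[a, b], [b, d]], computed from its (real) eigenvalues."""
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(tr * tr - 4.0 * det)
    lam1, lam2 = (tr + disc) / 2.0, (tr - disc) / 2.0
    cond = max(abs(lam1), abs(lam2)) / min(abs(lam1), abs(lam2))
    return math.sqrt(cond)

# Dh(x*) = diag(-4, -1): condition number 4, so the estimate is 2.
K1_est = k1_symmetric_2x2(-4.0, 0.0, -1.0)
```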
7. Discussion. We first consider the issue of obtaining unconditional convergence rates/concentration bounds, as opposed to ours, which are conditioned on the iterate being in the domain of attraction of a given equilibrium. An unconditional estimate will be the product of our estimate and the probability that the conditioning event occurs, i.e., that the domain of attraction is indeed reached (one might add the qualifier 'after a specified time'); see, e.g., [5, Proposition 7.5]. As already noted, the latter probability is strictly positive for any stable equilibrium under reasonable hypotheses; hence, the primary task is to find a good estimate thereof.
The simplest case is when the limiting ODE has a single globally asymptotically stable equilibrium. In recent work [13], we obtained unconditional convergence rates for the special case of TD(0) with linear function approximation, a popular algorithm in reinforcement learning. There the limiting ODE is linear and consequently has a unique equilibrium.
The key idea there is to first obtain a high-probability bound on how far the TD(0) iterates can travel while the stepsizes are initially large. Once the stepsizes become sufficiently small, the analysis of the present work is invoked to show that the TD(0) iterates closely follow an appropriate solution of the ODE with high probability. Hence, combining ideas from [13] and this work, we believe it may be possible to obtain unconditional convergence rates for nonlinear SA methods whose limiting ODE has a unique, globally asymptotically stable equilibrium, as in [18,16], but without having to resort to the strong $HL$ or $HLS_\alpha$ type assumptions.
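For context, TD(0) with (trivial, tabular) linear features can be sketched in a few lines. The toy example below is ours, not the algorithm or analysis of [13]: a deterministic two-state chain with discount $\gamma = 0.9$, whose value function solves $V(0) = 1 + \gamma V(1)$ and $V(1) = \gamma V(0)$; the stepsize schedule is an illustrative choice:

```python
# TD(0) on a deterministic two-state chain: 0 -> 1 -> 0 -> ...
# Reward 1 on leaving state 0, reward 0 on leaving state 1.
gamma = 0.9
V = [0.0, 0.0]
s = 0
for n in range(20000):
    a_n = 10.0 / (10.0 + n)                         # decaying stepsize
    s_next = 1 - s
    r = 1.0 if s == 0 else 0.0
    V[s] += a_n * (r + gamma * V[s_next] - V[s])    # TD(0) update
    s = s_next
```

The iterates approach the fixed point $V^*(0) = 1/(1-\gamma^2) \approx 5.26$, $V^*(1) = \gamma/(1-\gamma^2) \approx 4.74$; with noisy rewards, the recursion becomes a bona fide SA scheme of the type analyzed here.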
For the case of multiple equilibria/attractors, one has to distinguish between two scenarios. The first is the case when the equilibria are unknown and, while one of them may be the most desirable, its a priori description does not allow us to say anything about its location. This is commonplace in engineering applications; a prime example is the stochastic gradient scheme for minimization, which guarantees convergence only to a local minimum whereas the desired goal is the global minimum. One way to ensure the latter is to add extraneous, slowly decreasing noise, which leads to the simulated annealing algorithm [19]. For a non-gradient scheme, a similar ploy may be expected to lead to the minimum of the so-called Freidlin-Wentzell potential [17]; to our knowledge, this has been worked out so far only for discrete state spaces [29] and compact Riemannian manifolds [30].
The other possible scenario is where there may be some prior information about possible equilibria/attractors and we wish to reach a most preferred one. This may be the case, e.g., in models arising in economics; in fact, this was the original motivation for Arthur to look at the lock-in probability. Then the issue is what aspect of the dynamics, given that it is a socioeconomic process and not an algorithm, is in our control. In other words, can we affect the probability of reaching the domain of attraction of the desired equilibrium from the given starting point? A natural and commonplace situation is when the initial point is in the domain of attraction of an undesired equilibrium. Then the complement of our probability estimate (i.e., $1-$ the estimate) is an upper bound on the probability of escape from it. The paths from the initial point to the desired set may traverse several such domains of attraction, and the upper bound will then involve all such estimates, over all possible traversal sequences. This is an interesting direction to pursue in the future. A second issue then is to improve this probability if we have any control over the dynamics, including the possibility of adding noise as described above. This is a more interesting class of problems, with overtones of 'stochastic resonance' [20].
Other interesting directions to pursue are extensions to distributed asynchronous algorithms and to more general noise models such as Markov noise. We end by pointing to some recent papers that build upon the ideas discussed here, thereby illustrating the usefulness of this work. In [14] and [8], concentration bounds have been obtained for two-timescale SA; the first deals with the linear case, while the second handles the generic nonlinear setup. Separately, [24] studies constant-stepsize SA used to track a slowly moving target and provides bounds on the tracking error.
for all $k \ge 1$. Let $S_n = \sum_{k=1}^n \alpha_{k,n} X_k$, where the $\alpha_{k,n}$ are a.s. bounded, previsible, real-valued random variables. That is, $\alpha_{k,n} \in \mathcal{F}_{k-1}$, and there is a finite positive deterministic number, say $A_{k,n}$, such that $|\alpha_{k,n}| \le A_{k,n}$ a.s. Suppose $\sum_{k=1}^n A_{k,n} \le \gamma_1$ and $\max_{1 \le k \le n} A_{k,n} \le \gamma_2 \beta_n$, where $\{\beta_n\}$ is some positive sequence and $\gamma_1, \gamma_2 > 0$ are constants independent of $n$. Then there exists some constant $c > 0$, depending on $\delta, C, \gamma_1, \gamma_2$, such that the stated tail bound holds for $\xi > 0$. We divide the proof into a series of lemmas.
Proof. This follows from iterated conditioning.
Since this holds for each $0 < \omega < 1/(\gamma_2 \beta_n)$, the corresponding bound follows. Using the proof of [28, (2.4)], it eventually follows that the one-sided tail estimate holds; similarly, one can bound the other tail. The desired result follows.
The next result is a multivariate version of Theorem A.1.
Theorem A.2. Let $S_n = \sum_{k=1}^n \alpha_{k,n} X_k$, where $\{X_k\}$ is an $\mathbb{R}^d$-valued $\{\mathcal{F}_k\}$-adapted martingale-difference sequence and $\{\alpha_{k,n}\}$ is a sequence of a.s. bounded, previsible, real-valued $d \times d$ random matrices. That is, $\alpha_{k,n} \in \mathcal{F}_{k-1}$, and there exists a finite number, say $A_{k,n}$, such that $\|\alpha_{k,n}\| \le A_{k,n}$ a.s. Suppose that, for some $\delta, C > 0$,
$E[e^{\delta \|X_k\|} \,|\, \mathcal{F}_{k-1}] \le C \quad \text{a.s.}$
for each $k \ge 1$. Further assume that $\sum_{k=1}^n A_{k,n} \le \gamma_1$ and $\max_{1 \le k \le n} A_{k,n} \le \gamma_2 \beta_n$, where $\{\beta_n\}$ is some positive sequence and $\gamma_1, \gamma_2 > 0$ are constants independent of $n$. Then there exists some constant $c > 0$, depending on $\delta, C, \gamma_1, \gamma_2$, such that the stated tail bound holds for $\xi > 0$.
Proof. Let $\alpha^{ij}_{k,n}$ denote the $(i,j)$-th entry of the matrix $\alpha_{k,n}$, and let $X^j_k$ denote the $j$-th entry of the vector $X_k$. Then it is easy to see that the $i$-th entry of the vector $S_n$ satisfies
(A.2) $S^i_n = \sum_{j=1}^d \sum_{k=1}^n \alpha^{ij}_{k,n} X^j_k.$
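The tail behaviour asserted here can be checked empirically. A rough Monte Carlo sketch in the scalar setting of Theorem A.1, with illustrative choices of ours: bounded (hence conditionally sub-exponential) noise $X_k \sim \mathrm{Unif}[-1,1]$ and weights $\alpha_{k,n} = 1/n$, so that $\sum_k A_{k,n} = 1 =: \gamma_1$ and $\max_k A_{k,n} = 1/n$, i.e., $\beta_n = 1/n$:

```python
import random

random.seed(1)
n, trials, xi = 1000, 200, 0.1
exceed = 0
for _ in range(trials):
    # X_k: bounded, mean-zero martingale differences; alpha_{k,n} = 1/n.
    S = sum(random.uniform(-1.0, 1.0) / n for _ in range(n))
    if abs(S) > xi:
        exceed += 1
frac = exceed / trials  # empirical estimate of P(|S_n| > xi)
```

For $n = 1000$, the threshold $\xi = 0.1$ is several standard deviations out, so essentially no trial exceeds it, consistent with an exponentially small tail.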
To prove this, it suffices to show that (B.1) holds for all sufficiently large $n_0$ and all $k \ge n_0$. But observe that
$\frac{a_k}{a_{k+1}} = \frac{k+2}{k+1} \le e^{1/(k+1)}.$
Hence (B.1) holds and our claim follows. From this, for all sufficiently large $n_0$ and $n \ge n_0$, $\beta_n = a_{n-1} = 1/n$. Hence, for all sufficiently large $n_0$, the stated bound holds, where the latter follows by treating the sum as a geometric series. The desired result now follows. Next, consider the case $\lambda \le 1$. Clearly,
$\frac{a_{k+1}}{a_k} = \frac{k+1}{k+2} \le e^{-1/(k+2)} \le e^{-\lambda a_{k+1}}.$
Hence, by arguing as above, it is easy to see that
$e^{-\lambda \sum_{i=k+1}^{n-1} a_i}\, a_k \ge e^{-\lambda \sum_{i=k+2}^{n-1} a_i}\, a_{k+1}$
for all $n_0$, and all $k, n$ such that $n_0 \le k \le n-2$. Fix $n_0$ and $n \ge n_0 + 1$. Then, due to the previous relation,
$\beta_n = e^{-\lambda \sum_{i=n_0+1}^{n-1} a_i}\, a_{n_0}.$
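Both elementary stepsize inequalities used above are instances of $1 + x \le e^x$; for $a_k = 1/(k+1)$ they can be verified directly. A quick numerical check (illustrative only):

```python
import math

a = lambda k: 1.0 / (k + 1)
lam = 1.0  # the case lambda <= 1
for k in range(1, 10000):
    # (k+2)/(k+1) <= e^{1/(k+1)}, since 1 + x <= e^x
    assert a(k) / a(k + 1) <= math.exp(1.0 / (k + 1))
    # (k+1)/(k+2) <= e^{-1/(k+2)} <= e^{-lam * a_{k+1}}
    assert a(k + 1) / a(k) <= math.exp(-lam * a(k + 1))
```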
Using this and l'Hôpital's rule, the claimed limit follows. This proves the desired result.
Proof of Theorem 1.2. Observe that the bound in Lemma B.1 dominates those in Lemmas B.2 and B.3 for the cases $\mu = 1$ and $\mu \in (0, 1)$, respectively. The desired result is now easy to see.