Open Access

A Concentration Bound for TD(0) with Function Approximation

Siddharth Chandak (Corresponding Author)
[email protected]
https://orcid.org/0000-0003-3237-7729
Department of Electrical Engineering, Stanford University, Stanford, California 94305

Vivek S. Borkar
[email protected]
https://orcid.org/0000-0003-0756-5402
Department of Electrical Engineering, Indian Institute of Technology Bombay, Mumbai, Maharashtra 400076, India

Published Online: 26 Dec 2025
https://doi.org/10.1287/stsy.2023.0055

Abstract

We derive a uniform all-time concentration bound of the type “for all $n \geq n_{0}$ for some $n_{0}$” for TD(0) with linear function approximation. We work with online TD learning with samples from a single sample path of the underlying Markov chain. This makes our analysis significantly different from offline TD learning or TD learning with access to independent samples from the stationary distribution of the Markov chain. We treat TD(0) as a contractive stochastic approximation algorithm with both martingale and Markov noises. Markov noise is handled using the Poisson equation, and the lack of almost-sure guarantees on boundedness of iterates is handled using the concept of relaxed concentration inequalities.

Funding: The work of V. S. Borkar was supported in part by the S. S. Bhatnagar Fellowship from the Government of India.

1. Introduction

TD(0) is one of the most popular reinforcement learning (RL) algorithms for policy evaluation (Tsitsiklis and Van Roy 1997). Given a fixed policy, the algorithm is an iterative method to obtain the value function for each state under the long-term discounted reward framework. To mitigate the issues of large state spaces, the value function is often approximated using a linear combination of feature vectors. This algorithm is referred to as TD(0) with linear function approximation. In this paper, we work with online TD(0) with a single sample path of the underlying Markov chain. Our goal is to obtain, for this algorithm, a concentration bound that holds “from some time on” or, more precisely, for all $n \geq n_{0}$ for a suitably chosen $n_{0}$.

A bound of this form was published for TD(0) as a section in our paper titled “Concentration of Contractive Stochastic Approximation and Reinforcement Learning” (Chandak et al. 2022). This paper established an all-time bound for contractive stochastic approximation with Markov noise and applied the bound to asynchronous Q-learning and TD(0). Although the main theorem and its application to asynchronous Q-learning are correct, TD(0) does not satisfy a key assumption for the main theorem,¹ and hence, the theorem was incorrectly applied to TD(0). We remove the need for that assumption in this version, giving a completely different proof tailored to the TD(0) algorithm.

The previous paper required the iterates of the stochastic approximation iteration to be bounded by a constant with probability 1. This assumption is not known to be true for the iterates of online TD(0) with function approximation for a single sample path. In fact, a common method to alleviate this issue is to project the iterates back into a ball centered around the origin (Bhandari et al. 2018, Patil et al. 2023). The key difficulty caused by the lack of this assumption is in applying martingale inequalities, which often require some restrictions on the increments of the martingale that are often not easy to verify. We do not modify the algorithm and instead adapt relaxed concentration inequalities (Chung and Lu 2006, section 8) for our problem. These bounds have an extra term given by the probability of increments going above a certain threshold (Tao and Vu 2015, proposition 34). Although the proof in this paper is restricted to TD(0), the underlying idea of using relaxed concentration inequalities is broadly applicable to other algorithms that face similar challenges because of unboundedness.

1.1. Related Works

There has been growing interest in analyzing the finite-time performance of reinforcement learning (RL) algorithms. Existing results can broadly be categorized by the type of bounds they establish. The most extensive body of work concerns expectation or mean square bounds (see, e.g., Chen et al. 2020, 2021). Another prominent line of research focuses on regret bounds, which characterize how the cumulative error grows over time—typically through almost-sure or expected regret guarantees (see, e.g., Azar et al. 2017, Jin et al. 2018, Yang and Wang 2019, Yang et al. 2020). A third class comprises high-probability or concentration bounds (see, e.g., Even-Dar and Mansour 2003, Qu and Wierman 2020, Li et al. 2023). Our work falls within this category but differs from conventional analyses that establish high-probability guarantees only for sufficiently large time n. In contrast, we derive uniform all-time bounds, that is, bounds that hold for all $n \geq n_{0}$ with probability at least $1 - δ$ .

Specifically for TD(0), moment bounds have been established in Bhandari et al. (2018), Srikant and Ying (2019), and Chen et al. (2021). High-probability bounds have been established under various modifications of the TD(0) algorithm. These include uniform sampling from data sets (Prashanth et al. 2021), projection and tail averaging (Patil et al. 2023), and oracle access to independent and identically distributed (i.i.d.) samples of state–action–next state triplets $(s, a, s^{'})$ (Dalal et al. 2018, Chen et al. 2025).

One of us considered stochastic approximation involving contractive maps and martingale noise and derived maximal concentration bounds for this class of algorithms (Borkar 2022). This covered, in particular, synchronous Q-learning for discounted cost and some related schemes. In Chandak et al. (2022), we extended this work to cover “Markov noise” (Meerkov 1972) in the stochastic approximation scheme, allowing us to give bounds for the asynchronous case. As mentioned before, this work assumed almost-sure boundedness of iterates, which is not satisfied by the TD(0) algorithm. We remove the need for this assumption in our current work. Other articles aiming at bounds of these forms, such as the one we provide, are found in Chandak et al. (2023) for the LSPE algorithm and Borkar (2002), Kamal (2010), and Thoppe and Borkar (2019) for abstract stochastic approximation schemes. A recent work (Chen et al. 2025) considers all-time bounds for iterates without almost-sure boundedness, but they only consider additive and multiplicative noise and not Markovian noise as considered in this paper. Their proof technique relies on Moreau envelopes and a bootstrapping technique, which differs significantly from our work.

1.2. Outline and Notation

The rest of the paper is structured as follows. Section 2 gives a background to the TD(0) algorithm, along with the required assumptions and the stochastic approximation formulation. Section 3 states the main result and provides some insights into the result. The result is proved in Section 4. A concluding section highlights some future directions. Appendix A states a martingale inequality used in our proof, and Appendix B gives proofs for some technical lemmas, which are used to prove the main theorem.

Throughout this work, $‖ \cdot ‖$ denotes the Euclidean norm on $R^{d}$ , and $〈 \cdot, \cdot 〉$ denotes the inner product in $R^{d}$ . $θ$ denotes the zero vector in $R^{d}$ . The $ℓ$ th component of a vector x and a vector-valued function $h (\cdot)$ are denoted by $x (ℓ)$ and $h^{ℓ} (\cdot)$ , respectively.

2. Background on TD(0)

TD(0) is an algorithm for policy evaluation, that is, for learning the performance of a fixed policy, and not for optimizing over policies. Hence, a stationary policy is fixed a priori, giving us a time-homogeneous uncontrolled Markov chain $\{Y_{n}\}$ over a finite state space $S$ . The transition probabilities are given by $p (\cdot | \cdot)$ , where the dependence on the policy is suppressed. Assume that the chain is aperiodic and irreducible with the stationary distribution $π = [π (1), \dots, π (S)], S = | S |$ . Let D denote the $S \times S$ diagonal matrix whose sth diagonal entry is $π (s)$ . Reward $r (s)$ is received when a transition from state s takes place. Note that this reward can be stochastic as well, and the additional noise thereof can be combined with other noise terms without affecting our concentration result. For simplicity, we assume that we receive a deterministic $r (s)$ . The objective is to evaluate the long-term discounted reward for each state given by the value function

V (s) = E [\sum_{m = 0}^{\infty} γ^{m} r (Y_{m}) | Y_{0} = s], s \in S .

Here, $0 < γ < 1$ is the discount factor. The dynamic programming equation for evaluating the same is

V (s) = r (s) + γ \sum_{s^{'} \in S} p (s^{'} | s) V (s^{'}), s \in S .

This can be written as the following vector equation:

V = r + γ P V,

for

r = {[r (1), \dots, r (S)]}^{T}

and

P = {[[p (s^{'} | s)]]}_{s^{'}, s \in S} \in R^{S \times S}
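
As a small illustration of the vector equation $V = r + γ P V$, the following sketch computes $V$ by repeatedly applying the map $V \mapsto r + γ P V$, which is a $γ$-contraction in the sup norm. The two-state chain, the rewards, and the function name `evaluate_policy` are hypothetical choices made only for this example; here `P[s][t]` stands for $p (t | s)$.

```python
def evaluate_policy(P, r, gamma, tol=1e-12, max_iter=100_000):
    """Iterate V <- r + gamma * P V until successive iterates differ by < tol."""
    n = len(r)
    V = [0.0] * n
    for _ in range(max_iter):
        V_new = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(n))
                 for s in range(n)]
        if max(abs(V_new[s] - V[s]) for s in range(n)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical two-state chain (not taken from the paper).
P = [[0.9, 0.1], [0.2, 0.8]]
r = [1.0, 4.0]
gamma = 0.5
V = evaluate_policy(P, r, gamma)

# Residual of the dynamic programming equation V = r + gamma * P V.
residual = max(
    abs(V[s] - (r[s] + gamma * sum(P[s][t] * V[t] for t in range(2))))
    for s in range(2)
)
```

For a large state space this exact computation is what becomes impractical, which motivates the linear approximation introduced next.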

The state space can often be large ( $S \gg 1$ ), and to alleviate this “curse of dimensionality,” V is often approximated using a linear combination of d linearly independent basis functions (feature vectors) $ϕ_{i} \in R^{S}, 1 \leq i \leq d$ , with $S \gg d \geq 1$ . Also, let $φ (s) = {[ϕ_{1} (s), \dots, ϕ_{d} (s)]}^{T}$ for $s \in S$ denote the vector comprising components corresponding to state s in each feature. Thus, $V (s) \approx \sum_{i = 1}^{d} x (i) ϕ_{i} (s)$ ; that is, $V (s) \approx x^{T} φ (s)$ and $V \approx Φ x$ , where $x = {[x (1), \dots, x (d)]}^{T}$ , and $Φ$ is an $S \times d$ matrix whose ith column is $ϕ_{i}$ . Here, x denotes the learnable weights for the linear function approximator. Because $\{ϕ_{i}\}$ are linearly independent, $Φ$ is full rank. Substituting this approximation into the dynamic programming equation above leads to

Φ x \approx r + γ P Φ x .

But the right-hand side (RHS) may not belong to the range of $Φ$ . So we use the following fixed point equation:

Φ x = Π (r + γ P Φ x) ≔ H (Φ x),

(1)

where $Π$ denotes the projection to Range($Φ$) with respect to a suitable norm. It turns out to be convenient to take a projection with respect to the weighted norm

‖ y ‖_{D} ≔ \sqrt{y^{T} D y} = {(\sum_{s \in S} π (s) {(y (s))}^{2})}^{1 / 2}

for $y \in R^{S}$ . The projection map with respect to this norm is

Π y ≔ Φ {(Φ^{T} D Φ)}^{- 1} Φ^{T} D y .

The invertibility of $Φ^{T} D Φ$ is guaranteed by the fact that $Φ$ is full rank. Finally, the TD(0) algorithm is given by the recursion

x_{n + 1} = x_{n} + a (n) φ (Y_{n}) (r (Y_{n}) + γ φ {(Y_{n + 1})}^{T} x_{n} - φ {(Y_{n})}^{T} x_{n}) .

(2)

Here, $a (n)$ denotes the positive step-size sequence. At the end of this section, we explain how this iteration can be expected to converge to the required fixed point from (1).
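
Recursion (2) can be sketched in a few lines. This is a minimal illustration under assumed inputs, not the authors' code: the chain `P`, rewards `r`, feature rows `features[s]` (playing the role of $φ (s)$), and the step size $a (n) = 1 / (n + 2)$ are all hypothetical, and `td0_step` and `run_td0` are names introduced here.

```python
import random


def td0_step(x, phi_y, phi_y_next, reward, gamma, step):
    """One update of recursion (2): x <- x + a(n) * phi(Y_n) * (TD error)."""
    v_now = sum(p * w for p, w in zip(phi_y, x))
    v_next = sum(p * w for p, w in zip(phi_y_next, x))
    td_error = reward + gamma * v_next - v_now
    return [w + step * p * td_error for w, p in zip(x, phi_y)]


def run_td0(P, r, features, gamma, n_steps, seed=0):
    """Run TD(0) along a single sample path, with a(n) = 1 / (n + 2)."""
    rng = random.Random(seed)
    states = list(range(len(P)))
    y = 0
    x = [0.0] * len(features[0])
    for n in range(n_steps):
        # Draw Y_{n+1} from the row p(. | Y_n) of the transition matrix.
        y_next = rng.choices(states, weights=P[y])[0]
        x = td0_step(x, features[y], features[y_next], r[y], gamma,
                     1.0 / (n + 2))
        y = y_next
    return x


# Hypothetical two-state chain with two scaled indicator features.
P = [[0.9, 0.1], [0.2, 0.8]]
r = [1.0, 4.0]
features = [[0.5, 0.0], [0.0, 0.5]]
x_final = run_td0(P, r, features, gamma=0.5, n_steps=5000)
```

Note that the update uses only the observed transition $(Y_{n}, Y_{n + 1})$ from one trajectory; no independent samples from the stationary distribution are drawn.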

2.1. Assumptions

We impose two assumptions on the algorithm. The first is about the feature vectors, and as we explain next, it does not restrict the algorithm. The second specifies the class of step sizes $a (n)$ considered, which is standard in the analysis of RL. In fact, our results hold for a broader class of step sizes than those typically required in stochastic approximation frameworks.

For the assumption on $Φ$ , define $Ψ ≔ Φ^{T} \sqrt{D}$ , and let $λ_{M}$ be the largest singular value of $Ψ$ , that is, the square root of the largest eigenvalue of $Ψ Ψ^{T}$ and, equivalently, of $Ψ^{T} Ψ$ . Assume that
$λ_{M} < \frac{\sqrt{2 (1 - γ)}}{(1 + γ)} .$ (3)
Because the feature vectors can be scaled without affecting the algorithm (the weights $x (i)$ get scaled accordingly), this assumption does not restrict the algorithm. Alternatively, the step size can be appropriately scaled as well; that is, $a (n) = b (n) c$ , where $b (n)$ acts as the effective step size, and $c φ (\cdot)$ act as the effective feature vectors that satisfy the above assumption. This assumption can be replaced with the following assumption on the basis vectors:
$‖ φ (s) ‖ \leq \frac{\sqrt{2 (1 - γ)}}{1 + γ} \forall s \in S;$
that is, the $ℓ_{2}$ norm of each row of $Φ$ is bounded by $\sqrt{2 (1 - γ)} / (1 + γ)$ . To see this, note that
$\frac{‖ Ψ^{T} x ‖_{2}}{‖ x ‖_{2}} = \frac{‖ Φ x ‖_{D}}{‖ x ‖_{2}} \leq \frac{‖ Φ x ‖_{\infty}}{‖ x ‖_{2}} .$
Here, the equality uses $Ψ^{T} x = \sqrt{D} Φ x$ , and the inequality holds because $π$ is a probability distribution. Now, $\max_{x \neq θ} \frac{‖ Φ x ‖_{\infty}}{‖ x ‖_{2}}$ is the operator norm defined with $ℓ_{2}$ norm for domain and $ℓ_{\infty}$ for codomain, which is equal to the maximum $ℓ_{2}$ norm of a row. Hence,
$λ_{M} = \max_{x \neq θ} \frac{‖ Ψ^{T} x ‖_{2}}{‖ x ‖_{2}} \leq \max_{x \neq θ} \frac{‖ Φ x ‖_{\infty}}{‖ x ‖_{2}} = \max_{s \in S} ‖ φ (s) ‖_{2} .$

For the second assumption, $\{a (n)\}$ is a sequence of nonnegative step sizes satisfying the conditions
$a (n) \to 0, \sum_{n} a (n) = \infty$

and is assumed to be nonincreasing; that is, $a (n + 1) \leq a (n) \forall n$ . We also assume that $a (n) < 1$ for all n. We further assume that $\frac{d_{1}}{n + 1} \leq a (n) \leq d_{3} {(\frac{1}{n + 1})}^{d_{2}}, \forall n$ , where $d_{1} > 0$ and $0 < d_{2} \leq 1$ . Larger values of $d_{1}$ and $d_{2}$ and smaller values of $d_{3}$ improve the main result presented below. The role this assumption plays in our bounds will become clear later. Observe that we do not require the classical square-summability condition in stochastic approximation, namely, $\sum_{n} a {(n)}^{2} < \infty$ . This is because the contractive nature of our iterates (Lemma 1) gives us an additional handle on errors by putting less weight on past errors. A similar effect was observed in Chandak et al. (2022). The above assumptions on the step-size sequence can be weakened so as to hold only after some $N > 1$ without any changes in the analysis.
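
Both assumptions are easy to check numerically for a given problem instance. The sketch below is illustrative only: it tests the sufficient row-norm condition $‖ φ (s) ‖ \leq \sqrt{2 (1 - γ)} / (1 + γ)$ rather than (3) itself, rescales the features by a common constant when the condition fails, and verifies the step-size conditions over a finite horizon. The function names, the `margin` parameter, and the numerical values are assumptions of this example.

```python
import math


def feature_norm_bound(gamma):
    """Right-hand side of (3): sqrt(2 (1 - gamma)) / (1 + gamma)."""
    return math.sqrt(2.0 * (1.0 - gamma)) / (1.0 + gamma)


def rescale_features(features, gamma, margin=0.99):
    """Scale every row by one constant c so that max_s ||phi(s)||_2 < bound."""
    max_norm = max(math.sqrt(sum(v * v for v in row)) for row in features)
    bound = feature_norm_bound(gamma)
    c = 1.0 if max_norm < bound else margin * bound / max_norm
    return [[c * v for v in row] for row in features], c


def check_step_sizes(a, d1, d2, d3, horizon):
    """Check a(n) < 1, monotonicity, and d1/(n+1) <= a(n) <= d3/(n+1)^d2."""
    for n in range(horizon):
        if not (a(n) < 1.0 and a(n + 1) <= a(n)):
            return False
        if not (d1 / (n + 1) <= a(n) <= d3 * (1.0 / (n + 1)) ** d2):
            return False
    return True


# Hypothetical inputs.
scaled, c = rescale_features([[3.0, 4.0], [0.0, 1.0]], gamma=0.5)
max_scaled_norm = max(math.sqrt(sum(v * v for v in row)) for row in scaled)
steps_ok = check_step_sizes(lambda n: 0.5 / (n + 1) ** 0.75,
                            d1=0.5, d2=0.75, d3=1.0, horizon=1000)
```

The step size $0.5 / (n + 1)^{0.75}$ used here is not square-summable-free in any special way; it is simply one sequence inside the admissible class described above.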

2.2. Formulation as a Stochastic Approximation Iteration

We next rearrange Algorithm (2) to separate the martingale noise and the Markov noise, and we write it as a stochastic approximation iteration:

\begin{aligned} x_{n + 1} & = x_{n} + a (n) φ (Y_{n}) (r (Y_{n}) + γ φ {(Y_{n + 1})}^{T} x_{n} - φ {(Y_{n})}^{T} x_{n}) \\ & = x_{n} + a (n) (F (x_{n}, Y_{n}) - x_{n} + M_{n + 1} x_{n}) \\ & = x_{n} + a (n) (\sum_{s \in S} π (s) F (x_{n}, s) - x_{n}) + \underbrace{a (n) M_{n + 1} x_{n}}_{τ_{1}} + \underbrace{a (n) (F (x_{n}, Y_{n}) - \sum_{s \in S} π (s) F (x_{n}, s))}_{τ_{2}}, \end{aligned}

where

F (x, Y) = φ (Y) r (Y) + γ φ (Y) \sum_{s^{'} \in S} p (s^{'} | Y) φ {(s^{'})}^{T} x - φ (Y) φ {(Y)}^{T} x + x,

and

M_{n + 1} = γ φ (Y_{n}) (φ {(Y_{n + 1})}^{T} - \sum_{s^{'} \in S} p (s^{'} | Y_{n}) φ {(s^{'})}^{T}) .

Define the family of $σ$ -fields $F_{n} ≔ σ (x_{0}, Y_{m}, m \leq n), n \geq 0$ . Then, $\{M_{n} x_{n - 1}\}$ is a martingale difference sequence with respect to $\{F_{n}\}$ ; that is,

E [M_{n + 1} x_{n} | F_{n}] = θ \text{ a.s. } \forall n,

where $θ$ denotes the zero vector. The term $τ_{1}$ denotes the error because of the martingale noise term, and the term $τ_{2}$ denotes the error because of the Markov noise $\{Y_{n}\}$ .

The following lemma shows that the function $\sum_{s \in S} π (s) F (\cdot, s)$ is a contraction. Let $〈 x, x^{'} 〉 = x^{T} x^{'}$ and ${〈 x, z 〉}_{D} = x^{T} D z$ . Whereas the contraction property of the TD(0) algorithm is well-known (Tsitsiklis and Van Roy 1997), we obtain an explicit expression for the contraction factor.

Lemma 1.

For any $x, z \in R^{d}$ ,

‖ \sum_{s \in S} π (s) (F (x, s) - F (z, s)) ‖ \leq α ‖ x - z ‖,

where

α = \sqrt{1 - \min_{x \neq θ} \frac{‖ Φ x ‖_{D}^{2}}{‖ x ‖^{2}} (2 (1 - γ) - λ_{M}^{2} {(1 + γ)}^{2})} .

Moreover, $0 < α < 1$ , and hence, the function $\sum_{s \in S} π (s) F (\cdot, s)$ is a contraction.

The proof appears in Appendix B. The Banach contraction mapping theorem implies that $\sum_{s \in S} π (s) F (\cdot, s)$ has a unique fixed point $x^{*}$ ; that is, there exists a unique point $x^{*}$ such that $\sum_{s \in S} π (s) F (x^{*}, s) = x^{*}$ . We next show that the fixed point $x^{*}$ is the required fixed point we wish to converge to. Before that, we first observe that

\sum_{s \in S} π (s) F (x, s) = Φ^{T} D r + (γ Φ^{T} D P Φ - Φ^{T} D Φ + I) x .

Then,

\begin{aligned} & \sum_{s \in S} π (s) F (x^{*}, s) = Φ^{T} D r + (γ Φ^{T} D P Φ - Φ^{T} D Φ + I) x^{*} = x^{*} \\ & \Rightarrow Φ^{T} D r + γ Φ^{T} D P Φ x^{*} = Φ^{T} D Φ x^{*} \\ & \Rightarrow {(Φ^{T} D Φ)}^{- 1} (Φ^{T} D r + γ Φ^{T} D P Φ x^{*}) = x^{*} \\ & \Rightarrow Φ {(Φ^{T} D Φ)}^{- 1} Φ^{T} D (r + γ P Φ x^{*}) = Φ x^{*} \\ & \Rightarrow H (Φ x^{*}) = Φ x^{*} . \end{aligned}

So $Φ x^{*}$ is the required fixed point of (1).
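
For reference, the fixed-point condition can also be rearranged into a closed form for $x^{*}$. The following is a sketch rather than a statement from the paper: the shorthand $b$ and $A$ is introduced only here, and invertibility is deduced from Lemma 1, which gives $‖ I + A ‖ \leq α < 1$ and hence that $- A = I - (I + A)$ is invertible.

```latex
\begin{aligned}
b &:= \Phi^{T} D r, \qquad A := \gamma \Phi^{T} D P \Phi - \Phi^{T} D \Phi, \\
x^{*} = b + (I + A) x^{*}
  \;&\Longleftrightarrow\; -A x^{*} = b
  \;\Longleftrightarrow\; \Phi^{T} D (I - \gamma P) \Phi \, x^{*} = \Phi^{T} D r, \\
x^{*} &= \bigl(\Phi^{T} D (I - \gamma P) \Phi\bigr)^{-1} \Phi^{T} D r .
\end{aligned}
```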

3. Main Result

Before stating the main result, we define the following two sequences. For $n \geq 0$ ,

\begin{aligned} b_{k} (n) & = \sum_{m = k}^{n} a (m), \quad 0 \leq k \leq n < \infty, \\ β_{k} (n) & = \begin{cases} \frac{1}{k^{d_{2} - d_{1}} n^{d_{1}}}, & \text{if } d_{1} \leq d_{2}, \\ \frac{1}{n^{d_{2}}}, & \text{otherwise} . \end{cases} \end{aligned}
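
The two sequences are straightforward to evaluate. The helper names `b` and `beta` and the sample arguments below are chosen for this sketch only; `a` is any step-size function $n \mapsto a (n)$.

```python
def b(k, n, a):
    """b_k(n) = sum of a(m) for m = k, ..., n."""
    return sum(a(m) for m in range(k, n + 1))


def beta(k, n, d1, d2):
    """beta_k(n) as defined above, for k, n >= 1."""
    if d1 <= d2:
        return 1.0 / (k ** (d2 - d1) * n ** d1)
    return 1.0 / n ** d2


# Hypothetical values: a(m) = 1 / (m + 1).
b_example = b(0, 1, lambda m: 1.0 / (m + 1))   # 1 + 1/2
beta_small_d1 = beta(4, 9, d1=0.5, d2=1.0)     # 1 / (4^0.5 * 9^0.5)
beta_large_d1 = beta(4, 9, d1=2.0, d2=1.0)     # 1 / 9
```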

Our main result is as follows:

Theorem 1.

There exist finite positive constants $c_{1}, c_{2}$ and D such that for $0 < δ \leq 1, 0 < ϵ \leq 1$ , $n_{0} > 0$ large enough to satisfy $α + a (n_{0}) c_{1} < 1$ , and $n \geq n_{0}$ ,

the inequality
$‖ x_{m} - x^{*} ‖ \leq e^{- (1 - α) b_{n_{0}} (m - 1)} ϵ + \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}}, \forall n_{0} \leq m \leq n,$
holds with probability exceeding
$1 - 2 d \sum_{m = n_{0} + 1}^{n} e^{- D δ^{2} / β_{n_{0}} (m - 1)} - P (‖ x_{n_{0}} - x^{*} ‖ > ϵ) .$
In particular,
$‖ x_{m} - x^{*} ‖ \leq e^{- (1 - α) b_{n_{0}} (m - 1)} ϵ + \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}}, \forall m \geq n_{0},$

holds with probability exceeding

1 - 2 d \sum_{m \geq n_{0} + 1} e^{- D δ^{2} / β_{n_{0}} (m - 1)} - P (‖ x_{n_{0}} - x^{*} ‖ > ϵ) .

The following are some remarks about the theorem and the proof that follows.

Remark 1.

The assumption that $δ \leq 1$ and $ϵ \leq 1$ is used in the proof for Lemma 3 and has been made only for simplicity. These can be taken as any positive values, with changes required only in the constant D.

Remark 2.

The term $P (‖ x_{n_{0}} - x^{*} ‖ > ϵ)$ captures the unavoidable contribution of the initial condition at $n_{0}$ . This can be bounded by combining moment bounds (Bhandari et al. 2018, Srikant and Ying 2019, Chen et al. 2021) with Markov’s inequality.
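
Concretely, if a second-moment bound on $‖ x_{n_{0}} - x^{*} ‖$ is available from one of the cited works, Markov's inequality applied to the squared norm gives the following (a standard step, written out here only for convenience):

```latex
P\bigl(\| x_{n_{0}} - x^{*} \| > \epsilon\bigr)
  = P\bigl(\| x_{n_{0}} - x^{*} \|^{2} > \epsilon^{2}\bigr)
  \leq \frac{E\bigl[\| x_{n_{0}} - x^{*} \|^{2}\bigr]}{\epsilon^{2}} .
```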

Remark 3.

In Chen et al. (2025), an all-time bound is obtained, which goes to zero as $m ↑ \infty$ . In our bound, the term $\frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}}$ remains constant as m is increased. Here, $δ$ can be modified to $δ (m)$ (similar to the treatment in Chandak et al. 2022, corollary 1). But the term $a (n_{0}) (c_{2} + c_{1} ϵ)$ arises from our treatment of Markov noise using the Poisson equation. Note that Chen et al. (2025) do not consider the case of Markov noise, but only consider the case of additive and multiplicative noise (i.i.d. samples of state-action-next state triplets). We leave incorporating their ideas into our approach to get a bound decaying with m for Markov noise as future work.

Remark 4.

For the special case of $a (n) = \frac{d_{1}}{n + 1}$ , we combine our result with Chen et al. (2021, theorem 2.1), a mean square error bound, to get the following corollary.

Corollary 1.

Let $a (n) = d_{1} / (n + 1)$ with sufficiently large $d_{1}$ . Let $n_{0}$ be large enough to satisfy assumptions of Theorem 1 and Chen et al. (2021, theorem 2.1). Then, with probability at least $1 - ε_{1} - ε_{2}$ , we have, for all $m \geq n_{0}$ , that

‖ x_{m} - x^{*} ‖ = O (\frac{1}{\sqrt{n_{0}}} \log^{1 / 2} (\frac{1}{ε_{1}}) + \sqrt{\frac{\log (n_{0})}{n_{0}}} \frac{1}{\sqrt{ε_{2}}} (\frac{n_{0}}{m} + \frac{1}{n_{0}})) .

The proof for this corollary has been presented at the end of Appendix B. The first term here corresponds to the term $δ$ in Theorem 1. This term has a $\sqrt{1 / n_{0}}$ decay rate and an exponentially small tail. The second term is the contribution of the initial condition at $n_{0}$ . We have a polynomial tail in this case, but the dependence on m and $n_{0}$ is stronger, as the term $(\sqrt{\log (n_{0})} / \sqrt{n_{0}}) \times (n_{0} / m)$ decays with m, and the other term decays as $\log^{1 / 2} (n_{0}) n_{0}^{- 3 / 2}$ .

4. Proof of the Main Result

We present the proof of the main theorem in this section. The key martingale concentration inequality used in our proof is stated in Appendix A, and proofs for the technical lemmas used in the proof are presented in Appendix B.

Proof of Theorem 1.

Define $z_{n}$ for $n \geq n_{0}$ by

z_{n + 1} = z_{n} + a (n) (\sum_{s} π (s) F (z_{n}, s) - z_{n}),

where $z_{n_{0}} = x_{n_{0}}$ . Note that $‖ x_{n} - x^{*} ‖ \leq ‖ x_{n} - z_{n} ‖ + ‖ z_{n} - x^{*} ‖$ . To bound the second term, note that

\begin{aligned} z_{n + 1} - x^{*} & = (1 - a (n)) (z_{n} - x^{*}) + a (n) (\sum_{s \in S} π (s) F (z_{n}, s) - x^{*}) \\ & = (1 - a (n)) (z_{n} - x^{*}) + a (n) \sum_{s \in S} π (s) (F (z_{n}, s) - F (x^{*}, s)) . \end{aligned}

The second equality follows from the fact that $x^{*}$ is a fixed point for $\sum_{s \in S} π (s) F (\cdot, s)$ . Then,

\begin{aligned} ‖ z_{n + 1} - x^{*} ‖ & \leq (1 - a (n)) ‖ z_{n} - x^{*} ‖ + a (n) ‖ \sum_{s \in S} π (s) (F (z_{n}, s) - F (x^{*}, s)) ‖ \\ & \leq (1 - (1 - α) a (n)) ‖ z_{n} - x^{*} ‖, \end{aligned}

which finally gives us

‖ z_{n} - x^{*} ‖ \leq \prod_{k = n_{0}}^{n - 1} (1 - (1 - α) a (k)) ‖ x_{n_{0}} - x^{*} ‖ \leq e^{- (1 - α) b_{n_{0}} (n - 1)} ‖ x_{n_{0}} - x^{*} ‖ .

(4)

This also implies that for all $n \geq n_{0}$ ,

‖ z_{n} ‖ \leq ‖ x_{n_{0}} - x^{*} ‖ + ‖ x^{*} ‖ .

(5)
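
Bound (4) can be checked on a toy instance. The sketch below replaces the averaged map $\sum_{s} π (s) F (\cdot, s)$ by a hypothetical scalar contraction $G (z) = α z + c$ with fixed point $c / (1 - α)$, and uses the illustrative step size $a (n) = 1 / (n + 2)$; none of these choices come from the paper.

```python
import math


def check_bound_4(alpha=0.8, c=1.0, z0=0.0, n0=3, n_max=200):
    """Verify |z_n - x*| <= exp(-(1 - alpha) b_{n0}(n-1)) |z_{n0} - x*|."""
    x_star = c / (1.0 - alpha)          # fixed point of G(z) = alpha * z + c
    init_err = abs(z0 - x_star)
    z = z0                              # z_{n0}
    b = 0.0                             # b_{n0}(n - 1), an empty sum at n = n0
    for n in range(n0, n_max):
        bound = math.exp(-(1.0 - alpha) * b) * init_err
        if abs(z - x_star) > bound + 1e-12:
            return False
        step = 1.0 / (n + 2)            # a(n)
        z = z + step * (alpha * z + c - z)
        b += step
    return True


bound_holds = check_bound_4()
```

The check passes because each factor satisfies $1 - (1 - α) a (k) \leq e^{- (1 - α) a (k)}$, which is exactly the step from the product to the exponential in (4).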

Next, we give a probabilistic bound on the term $‖ x_{n} - z_{n} ‖$ . Note that

\begin{aligned} x_{n + 1} - z_{n + 1} & = (1 - a (n)) (x_{n} - z_{n}) + a (n) M_{n + 1} x_{n} + a (n) (F (x_{n}, Y_{n}) - \sum_{s} π (s) F (z_{n}, s)) \\ & = (1 - a (n)) (x_{n} - z_{n}) + a (n) M_{n + 1} x_{n} + a (n) (\sum_{s \in S} π (s) (F (x_{n}, s) - F (z_{n}, s))) \\ & \quad + a (n) (F (x_{n}, Y_{n}) - \sum_{s \in S} π (s) F (x_{n}, s)) . \end{aligned}

For $n, m \geq 0$ , let $χ (n, m) = \prod_{k = m}^{n} (1 - a (k))$ if $n \geq m$ , and one otherwise. For some $n \geq n_{0}$ , we iterate the above for $n_{0} \leq m \leq n$ to obtain

\begin{aligned} x_{m + 1} - z_{m + 1} & = \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) M_{k + 1} x_{k} \\ & \quad + \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (\sum_{s \in S} π (s) (F (x_{k}, s) - F (z_{k}, s))) \\ & \quad + \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (F (x_{k}, Y_{k}) - \sum_{s \in S} π (s) F (x_{k}, s)) . \end{aligned}

(6)

Here, we use the definition that $x_{n_{0}} = z_{n_{0}}$ . We first simplify the third term above. For simplicity, we define $F (x, Y) = F_{1} (Y) + F_{2} (Y) x + x,$ where

F_{1} (Y) = φ (Y) r (Y) \in R^{d} and F_{2} (Y) = (γ φ (Y) \sum_{s^{'} \in S} p (s^{'} | Y) φ {(s^{'})}^{T} - φ (Y) φ {(Y)}^{T}) \in R^{d \times d} .

We define $U : S \mapsto R^{d}$ to be a solution of the Poisson equation

U (s) = F_{1} (s) - \sum_{s^{'} \in S} π (s^{'}) F_{1} (s^{'}) + \sum_{s^{'} \in S} p (s^{'} | s) U (s^{'}), s \in S .

(7)

For $s_{0} \in S$ , $τ ≔ \min \{n > 0 : Y_{n} = s_{0}\}$ , and $E_{s} [\dots] = E [\dots | Y_{0} = s]$ , we know that

U^{'} (s) = E_{s} [\sum_{m = 0}^{τ - 1} (F_{1} (Y_{m}) - \sum_{s^{'} \in S} π (s^{'}) F_{1} (s^{'}))], s \in S

(8)

is one particular solution to the Poisson equation (see, e.g., Borkar 1991, pp. 85–91, section VI.4, lemma 4.2 and theorem 4.2). Thus, $‖ U^{'} (s) ‖_{\infty} \leq 2 \max_{s \in S} ‖ F_{1} (s) ‖_{\infty} E_{s} [τ]$ . For an irreducible Markov chain with a finite state space, $E_{s} [τ]$ is finite for all s, and hence, the solution $U^{'} (s)$ is bounded for all s. For each $ℓ$ , the Poisson equation specifies $U^{ℓ} (\cdot)$ uniquely only up to an additive constant. Along with the additional constraint that $U (s_{0}) = 0$ for a prescribed $s_{0} \in S$ , the system of equations given by (7) has a unique solution. Henceforth, U refers to the unique solution of the Poisson equation with $U (s_{0}) = 0$ . Similarly, let $W : S \mapsto R^{d \times d}$ be the unique solution of the Poisson equation

W (s) = F_{2} (s) - \sum_{s^{'}} π (s^{'}) F_{2} (s^{'}) + \sum_{s^{'}} p (s^{'} | s) W (s^{'}), s \in S,

(9)

with the additional constraint that $W (s_{0}) = 0$ for a prescribed $s_{0} \in S$ as above.

□
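
To make the Poisson equation concrete, the sketch below solves a scalar instance of (7) (one component $ℓ$ of $F_{1}$) on a hypothetical two-state chain. It accumulates the series $\sum_{j} P^{j} (f - \bar{f})$, which converges for an aperiodic irreducible finite chain, and then subtracts the value at $s_{0}$ to enforce $U (s_{0}) = 0$. This numerical route is an assumption of the example; the paper itself only uses the existence and boundedness of the solution.

```python
def stationary_distribution(P, n_iter=2000):
    """Approximate pi by iterating mu <- mu P from the uniform distribution."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(n_iter):
        mu = [sum(mu[s] * P[s][t] for s in range(n)) for t in range(n)]
    return mu


def solve_poisson(P, f, s0=0, n_iter=2000):
    """Solve U(s) = f(s) - pi.f + sum_t p(t|s) U(t) with U(s0) = 0."""
    n = len(P)
    pi = stationary_distribution(P)
    f_bar = sum(pi[s] * f[s] for s in range(n))
    g = [f[s] - f_bar for s in range(n)]
    U = [0.0] * n
    for _ in range(n_iter):
        # After k passes, U equals sum_{j < k} P^j g.
        U = [g[s] + sum(P[s][t] * U[t] for t in range(n)) for s in range(n)]
    return [U[s] - U[s0] for s in range(n)], f_bar


# Hypothetical chain and scalar function f (not from the paper).
P = [[0.9, 0.1], [0.2, 0.8]]
f = [1.0, 4.0]
U, f_bar = solve_poisson(P, f)

# Residual of the Poisson equation at each state.
residuals = [
    abs(U[s] - (f[s] - f_bar + sum(P[s][t] * U[t] for t in range(2))))
    for s in range(2)
]
```

Shifting the solution by a constant leaves the equation unchanged because each row of `P` sums to one, which is why the normalization $U (s_{0}) = 0$ is needed to pin down a unique solution.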

The following lemma gives a simplification of the third term in (6), using the solutions of the Poisson equation stated above. Before stating the lemma, we first define $x_{m}^{'} = \sup_{n_{0} \leq k \leq m} ‖ x_{k} - z_{k} ‖$ .

Lemma 2.

There exist positive constants $c_{1}, c_{2}$ such that for all $n_{0} \leq m \leq n$ ,

\begin{array}{l} \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (F (x_{k}, Y_{k}) - \sum_{s \in S} π (s) F (x_{k}, s)) \\ = \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) ({\tilde{U}}_{k + 1} + {\tilde{W}}_{k + 1} x_{k}) + μ_{m} (n_{0}), \end{array}

where

‖ μ_{m} (n_{0}) ‖ \leq a (n_{0}) (c_{2} + c_{1} x_{m}^{'} + c_{1} ‖ x_{n_{0}} - x^{*} ‖) .

Here, $\{{\tilde{U}}_{k + 1}\}$ and $\{{\tilde{W}}_{k + 1} x_{k}\}$ are martingale difference sequences with respect to $\{F_{k}\}$ , where ${\tilde{U}}_{k + 1} = U (Y_{k + 1}) - \sum_{s^{'}} p (s^{'} | Y_{k}) U (s^{'})$ and ${\tilde{W}}_{k + 1} = W (Y_{k + 1}) - \sum_{s^{'}} p (s^{'} | Y_{k}) W (s^{'})$ for $k \geq n_{0}$ , and they are the zero vector and the zero matrix, respectively, otherwise.

The proof appears in Appendix B. Returning to (6), we now have

\begin{aligned} x_{m + 1} - z_{m + 1} & = \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (\sum_{s \in S} π (s) (F (x_{k}, s) - F (z_{k}, s))) \\ & \quad + \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (M_{k + 1} x_{k} + {\tilde{W}}_{k + 1} x_{k} + {\tilde{U}}_{k + 1}) + μ_{m} (n_{0}) . \end{aligned}

Now,

\begin{aligned} ‖ x_{m + 1} - z_{m + 1} ‖ & \leq ‖ \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (\sum_{s \in S} π (s) (F (x_{k}, s) - F (z_{k}, s))) ‖ \\ & \quad + ‖ \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (M_{k + 1} x_{k} + {\tilde{W}}_{k + 1} x_{k} + {\tilde{U}}_{k + 1}) ‖ \\ & \quad + a (n_{0}) (c_{2} + c_{1} x_{m}^{'} + c_{1} ‖ x_{n_{0}} - x^{*} ‖) \\ & \leq α \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) ‖ x_{k} - z_{k} ‖ + a (n_{0}) (c_{2} + c_{1} x_{m}^{'} + c_{1} ‖ x_{n_{0}} - x^{*} ‖) \\ & \quad + ‖ \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (M_{k + 1} x_{k} + {\tilde{W}}_{k + 1} x_{k} + {\tilde{U}}_{k + 1}) ‖ . \end{aligned}

(10)

For any $0 < k \leq m$ ,

χ (m, k) + χ (m, k + 1) a (k) = χ (m, k + 1),

and hence,

χ (m, n_{0}) + \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) = χ (m, m + 1) = 1 .

This implies that

\sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) \leq 1 .
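
The telescoping identity behind this bound is easy to confirm numerically. In the sketch below, the step size $a (k) = 1 / (k + 2)$ and the indices `n0`, `m` are arbitrary illustrative choices, and `chi` follows the definition of $χ (n, m)$ given earlier.

```python
def chi(n, m, a):
    """chi(n, m) = prod_{k=m}^{n} (1 - a(k)) if n >= m, and 1 otherwise."""
    out = 1.0
    for k in range(m, n + 1):   # empty range when n < m, so out stays 1
        out *= 1.0 - a(k)
    return out


def a(k):
    """Hypothetical step size with 0 < a(k) < 1."""
    return 1.0 / (k + 2)


n0, m = 3, 50
weights = [chi(m, k + 1, a) * a(k) for k in range(n0, m + 1)]
total = chi(m, n0, a) + sum(weights)   # should equal chi(m, m + 1) = 1
```

Since $χ (m, n_{0}) \geq 0$ when every $a (k) < 1$, the weights $χ (m, k + 1) a (k)$ are nonnegative and sum to at most one, so the weighted sum of $‖ x_{k} - z_{k} ‖$ in (10) is at most $x_{m}^{'}$.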

Using the definition of $x_{m}^{'}$ , we have

\begin{aligned} x_{m + 1}^{'} & \leq (α + a (n_{0}) c_{1}) x_{m}^{'} + ‖ \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (M_{k + 1} x_{k} + {\tilde{W}}_{k + 1} x_{k} + {\tilde{U}}_{k + 1}) ‖ \\ & \quad + a (n_{0}) (c_{2} + c_{1} ‖ x_{n_{0}} - x^{*} ‖) . \end{aligned}

(11)

Next, we wish to obtain a bound on the probability

P (‖ x_{m} - x^{*} ‖ \leq \exp (- (1 - α) b_{n_{0}} (m - 1)) ϵ + Δ (n_{0}, ϵ, δ), \forall n_{0} \leq m \leq n),

for some $ϵ > 0$ and $δ > 0$ (recall the assumption that $α + a (n_{0}) c_{1} < 1$ ). For ease of notation, here, we have defined

Δ (n_{0}, ϵ, δ) ≔ \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}} .

From (4), recall that $‖ z_{n} - x^{*} ‖ \leq \exp (- (1 - α) b_{n_{0}} (n - 1)) ‖ x_{n_{0}} - x^{*} ‖$ a.s., and hence,

‖ x_{n_{0}} - x^{*} ‖ \leq ϵ \Rightarrow ‖ z_{m} - x^{*} ‖ \leq \exp (- (1 - α) b_{n_{0}} (m - 1)) ϵ .

Also recall that $\sup_{n_{0} \leq m \leq n} ‖ x_{m} - z_{m} ‖ = x_{n}^{'}$ . Hence,

\{‖ x_{n_{0}} - x^{*} ‖ \leq ϵ\} \cap \{x_{n}^{'} \leq Δ (n_{0}, ϵ, δ)\} \subseteq \{‖ x_{m} - x^{*} ‖ \leq \exp (- (1 - α) b_{n_{0}} (m - 1)) ϵ + Δ (n_{0}, ϵ, δ), \forall n_{0} \leq m \leq n\} .

This implies the following relation between the probabilities of the two sets:

\begin{aligned} & P (‖ x_{m} - x^{*} ‖ \leq \exp (- (1 - α) b_{n_{0}} (m - 1)) ϵ + Δ (n_{0}, ϵ, δ), \forall n_{0} \leq m \leq n) \\ & \quad \geq 1 - P (\{x_{n}^{'} > Δ (n_{0}, ϵ, δ)\} \cup \{‖ x_{n_{0}} - x^{*} ‖ > ϵ\}) . \end{aligned}

To compensate for the lack of an almost-sure bound on the iterates $\{x_{n}\}$ , we adapt the proof method from Tao and Vu (2015, proposition 34) (see Chung and Lu 2006, section 8, for a detailed explanation). For this, we define $ξ = \{x_{0}, Y_{k}, k \geq 0\}$ and the “bad” set $B_{m}$ as

B_{m} = \{ξ : x_{m}^{'} (ξ) > Δ (n_{0}, ϵ, δ) \text{ or } ‖ x_{n_{0}} (ξ) - x^{*} ‖ > ϵ\} .

Here, the notation $x_{m}^{'} (ξ)$ and $x_{n_{0}} (ξ)$ highlights the dependence of $x_{m}^{'}$ and $x_{n_{0}}$ on the realizations of $x_{0}$ and $\{Y_{k}\}$ . Analogous notation is used for other random variables. For $ξ \notin B_{n - 1}$ , let us define ${\bar{x}}_{k, n - 1} (ξ) = x_{k} (ξ)$ and ${\bar{z}}_{k, n - 1} (ξ) = z_{k} (ξ)$ for all k. For $ξ \in B_{n - 1}$ , we define ${\bar{x}}_{k, n - 1} (ξ) = x^{*}$ and ${\bar{z}}_{k, n - 1} (ξ) = x^{*}$ for all k. Also, define ${\bar{x}}_{m, n - 1}^{'} (ξ) = \sup_{n_{0} \leq k \leq m} ‖ {\bar{x}}_{k, n - 1} (ξ) - {\bar{z}}_{k, n - 1} (ξ) ‖$ . Note that ${\bar{x}}_{m, n - 1}^{'} = 0$ when $ξ \in B_{n - 1}$ , and ${\bar{x}}_{m, n - 1}^{'} = x_{m}^{'} \leq Δ (n_{0}, ϵ, δ)$ for all $m \leq n - 1$ when $ξ \notin B_{n - 1}$ . The intuition behind these definitions is that ${\bar{x}}_{m, n - 1}^{'}$ is always bounded by $Δ (n_{0}, ϵ, δ)$ for all $m \leq n - 1$ .

Note that $ξ \notin B_{n - 1} \Rightarrow {\bar{x}}_{n, n - 1}^{'} (ξ) = x_{n}^{'} (ξ)$ , which implies $P ({\bar{x}}_{n, n - 1}^{'} (ξ) \neq x_{n}^{'} (ξ)) \leq P (B_{n - 1}) .$ Henceforth, we drop $ξ$ for ease of notation, rendering implicit the dependence of all random variables on $ξ$ . Then,

\begin{aligned} P (x_{n}^{'} > Δ (n_{0}, ϵ, δ)) & \leq P (\{x_{n}^{'} > Δ (n_{0}, ϵ, δ)\} \cup \{‖ x_{n_{0}} - x^{*} ‖ > ϵ\}) \\ & \overset{(a)}{\leq} P (\{{\bar{x}}_{n, n - 1}^{'} > Δ (n_{0}, ϵ, δ)\} \cup \{{\bar{x}}_{n, n - 1}^{'} \neq x_{n}^{'}\} \cup \{‖ x_{n_{0}} - x^{*} ‖ > ϵ\}) \\ & \overset{(b)}{\leq} P ({\bar{x}}_{n, n - 1}^{'} > Δ (n_{0}, ϵ, δ)) + P (\{{\bar{x}}_{n, n - 1}^{'} \neq x_{n}^{'}\} \cup \{‖ x_{n_{0}} - x^{*} ‖ > ϵ\}) \\ & \overset{(c)}{\leq} P ({\bar{x}}_{n, n - 1}^{'} > Δ (n_{0}, ϵ, δ)) + P (B_{n - 1}) \\ & = P ({\bar{x}}_{n, n - 1}^{'} > Δ (n_{0}, ϵ, δ)) + P (\{x_{n - 1}^{'} > Δ (n_{0}, ϵ, δ)\} \cup \{‖ x_{n_{0}} - x^{*} ‖ > ϵ\}) . \end{aligned}

Inequality (a) here follows from the observation that

\{{\bar{x}}_{n, n - 1}^{'} \leq Δ (n_{0}, ϵ, δ)\} \cap \{{\bar{x}}_{n, n - 1}^{'} = x_{n}^{'}\} \subseteq \{x_{n}^{'} \leq Δ (n_{0}, ϵ, δ)\},

which implies that

\{x_{n}^{'} > Δ (n_{0}, ϵ, δ)\} \subseteq \{{\bar{x}}_{n, n - 1}^{'} > Δ (n_{0}, ϵ, δ)\} \cup \{{\bar{x}}_{n, n - 1}^{'} \neq x_{n}^{'}\},

which gives us the required inequality. Inequality (b) follows from the union bound, and inequality (c) follows from the observations that $\{‖ x_{n_{0}} - x^{*} ‖ > ϵ\} \subseteq B_{n - 1}$ and $\{{\bar{x}}_{n, n - 1}^{'} \neq x_{n}^{'}\} \subseteq B_{n - 1}$ .

Now, we obtain a bound for $P ({\bar{x}}_{n, n - 1}^{'} > Δ (n_{0}, ϵ, δ))$ by induction. We first note that ${\bar{x}}_{n - 1, n - 1}^{'}$ is bounded by $Δ (n_{0}, ϵ, δ)$ by definition. Hence, $P ({\bar{x}}_{n, n - 1}^{'} > Δ (n_{0}, ϵ, δ)) = P (‖ {\bar{x}}_{n, n - 1} - {\bar{z}}_{n, n - 1} ‖ > Δ (n_{0}, ϵ, δ))$ . We first restate (11) for $m = n - 1$ .

\begin{array}{l} ‖ x_{n} - z_{n} ‖ \leq x_{n}^{'} \leq & (α + a (n_{0}) c_{1}) x_{n - 1}^{'} + a (n_{0}) (c_{2} + c_{1} ‖ x_{n_{0}} - x^{*} ‖) \\ + ‖ \sum_{k = n_{0}}^{n - 1} χ (m, k + 1) a (k) (M_{k + 1} x_{k} + {\tilde{W}}_{k + 1} x_{k} + {\tilde{U}}_{k + 1}) ‖ . \end{array}

Now, let $I \{\cdot\}$ denote the indicator function, which is one when $\{\cdot\}$ holds true, and zero otherwise.

\begin{array}{l} ‖ {\bar{x}}_{n, n - 1} - {\bar{z}}_{n, n - 1} ‖ \\ = ‖ {\bar{x}}_{n, n - 1} - {\bar{z}}_{n, n - 1} ‖ I \{ξ_{n - 1} \in B_{n - 1}\} + ‖ {\bar{x}}_{n, n - 1} - {\bar{z}}_{n, n - 1} ‖ I \{ξ_{n - 1} \notin B_{n - 1}\} \\ \overset{(a)}{=} 0 \times I \{ξ_{n - 1} \in B_{n - 1}\} + ‖ x_{n} - z_{n} ‖ \times I \{ξ_{n - 1} \notin B_{n - 1}\} \\ \leq I \{ξ_{n - 1} \notin B_{n - 1}\} \times ((α + a (n_{0}) c_{1}) x_{n - 1}^{'} + a (n_{0}) (c_{2} + c_{1} ‖ x_{n_{0}} - x^{*} ‖) \\ + ‖ \sum_{k = n_{0}}^{n - 1} χ (n - 1, k + 1) a (k) (M_{k + 1} x_{k} + {\tilde{W}}_{k + 1} x_{k} + {\tilde{U}}_{k + 1}) ‖) \\ \overset{(b)}{\leq} I \{ξ_{n - 1} \notin B_{n - 1}\} \times ((α + a (n_{0}) c_{1}) Δ (n_{0}, ϵ, δ) + a (n_{0}) (c_{2} + c_{1} ϵ) \\ + ‖ \sum_{k = n_{0}}^{n - 1} χ (n - 1, k + 1) a (k) (M_{k + 1} {\bar{x}}_{k, n - 1} + {\tilde{W}}_{k + 1} {\bar{x}}_{k, n - 1} + {\tilde{U}}_{k + 1}) ‖) . \end{array}

Here, equality (a) follows from our definition of $B_{n - 1}$ : we have $‖ {\bar{x}}_{n, n - 1} - {\bar{z}}_{n, n - 1} ‖ = 0$ when $ξ_{n - 1} \in B_{n - 1}$ and $x_{n} = {\bar{x}}_{n, n - 1}$ when $ξ_{n - 1} \notin B_{n - 1}$ . Inequality (b) follows from the fact that when $ξ_{n - 1} \notin B_{n - 1}$ , we have $x_{k} = {\bar{x}}_{k, n - 1}$ for all k, $x_{n - 1}^{'} \leq Δ (n_{0}, ϵ, δ)$ , and $‖ x_{n_{0}} - x^{*} ‖ \leq ϵ$ . Substituting the expression for $Δ (n_{0}, ϵ, δ)$ , we obtain the following:

\begin{array}{l} ‖ {\bar{x}}_{n, n - 1} - {\bar{z}}_{n, n - 1} ‖ & \leq (α + a (n_{0}) c_{1}) \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}} + a (n_{0}) (c_{2} + c_{1} ϵ) \\ + ‖ \sum_{k = n_{0}}^{n - 1} χ (n - 1, k + 1) a (k) (M_{k + 1} {\bar{x}}_{k, n - 1} + {\tilde{W}}_{k + 1} {\bar{x}}_{k, n - 1} + {\tilde{U}}_{k + 1}) ‖ \\ \leq \frac{a (n_{0}) (c_{2} + c_{1} ϵ)}{1 - α - a (n_{0}) c_{1}} + \frac{α + a (n_{0}) c_{1}}{1 - α - a (n_{0}) c_{1}} δ \\ + ‖ \sum_{k = n_{0}}^{n - 1} χ (n - 1, k + 1) a (k) (M_{k + 1} {\bar{x}}_{k, n - 1} + {\tilde{W}}_{k + 1} {\bar{x}}_{k, n - 1} + {\tilde{U}}_{k + 1}) ‖ . \end{array}

When

‖ \sum_{k = n_{0}}^{n - 1} χ (n - 1, k + 1) a (k) (M_{k + 1} {\bar{x}}_{k, n - 1} + {\tilde{W}}_{k + 1} {\bar{x}}_{k, n - 1} + {\tilde{U}}_{k + 1}) ‖ \leq δ,

we have

‖ {\bar{x}}_{n, n - 1} - {\bar{z}}_{n, n - 1} ‖ \leq \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}} .

Hence,

\begin{array}{l} P ({\bar{x}}_{n, n - 1}^{'} > \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}}) \leq P (‖ \sum_{k = n_{0}}^{n - 1} χ (n - 1, k + 1) a (k) (M_{k + 1} {\bar{x}}_{k, n - 1} + {\tilde{W}}_{k + 1} {\bar{x}}_{k, n - 1} + {\tilde{U}}_{k + 1}) ‖ > δ) . \end{array}

Let us denote the probability on the right side of the inequality as $p_{n - 1}$ . Then,

P (x_{n}^{'} > \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}}) \leq p_{n - 1} + P (x_{n - 1}^{'} > \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}} ⋃ ‖ x_{n_{0}} - x^{*} ‖ > ϵ) .

Then, repeating the same procedure using $B_{n - 2}$ , we obtain

\begin{array}{l} P (x_{n - 1}^{'} > \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}} ⋃ ‖ x_{n_{0}} - x^{*} ‖ > ϵ) \\ \leq p_{n - 2} + P (x_{n - 2}^{'} > \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}} ⋃ ‖ x_{n_{0}} - x^{*} ‖ > ϵ) . \end{array}

Iterating this for $n \geq m \geq n_{0} + 1$ , we get

P (x_{n}^{'} > \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}}) \leq \sum_{m = n_{0} + 1}^{n} p_{m - 1} + P (‖ x_{n_{0}} - x^{*} ‖ > ϵ) .

The probabilities $p_{m}$ can be bounded using standard martingale inequalities as the terms of the martingale difference sequence are almost surely bounded. The following lemma, proved in Appendix B, gives a bound on the probabilities $p_{m}$ :

Lemma 3.

There exists a positive constant D such that for $0 < ϵ \leq 1, 0 < δ \leq 1$ ,

p_{m} \leq 2 d e^{- D δ^{2} / β_{n_{0}} (m)} .

Recall that d here denotes the dimension of the iterates $\{x_{n}\}$ .

This completes the proof for the first part of Theorem 1.

Let $A_{n}$ be the set

\{‖ x_{m} - x^{*} ‖ \leq e^{- (1 - α) b_{n_{0}} (m - 1)} ϵ + \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}}, \forall n_{0} \leq m \leq n\} .

Then, $\{A_{n}\}$ is a decreasing sequence of sets; that is, $A_{n + 1} \subseteq A_{n}$ for all $n \geq n_{0}$ . Now, let A be the set

\{‖ x_{m} - x^{*} ‖ \leq e^{- (1 - α) b_{n_{0}} (m - 1)} ϵ + \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}}, \forall m \geq n_{0}\} .

Then, $A = \cap_{n = n_{0}}^{\infty} A_{n}$ . Hence, $P (A) = \lim_{n ↑ \infty} P (A_{n})$ . This completes the proof for Theorem 1.
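As a rough numerical illustration of the quantities appearing in this proof, the following Python sketch evaluates the radius $Δ (n_{0}, ϵ, δ)$ and the union-bound estimate $\sum_{m = n_{0} + 1}^{n} p_{m - 1} + P (‖ x_{n_{0}} - x^{*} ‖ > ϵ)$ with $p_{m}$ replaced by the bound of Lemma 3. The function names, the callable `beta` standing in for $β_{n_{0}} (\cdot)$, and all numerical values are hypothetical choices for illustration; the constants $c_{1}, c_{2}, D$ are not computed here.

```python
import math


def delta_radius(a_n0, c1, c2, eps, delta, alpha):
    """Radius (a(n0)(c2 + c1*eps) + delta) / (1 - alpha - a(n0)*c1)."""
    denom = 1.0 - alpha - a_n0 * c1
    if denom <= 0:
        # The bound is only meaningful when a(n0) * c1 < 1 - alpha.
        raise ValueError("need a(n0) * c1 < 1 - alpha")
    return (a_n0 * (c2 + c1 * eps) + delta) / denom


def failure_bound(n0, n, d, D, delta, beta, p_init):
    """Union bound: sum over m = n0+1..n of 2 d exp(-D delta^2 / beta(m-1)),
    plus p_init, an assumed bound on P(||x_{n0} - x*|| > eps)."""
    tail = sum(
        2 * d * math.exp(-D * delta ** 2 / beta(m - 1))
        for m in range(n0 + 1, n + 1)
    )
    return tail + p_init


# Hypothetical values, chosen only to exercise the formulas.
radius = delta_radius(a_n0=0.01, c1=1.0, c2=2.0, eps=0.5, delta=0.1, alpha=0.5)
bound = failure_bound(n0=1000, n=5000, d=3, D=50.0, delta=0.5,
                      beta=lambda m: 1.0 / m, p_init=0.05)
```

The radius shrinks as $a (n_{0})$ and $δ$ decrease, while the probability estimate grows with $n$ only through the summed terms, which is what allows the limit $n ↑ \infty$ to be taken.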

5. Conclusions

In conclusion, we note some future directions. The concept of relaxed martingale concentration inequalities can be used to obtain bounds of a similar flavor for other algorithms that lack almost-sure guarantees on the boundedness of their iterates. These include TD( $λ$ ) and SSP Q-Learning. Alternatively, similar bounds can be obtained for variants of temporal difference learning (Chen et al. 2021). Another direction could be to improve the bounds in this paper to get an exponentially small tail for Markovian stochastic approximation.

Appendix A. A Martingale Inequality

Let $\{M_{n}\}$ be a real-valued martingale difference sequence with respect to an increasing family of $σ$ -fields $\{F_{n}\}$ . Assume that there exist $ε, C > 0$ such that

E [e^{ε | M_{n} |} | F_{n - 1}] \leq C \quad \forall n \geq 1 \ \text{a.s.}

Let $S_{n} ≔ \sum_{m = 1}^{n} ζ_{m, n} M_{m}$ , where $ζ_{m, n}, m \leq n,$ for each n, are a.s. bounded $\{F_{n}\}$ -previsible random variables; that is, $ζ_{m, n}$ is $F_{m - 1}$ -measurable $\forall m \geq 1$ , and $| ζ_{m, n} | \leq A_{m, n}$ a.s. for some constant $A_{m, n}$ , $\forall m, n$ . Suppose

\sum_{m = 1}^{n} A_{m, n} \leq γ_{1}, \quad \max_{1 \leq m \leq n} A_{m, n} \leq γ_{2} ω (n),

for some $γ_{i}, ω (n) > 0, i = 1, 2; n \geq 1$ . Then, we have the following.

Theorem A.1.

There exists a constant $D > 0$ depending on $ε, C, γ_{1}, γ_{2}$ such that for $ϵ > 0$ ,

P (| S_{n} | > ϵ) \leq 2 e^{- \frac{D ϵ^{2}}{ω (n)}}, \quad \text{if } ϵ \in (0, \frac{C γ_{1}}{ε}],

(A.1)

P (| S_{n} | > ϵ) \leq 2 e^{- \frac{D ϵ}{ω (n)}}, \quad \text{otherwise} .

(A.2)

This is a variant of Liu and Watbled (2009, theorem 1.1). See Thoppe and Borkar (2019, pp. 21–23, theorem A.1) for details.
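To make the two regimes of Theorem A.1 concrete, the following Python sketch evaluates the right-hand sides of (A.1) and (A.2). The function name and argument names are hypothetical, and the constant $D$ is treated as a given input because the theorem only asserts its existence.

```python
import math


def martingale_tail_bound(dev, omega_n, D, C, gamma1, eps_mgf):
    """Right-hand side of (A.1)/(A.2) for a deviation level dev > 0.

    dev     : the deviation level (the symbol ϵ in Theorem A.1)
    omega_n : the value ω(n)
    D       : the constant from the theorem (assumed known here)
    C, gamma1, eps_mgf : the constants C, γ_1 and ε from the assumptions
    """
    if dev <= 0:
        raise ValueError("deviation level must be positive")
    threshold = C * gamma1 / eps_mgf
    if dev <= threshold:
        # Sub-Gaussian regime (A.1): exponent quadratic in dev.
        return 2.0 * math.exp(-D * dev ** 2 / omega_n)
    # Sub-exponential regime (A.2): exponent linear in dev.
    return 2.0 * math.exp(-D * dev / omega_n)


small = martingale_tail_bound(0.5, 0.1, D=1.0, C=2.0, gamma1=1.0, eps_mgf=1.0)
large = martingale_tail_bound(3.0, 0.1, D=1.0, C=2.0, gamma1=1.0, eps_mgf=1.0)
```

Both branches decay as $ω (n)$ decreases, which is the property used when $ω (n)$ is later identified with $β_{n_{0}} (m)$.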

Appendix B. Technical Proofs

B.1. Proof of Lemma 1

Proof.

\begin{array}{l} ‖ \sum_{s \in S} π (s) (F (x, s) - F (z, s)) ‖^{2} = ‖ γ \sum_{s \in S} π (s) φ (s) \sum_{s^{'} \in S} p (s^{'} | s) φ {(s^{'})}^{T} (x - z) - \sum_{s \in S} π (s) φ (s) φ {(s)}^{T} (x - z) + (x - z) ‖^{2} \\ = ‖ (γ Φ^{T} D P Φ - Φ^{T} D Φ + I) (x - z) ‖^{2} \\ = ‖ (γ Φ^{T} D P Φ - Φ^{T} D Φ) (x - z) ‖^{2} + {(x - z)}^{T} (x - z) - 2 {(x - z)}^{T} Φ^{T} D Φ (x - z) \\ + {(x - z)}^{T} (γ Φ^{T} D P Φ + γ Φ^{T} P^{T} D Φ) (x - z) . \end{array}

(B.1)

Now,

\begin{array}{l} {(x - z)}^{T} (γ Φ^{T} D P Φ + γ Φ^{T} P^{T} D Φ) (x - z) & = {(x - z)}^{T} γ Φ^{T} (D P + P^{T} D) Φ (x - z) \\ = γ {〈 Φ (x - z), P Φ (x - z) 〉}_{D} + γ {〈 P Φ (x - z), Φ (x - z) 〉}_{D} \\ \overset{(a)}{\leq} 2 γ ‖ P Φ (x - z) ‖_{D} ‖ Φ (x - z) ‖_{D} \\ \overset{(b)}{\leq} 2 γ ‖ Φ (x - z) ‖_{D}^{2}, \end{array}

(B.2)

and

\begin{array}{l} 2 {(x - z)}^{T} Φ^{T} D Φ (x - z) = 2 {〈 Φ (x - z), Φ (x - z) 〉}_{D} = 2 ‖ Φ (x - z) ‖_{D}^{2} . \end{array}

(B.3)

Inequality (a) follows from the Cauchy–Schwarz inequality, and (b) follows from the observation that $‖ P y ‖_{D} \leq ‖ y ‖_{D}$ , which can be proved as follows:

‖ P y ‖_{D}^{2} = \sum_{s \in S} π (s) {(\sum_{s^{'} \in S} p (s^{'} | s) y (s^{'}))}^{2} \leq \sum_{s \in S} π (s) \sum_{s^{'} \in S} p (s^{'} | s) y {(s^{'})}^{2} = \sum_{s^{'} \in S} π (s^{'}) y {(s^{'})}^{2} = ‖ y ‖_{D}^{2} .

Here, the inequality follows from Jensen’s inequality.
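The non-expansiveness $‖ P y ‖_{D} \leq ‖ y ‖_{D}$ can be checked numerically on a small example. The following Python sketch is a sanity check under stated assumptions, not part of the proof: it draws a random transition matrix with strictly positive entries (so that the chain is irreducible and aperiodic), approximates the stationary distribution $π$ by power iteration, and compares the two weighted norms for random vectors $y$. All function names and sizes are hypothetical.

```python
import random


def random_stochastic_matrix(n, rng):
    """Row-stochastic matrix with strictly positive entries."""
    rows = []
    for _ in range(n):
        weights = [rng.random() + 0.1 for _ in range(n)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def stationary_distribution(P, iters=2000):
    """Approximate pi with pi = pi P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[s] * P[s][t] for s in range(n)) for t in range(n)]
    return pi


def d_norm_sq(y, pi):
    """Squared weighted norm ||y||_D^2 = sum_s pi(s) y(s)^2."""
    return sum(p * v * v for p, v in zip(pi, y))


def apply_P(P, y):
    """(P y)(s) = sum_{s'} p(s' | s) y(s')."""
    return [sum(P[s][t] * y[t] for t in range(len(y))) for s in range(len(P))]


rng = random.Random(0)
P = random_stochastic_matrix(4, rng)
pi = stationary_distribution(P)

# gap = ||y||_D^2 - ||P y||_D^2 should be nonnegative for every y.
gaps = []
for _ in range(100):
    y = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    gaps.append(d_norm_sq(y, pi) - d_norm_sq(apply_P(P, y), pi))
```

The check relies on stationarity of $π$ in the same place the proof does: the identity $\sum_{s} π (s) p (s^{'} | s) = π (s^{'})$ turns the Jensen upper bound back into $‖ y ‖_{D}^{2}$.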

Combining (B.2) and (B.3) with (B.1) gives us

\begin{array}{l} ‖ \sum_{s \in S} π (s) (F (x, s) - F (z, s)) ‖^{2} \leq ‖ x - z ‖^{2} - 2 (1 - γ) ‖ Φ (x - z) ‖_{D}^{2} + ‖ (γ Φ^{T} D P Φ - Φ^{T} D Φ) (x - z) ‖^{2} . \end{array}

(B.4)

To analyze the last term in (B.4), we use the fact that the operator norm of a matrix defined as $‖ M ‖ = \sup_{x \neq θ} \frac{‖ M x ‖}{‖ x ‖}$ , using the Euclidean norm for vectors, is equal to the largest singular value of that matrix. Thus,

\begin{array}{l} ‖ (γ Φ^{T} D P Φ - Φ^{T} D Φ) (x - z) ‖^{2} & = ‖ Φ^{T} \sqrt{D} (γ \sqrt{D} P Φ - \sqrt{D} Φ) (x - z) ‖^{2} \\ \leq λ_{M}^{2} ‖ (γ \sqrt{D} P Φ - \sqrt{D} Φ) (x - z) ‖^{2} \\ = λ_{M}^{2} {〈 (γ P - I) Φ (x - z), (γ P - I) Φ (x - z) 〉}_{D} \\ = λ_{M}^{2} ‖ (I - γ P) Φ (x - z) ‖_{D}^{2} \\ \leq λ_{M}^{2} {(1 + γ)}^{2} ‖ Φ (x - z) ‖_{D}^{2} . \end{array}

(B.5)

The last inequality follows from the triangle inequality. We now invoke Assumption (3) and combine (B.5) with (B.4) as follows:

\begin{array}{l} ‖ \sum_{s \in S} π (s) (F (x, s) - F (z, s)) ‖^{2} \leq ‖ x - z ‖^{2} - 2 (1 - γ) ‖ Φ (x - z) ‖_{D}^{2} + λ_{M}^{2} {(1 + γ)}^{2} ‖ Φ (x - z) ‖_{D}^{2} \\ < ‖ x - z ‖^{2} - 2 (1 - γ) ‖ Φ (x - z) ‖_{D}^{2} + {(\frac{\sqrt{2 (1 - γ)}}{1 + γ})}^{2} {(1 + γ)}^{2} ‖ Φ (x - z) ‖_{D}^{2} \\ = ‖ x - z ‖^{2} . \end{array}

(B.6)

This gives us the required contraction property with contraction factor $α$ for which an explicit expression can be obtained, using the first inequality in (B.6) as

α = \sqrt{1 - \min_{x \neq θ} \frac{‖ Φ x ‖_{D}^{2}}{‖ x ‖^{2}} (2 (1 - γ) - λ_{M}^{2} {(1 + γ)}^{2})} .

Note that as the columns of $Φ$ are linearly independent, $x \neq θ \Rightarrow Φ x \neq θ$ , and hence, $\frac{‖ Φ x ‖_{D}}{‖ x ‖} > 0$ when $x \neq θ$ . Also, note that $\min_{x \neq θ} \frac{‖ Φ x ‖_{D}}{‖ x ‖} = \min_{‖ x ‖ = 1} ‖ Φ x ‖_{D}$ , and hence, by the extreme value theorem, we have that $\min_{‖ x ‖ = 1} ‖ Φ x ‖_{D}$ is attained and is greater than zero. Along with Assumption (3), this implies that $α < 1$ . $□$

B.2. Proof of Lemma 2

Proof.

Using definitions of $U (\cdot)$ and $W (\cdot)$ , we have

\begin{array}{l} \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (F (x_{k}, Y_{k}) - \sum_{s \in S} π (s) F (x_{k}, s)) \\ = \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (U (Y_{k}) - \sum_{s^{'} \in S} p (s^{'} | Y_{k}) U (s^{'})) \end{array}

(B.7a)

\begin{array}{l} + \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (W (Y_{k}) - \sum_{s^{'} \in S} p (s^{'} | Y_{k}) W (s^{'})) x_{k} . \end{array}

(B.7b)

We first simplify (B.7a) as follows:

\begin{array}{l} \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (U (Y_{k}) - \sum_{s^{'} \in S} p (s^{'} | Y_{k}) U (s^{'})) \\ = \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (U (Y_{k + 1}) - \sum_{s^{'} \in S} p (s^{'} | Y_{k}) U (s^{'})) \end{array}

(B.8a)

\begin{array}{l} + \sum_{k = n_{0} + 1}^{m} ((χ (m, k + 1) a (k) - χ (m, k) a (k - 1)) U (Y_{k})) \end{array}

(B.8b)

\begin{array}{l} + χ (m, n_{0} + 1) a (n_{0}) U (Y_{n_{0}}) - χ (m, m + 1) a (m) U (Y_{m + 1}) . \end{array}

(B.8c)

For (B.8a), define ${\tilde{U}}_{k + 1} = U (Y_{k + 1}) - \sum_{s^{'} \in S} p (s^{'} | Y_{k}) U (s^{'})$ for $k \geq n_{0}$ , and zero otherwise. This is a martingale difference sequence with respect to $\{F_{n}\}$ .

We define $U_{max} ≔ \max_{i \in S} ‖ U (i) ‖$ and bound the norm of (B.8b) as follows:

\begin{array}{l} ‖ \sum_{k = n_{0} + 1}^{m} ((χ (m, k + 1) a (k) - χ (m, k) a (k - 1)) U (Y_{k})) ‖ \\ \leq ‖ \sum_{k = n_{0} + 1}^{m} ((χ (m, k + 1) a (k) - χ (m, k + 1) a (k - 1)) U (Y_{k})) ‖ \\ + ‖ \sum_{k = n_{0} + 1}^{m} ((χ (m, k + 1) a (k - 1) - χ (m, k) a (k - 1)) U (Y_{k})) ‖ \\ \leq \sum_{k = n_{0} + 1}^{m} ((a (k - 1) - a (k)) χ (m, k + 1) U_{max}) + \sum_{k = n_{0} + 1}^{m} ((χ (m, k + 1) - χ (m, k)) a (k - 1) U_{max}) \\ \leq \sum_{k = n_{0} + 1}^{m} ((a (k - 1) - a (k)) U_{max}) + \sum_{k = n_{0} + 1}^{m} ((χ (m, k + 1) - χ (m, k)) a (n_{0}) U_{max}) \\ = (a (n_{0}) - a (m)) U_{max} + (χ (m, m + 1) - χ (m, n_{0} + 1)) a (n_{0}) U_{max} \\ \leq 2 a (n_{0}) U_{max} . \end{array}

(B.9)

The second and third inequalities follow from $a (k - 1) - a (k) \geq 0$ because $a (k)$ is a nonincreasing sequence for $k > n_{0}$ , and $χ (m, k + 1) - χ (m, k)$ is positive because $1 \geq χ (m, k + 1) \geq χ (m, k)$ for $m, k > n_{0}$ , as $a (k) < 1$ for $k > n_{0}$ . Note that the norm of (B.8c) is directly bounded by $2 a (n_{0}) U_{max}$ .

Now, we simplify (B.7b) as follows:

\begin{array}{l} \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (W (Y_{k}) - \sum_{s^{'} \in S} p (s^{'} | Y_{k}) W (s^{'})) x_{k} \\ = \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (W (Y_{k + 1}) - \sum_{s^{'} \in S} p (s^{'} | Y_{k}) W (s^{'})) x_{k} \end{array}

(B.10a)

\begin{array}{l} + \sum_{k = n_{0} + 1}^{m} (χ (m, k + 1) a (k) - χ (m, k) a (k - 1)) W (Y_{k}) x_{k} \end{array}

(B.10b)

\begin{array}{l} + \sum_{k = n_{0} + 1}^{m} χ (m, k) a (k - 1) W (Y_{k}) (x_{k} - x_{k - 1}) \end{array}

(B.10c)

\begin{array}{l} + χ (m, n_{0} + 1) a (n_{0}) W (Y_{n_{0}}) x_{n_{0}} - χ (m, m + 1) a (m) W (Y_{m + 1}) x_{m} . \end{array}

(B.10d)

Similar to the sequence ${\tilde{U}}_{k + 1}$ , for (B.10a), define ${\tilde{W}}_{k + 1} = W (Y_{k + 1}) - \sum_{s^{'} \in S} p (s^{'} | Y_{k}) W (s^{'})$ for $k \geq n_{0}$ , and zero otherwise. Note that ${\tilde{W}}_{k + 1} x_{k}$ is a martingale difference sequence with respect to $\{F_{n}\}$ .

Define $W_{max} ≔ \max_{i \in S} ‖ W (i) ‖$ . Note that, here, $‖ W (i) ‖$ denotes the operator norm of a matrix, that is, $‖ W (i) ‖ = \sup_{x \neq θ} \frac{‖ W (i) x ‖}{‖ x ‖}$ , using the Euclidean norm for vectors. Similar to (B.8b), we bound the norm of (B.10b) as follows:

\begin{array}{l} ‖ \sum_{k = n_{0} + 1}^{m} (χ (m, k + 1) a (k) - χ (m, k) a (k - 1)) W (Y_{k}) x_{k} ‖ \\ \leq ‖ \sum_{k = n_{0} + 1}^{m} (χ (m, k + 1) a (k) - χ (m, k) a (k - 1)) W (Y_{k}) (x_{k} - z_{k}) ‖ \\ + ‖ \sum_{k = n_{0} + 1}^{m} (χ (m, k + 1) a (k) - χ (m, k) a (k - 1)) W (Y_{k}) z_{k} ‖ \\ \leq 2 a (n_{0}) W_{max} (x_{m}^{'} + ‖ x_{n_{0}} - x^{*} ‖ + ‖ x^{*} ‖) . \end{array}

The last inequality here follows from the definition of $x_{m}^{'} = \sup_{n_{0} \leq k \leq m} ‖ x_{k} - z_{k} ‖$ and from the bound on $‖ z_{n} ‖$ (5). For (B.10c), let us first bound $‖ x_{k} - x_{k - 1} ‖$ .

\begin{array}{l} ‖ x_{k} - x_{k - 1} ‖ & = a (k) ‖ φ (Y_{k - 1}) (r (Y_{k - 1}) + γ φ {(Y_{k})}^{T} x_{k - 1} - φ {(Y_{k - 1})}^{T} x_{k - 1}) ‖ \\ \leq a (k) (K_{1} + K_{2} ‖ x_{k - 1} ‖) \\ \leq a (n_{0}) (K_{1} + K_{2} (x_{m}^{'} + ‖ x_{n_{0}} - x^{*} ‖ + ‖ x^{*} ‖)), \end{array}

for appropriate $K_{1}$ and $K_{2}$ . Before simplifying (B.10c), we first need to repeat an important simplification from our main proof. Note that for any $0 < k \leq m$ ,

χ (m, k) + χ (m, k + 1) a (k) = χ (m, k + 1),

and hence,

χ (m, n_{0}) + \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) = χ (m, m + 1) = 1 .

This implies that

\sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) \leq 1 .
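The telescoping identity above is easy to confirm numerically. The following Python sketch assumes $χ (m, k) = \prod_{i = k}^{m} (1 - a (i))$ with the empty product $χ (m, m + 1) = 1$ , which matches the product form used in the proof of Lemma 3, and takes the step size $a (n) = d_{1} / (n + 1)$ with a hypothetical value $d_{1} = 2$ . The function names are illustrative.

```python
def chi(a, m, k):
    """chi(m, k) = prod_{i=k}^{m} (1 - a(i)); equals 1 when k = m + 1."""
    prod = 1.0
    for i in range(k, m + 1):
        prod *= 1.0 - a(i)
    return prod


def step_size(n, d1=2.0):
    """Hypothetical step size a(n) = d1 / (n + 1)."""
    return d1 / (n + 1)


n0, m = 5, 200

# Weighted sum of step sizes appearing in the bound.
weighted_sum = sum(chi(step_size, m, k + 1) * step_size(k) for k in range(n0, m + 1))

# chi(m, n0) + sum_{k=n0}^{m} chi(m, k+1) a(k) should equal chi(m, m+1) = 1.
total = chi(step_size, m, n0) + weighted_sum
```

Because $χ (m, n_{0}) \geq 0$ whenever $a (i) \leq 1$ for $i \geq n_{0}$ , the identity immediately gives the stated bound on the weighted sum.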

We can finally bound the norm of (B.10c):

\begin{array}{l} ‖ \sum_{k = n_{0} + 1}^{m} χ (m, k) a (k - 1) W (Y_{k}) (x_{k} - x_{k - 1}) ‖ \\ \leq \sum_{k = n_{0} + 1}^{m} χ (m, k) a (k - 1) ‖ W (Y_{k}) (x_{k} - x_{k - 1}) ‖ \\ \leq \sum_{k = n_{0} + 1}^{m} χ (m, k) a (k - 1) a (n_{0}) W_{max} (K_{1} + K_{2} (x_{m}^{'} + ‖ x_{n_{0}} - x^{*} ‖ + ‖ x^{*} ‖)) \\ \leq a (n_{0}) W_{max} (K_{1} + K_{2} (x_{m}^{'} + ‖ x_{n_{0}} - x^{*} ‖ + ‖ x^{*} ‖)) . \end{array}
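The increment bound $‖ x_{k} - x_{k - 1} ‖ \leq a (k) (K_{1} + K_{2} ‖ x_{k - 1} ‖)$ used in this estimate can be illustrated with a direct computation of the TD(0) increment. In the following Python sketch, the features, rewards, and discount factor are randomly generated hypothetical values, and the constants are taken as $K_{1} = φ_{max} r_{max}$ and $K_{2} = (1 + γ) φ_{max}^{2}$ , which is one admissible choice and not necessarily the one intended in the proof.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def td0_increment(x, phi_s, phi_next, reward, gamma, step):
    """step * phi(s) * (r(s) + gamma * phi(s')^T x - phi(s)^T x)."""
    td_error = reward + gamma * dot(phi_next, x) - dot(phi_s, x)
    return [step * td_error * f for f in phi_s]


rng = random.Random(1)
gamma = 0.9
num_states, dim = 5, 3
features = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_states)]
rewards = [rng.uniform(-1.0, 1.0) for _ in range(num_states)]

phi_max = max(norm(f) for f in features)
r_max = max(abs(r) for r in rewards)
K1 = phi_max * r_max                 # assumed choice of K_1
K2 = (1.0 + gamma) * phi_max ** 2    # assumed choice of K_2

violations = 0
for _ in range(1000):
    s, s_next = rng.randrange(num_states), rng.randrange(num_states)
    x = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
    step = rng.uniform(0.0, 0.5)
    inc = td0_increment(x, features[s], features[s_next], rewards[s], gamma, step)
    if norm(inc) > step * (K1 + K2 * norm(x)) + 1e-12:
        violations += 1
```

The bound follows from the triangle and Cauchy–Schwarz inequalities applied to the temporal difference error, so no violations are expected for any state pair or iterate.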

Finally, the norm of (B.10d) can directly be bounded by

\begin{array}{l} ‖ χ (m, n_{0} + 1) a (n_{0}) W (Y_{n_{0}}) x_{n_{0}} - χ (m, m + 1) a (m) W (Y_{m + 1}) x_{m} ‖ \leq 2 a (n_{0}) W_{max} (x_{m}^{'} + ‖ x_{n_{0}} - x^{*} ‖ + ‖ x^{*} ‖) . \end{array}

Combining the bounds above gives us

\begin{array}{l} \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (F (x_{k}, Y_{k}) - \sum_{s \in S} π (s) F (x_{k}, s)) \\ = \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) ({\tilde{U}}_{k + 1} + {\tilde{W}}_{k + 1} x_{k}) + μ_{m} (n_{0}), \end{array}

where

‖ μ_{m} (n_{0}) ‖ \leq 4 a (n_{0}) U_{max} + a (n_{0}) W_{max} (K_{1} + (4 + K_{2}) (x_{m}^{'} + ‖ x_{n_{0}} - x^{*} ‖ + ‖ x^{*} ‖)) .

Define constants $c_{1} ≔ W_{max} (4 + K_{2})$ and $c_{2} ≔ 4 U_{max} + K_{1} W_{max} + c_{1} ‖ x^{*} ‖$ . This completes the proof of Lemma 2. $□$

B.3. Proof of Lemma 3

Proof.

We first note that for $n_{0} \leq k \leq m$ , $‖ {\bar{x}}_{k, m} ‖ \leq ‖ {\bar{x}}_{k, m} - {\bar{z}}_{k, m} ‖ + ‖ {\bar{z}}_{k, m} ‖ \leq {\bar{x}}_{m, m}^{'} + ‖ {\bar{z}}_{k, m} ‖$ . The following facts follow from the definition of $B_{m}$ : if $ξ \in B_{m}$ , then ${\bar{x}}_{m, m}^{'} (ξ) = 0$ , and if $ξ \notin B_{m}$ , then ${\bar{x}}_{m, m}^{'} (ξ) = x_{m}^{'} (ξ) \leq \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}}$ . Hence,

{\bar{x}}_{m, m}^{'} \leq \frac{a (n_{0}) (c_{2} + c_{1} ϵ) + δ}{1 - α - a (n_{0}) c_{1}} .

Using (5), we have $‖ {\bar{z}}_{k, m} ‖ \leq ϵ + ‖ x^{*} ‖$ . Under the condition that $ϵ \leq 1$ and $δ \leq 1$ , we have

‖ {\bar{x}}_{k, m} ‖ \leq 1 + ‖ x^{*} ‖ + \frac{a (n_{0}) (c_{2} + c_{1}) + 1}{1 - α - a (n_{0}) c_{1}} .

Let $v^{(ℓ)}$ denote the $ℓ$ th component of a vector v. Then,

\begin{array}{l} Γ_{m} : & = ‖ \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) (M_{k + 1} {\bar{x}}_{k, m} + {\tilde{W}}_{k + 1} {\bar{x}}_{k, m} + {\tilde{U}}_{k + 1}) ‖ \\ \leq \sqrt{d} \max_{1 \leq ℓ \leq d} | \sum_{k = n_{0}}^{m} χ (m, k + 1) a (k) {(M_{k + 1} {\bar{x}}_{k, m} + {\tilde{W}}_{k + 1} {\bar{x}}_{k, m} + {\tilde{U}}_{k + 1})}^{(ℓ)} | . \end{array}

Recall that d here is the dimension of our iterates ${x_{n}}$ . We apply Theorem A.1 from Appendix A component-wise. For this, first note that

{(M_{k + 1} {\bar{x}}_{k, m} + {\tilde{W}}_{k + 1} {\bar{x}}_{k, m} + {\tilde{U}}_{k + 1})}^{(ℓ)} \leq c_{3} (2 + ‖ x^{*} ‖ + \frac{a (n_{0}) (c_{2} + c_{1}) + 1}{1 - α - a (n_{0}) c_{1}}),

where $c_{3} = \max \{M_{max} + 2 W_{max}, 2 U_{max}\}$ . In the theorem statement, let

C = \sqrt{d} c_{3} (2 + ‖ x^{*} ‖ + \frac{a (n_{0}) (c_{2} + c_{1}) + 1}{1 - α - a (n_{0}) c_{1}}), \quad ζ_{k, m} = χ (m, k + 1) a (k), \quad ε = 1, \quad γ_{1} = 1 .

Next, we choose suitable $γ_{2}$ and $ω (m)$ such that $\max_{n_{0} \leq k \leq m} ζ_{k, m} \leq γ_{2} ω (m)$ . For this, we use our assumption that $\frac{d_{1}}{n + 1} \leq a (n) \leq d_{3} {(\frac{1}{n + 1})}^{d_{2}}, \forall n \geq n_{0}$ , to obtain

\begin{array}{l} χ (m, k + 1) = \prod_{i = k + 1}^{m} (1 - a (i)) \leq \exp (- \sum_{i = k + 1}^{m} a (i)) \leq \exp (- \sum_{i = k + 1}^{m} \frac{d_{1}}{i + 1}) \\ \leq \exp (- \int_{k + 1}^{m + 1} \frac{d_{1}}{y + 1} d y) \leq \exp (d_{1} (\log (k + 2) - \log (m + 2))) \\ = {(\frac{k + 2}{m + 2})}^{d_{1}} \\ \Rightarrow \max_{n_{0} \leq k \leq m} a (k) χ (m, k + 1) \leq \max_{n_{0} \leq k \leq m} d_{3} {(\frac{1}{k})}^{d_{2}} {(\frac{k + 2}{m + 2})}^{d_{1}} \leq \max_{n_{0} \leq k \leq m} d_{3} {(\frac{1}{k})}^{d_{2}} {(\frac{2 k}{m + 2})}^{d_{1}} . \end{array}

From the last inequality, $γ_{2} = d_{3} 2^{d_{1}}$ and $ω (m) = β_{n_{0}} (m)$ satisfy the required conditions. Then, there exists a constant $D > 0$ such that for $n_{0} < m$ and $δ \in (0, C]$ , we have

P (Γ_{m} \geq δ) \leq 2 d e^{- D δ^{2} / β_{n_{0}} (m)},

and for $δ > C$ ,

P (Γ_{m} \geq δ) \leq 2 d e^{- D δ / β_{n_{0}} (m)} .

The factor d comes from the application of union bound to bound the maximum over all components. Under the assumption that $δ \leq 1$ , we have that $e^{- D δ^{2} / β_{n_{0}} (m)} \geq e^{- D δ / β_{n_{0}} (m)}$ , and hence, $P (Γ_{m} \geq δ) \leq 2 d e^{- D δ^{2} / β_{n_{0}} (m)}$ . $□$
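The product bound $χ (m, k + 1) \leq {(\frac{k + 2}{m + 2})}^{d_{1}}$ derived in this proof can be verified numerically for the boundary case $a (n) = d_{1} / (n + 1)$ of the step-size assumption. The following Python sketch does so for two hypothetical values of $d_{1}$ , with $n_{0}$ chosen large enough that $a (i) < 1$ for all indices used; the function names are illustrative.

```python
def chi(a, m, k):
    """chi(m, k) = prod_{i=k}^{m} (1 - a(i)); equals 1 when k = m + 1."""
    prod = 1.0
    for i in range(k, m + 1):
        prod *= 1.0 - a(i)
    return prod


def product_bound_holds(d1, n0, m, tol=1e-12):
    """Check chi(m, k+1) <= ((k+2)/(m+2))**d1 for all n0 <= k <= m."""
    def a(n):
        return d1 / (n + 1)

    return all(
        chi(a, m, k + 1) <= ((k + 2) / (m + 2)) ** d1 + tol
        for k in range(n0, m + 1)
    )


ok_d1_2 = product_bound_holds(d1=2.0, n0=3, m=300)
ok_d1_15 = product_bound_holds(d1=1.5, n0=2, m=100)
```

The case $k = m$ is the equality case, since both sides equal one there; the small tolerance only guards against floating-point rounding.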

B.4. Proof of Corollary 1

To show Corollary 1, we first obtain values of $δ$ and $ϵ$ such that the probability in Theorem 1 is $1 - ε_{1} - ε_{2}$ . We use $s_{i}, i = 1, 2, \dots$ to denote different constants throughout this proof. For $a (n) = d_{1} / (n + 1)$ with a sufficiently large $d_{1}$ , we have $β_{n_{0}} (m) \leq 1 / m$ . This implies that

\sum_{m \geq n_{0} + 1} \exp (- D δ^{2} / β_{n_{0}} (m)) \leq \sum_{m \geq n_{0} + 1} \exp (- D δ^{2} m) \leq s_{1} \exp (- D δ^{2} n_{0}) .

Let $ε_{1} / (2 d) = s_{1} \exp (- D δ^{2} n_{0})$ , which gives us $δ = s_{2} n_{0}^{- 1 / 2} \log^{1 / 2} (s_{3} / ε_{1})$ for appropriate constants $s_{2}$ and $s_{3}$ . This choice of $δ$ gives us

2 d \sum_{m \geq n_{0} + 1} \exp (- D δ^{2} / β_{n_{0}} (m)) \leq ε_{1} .

Let $ϵ = \sqrt{E [‖ x_{n_{0}} - x^{*} ‖^{2}]} / \sqrt{ε_{2}}$ , which implies that

P (‖ x_{n_{0}} - x^{*} ‖ > ϵ) = P (‖ x_{n_{0}} - x^{*} ‖^{2} > \frac{E [‖ x_{n_{0}} - x^{*} ‖^{2}]}{ε_{2}}) \leq ε_{2} .

Here, the last inequality follows from Markov’s inequality. Being a linear contractive stochastic approximation iteration with an aperiodic irreducible Markov chain, our formulation satisfies the assumptions for Chen et al. (2021, theorem 2.1). To apply their result, we note the corresponding mapping between constants: the norm $‖ \cdot ‖_{c}$ is the Euclidean norm in our case, h is one, $φ_{2}$ is $1 - α$ in our case, and their $α$ is $d_{1}$ in our case. For $d_{1} > 1 / (1 - α)$ and $n_{0}$ sufficiently large to satisfy the condition for Chen et al. (2021, theorem 2.1(2)), we can use their result (Chen et al. 2021, theorem 2.1(2)(b)(iii)) to obtain the following mean square bound:

E [‖ x_{n_{0}} - x^{*} ‖^{2}] \leq s_{4} {(\frac{1}{n_{0} + 1})}^{(1 - α) d_{1}} + s_{5} \frac{\log (n_{0} + 1)}{n_{0} + 1} \leq s_{6} \frac{\log (n_{0} + 1)}{n_{0} + 1} .
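The choice $ϵ = \sqrt{E [‖ x_{n_{0}} - x^{*} ‖^{2}]} / \sqrt{ε_{2}}$ made above can be illustrated on synthetic data. The following Python sketch is only an illustration of Markov's inequality: the "errors" are absolute values of standard Gaussian samples standing in for $‖ x_{n_{0}} - x^{*} ‖$ , the second moment is replaced by its empirical average, and all names and values are hypothetical.

```python
import math
import random


def epsilon_from_second_moment(second_moment, eps2):
    """eps = sqrt(E[error^2] / eps2), as in the choice of eps above."""
    return math.sqrt(second_moment / eps2)


rng = random.Random(2)

# Synthetic stand-in for the error norm at time n0.
errors = [abs(rng.gauss(0.0, 1.0)) for _ in range(10000)]
second_moment = sum(e * e for e in errors) / len(errors)

eps2 = 0.05
eps = epsilon_from_second_moment(second_moment, eps2)

# Fraction of samples exceeding eps; Markov's inequality applied to the
# empirical distribution of error^2 guarantees this is at most eps2.
exceed_fraction = sum(e > eps for e in errors) / len(errors)
```

For light-tailed errors the observed fraction is typically far below $ε_{2}$ , which reflects the looseness of Markov's inequality and the $1 / \sqrt{ε_{2}}$ dependence in the final bound.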

Substituting the values of $δ$ and $ϵ$ in our bound, we get, with probability greater than $1 - ε_{1} - ε_{2}$ ,

\begin{array}{l} ‖ x_{m} - x^{*} ‖ & \leq e^{- (1 - α) b_{n_{0}} (m - 1)} \frac{\sqrt{E [‖ x_{n_{0}} - x^{*} ‖^{2}]}}{\sqrt{ε_{2}}} \\ + s_{7} (\frac{c_{2} d_{1}}{n_{0} + 1} + \frac{c_{1} d_{1}}{n_{0} + 1} \frac{\sqrt{E [‖ x_{n_{0}} - x^{*} ‖^{2}]}}{\sqrt{ε_{2}}} + s_{2} n_{0}^{- 1 / 2} \log^{1 / 2} (s_{3} / ε_{1})) \\ \leq e^{- (1 - α) b_{n_{0}} (m - 1)} \frac{s_{6}}{\sqrt{ε_{2}}} \sqrt{\frac{\log (n_{0} + 1)}{n_{0} + 1}} \\ + s_{7} (\frac{c_{2} d_{1}}{n_{0} + 1} + \frac{c_{1} d_{1}}{n_{0} + 1} \frac{s_{6}}{\sqrt{ε_{2}}} \sqrt{\frac{\log (n_{0} + 1)}{n_{0} + 1}} + s_{2} n_{0}^{- 1 / 2} \log^{1 / 2} (s_{3} / ε_{1})) \end{array}

for all $m \geq n_{0}$ . Now,

\begin{array}{l} \exp (- (1 - α) b_{n_{0}} (m - 1)) \\ \leq \exp (- \sum_{i = n_{0}}^{m - 1} (1 - α) a (i)) \leq \exp (- \sum_{i = n_{0}}^{m - 1} \frac{(1 - α) d_{1}}{i + 1}) \\ \leq \exp (- \int_{n_{0}}^{m} \frac{(1 - α) d_{1}}{y + 1} d y) \leq \exp ((1 - α) d_{1} (\log (n_{0} + 1) - \log (m + 1))) \\ = {(\frac{n_{0} + 1}{m + 1})}^{d_{1} (1 - α)} \leq \frac{n_{0} + 1}{m + 1} . \end{array}

Here, the final inequality follows from the assumption that $(1 - α) d_{1} > 1$ . Hence, we get that for sufficiently large $n_{0}$ , the following holds with probability at least $1 - ε_{1} - ε_{2}$ for all $m \geq n_{0}$ :

‖ x_{m} - x^{*} ‖ = O (\frac{1}{\sqrt{n_{0}}} \log^{1 / 2} (\frac{1}{ε_{1}}) + \sqrt{\frac{\log (n_{0})}{n_{0}}} \frac{1}{\sqrt{ε_{2}}} (\frac{n_{0}}{m} + \frac{1}{n_{0}})) .
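The decay estimate $\exp (- (1 - α) b_{n_{0}} (m - 1)) \leq \frac{n_{0} + 1}{m + 1}$ used to reach this rate can be checked numerically. The following Python sketch assumes $b_{n_{0}} (m - 1) = \sum_{i = n_{0}}^{m - 1} a (i)$ , as suggested by the first step of the display above, together with $a (i) = d_{1} / (i + 1)$ ; the values of $α$ , $d_{1}$ , and $n_{0}$ are hypothetical.

```python
import math


def decay_factor(alpha, d1, n0, m):
    """exp(-(1 - alpha) * sum_{i=n0}^{m-1} d1 / (i + 1))."""
    b = sum(d1 / (i + 1) for i in range(n0, m))
    return math.exp(-(1.0 - alpha) * b)


def decay_bound_holds(alpha, d1, n0, m_max, tol=1e-12):
    """Check decay_factor <= (n0 + 1) / (m + 1) for n0 <= m <= m_max."""
    if (1.0 - alpha) * d1 <= 1.0:
        raise ValueError("the estimate requires (1 - alpha) * d1 > 1")
    return all(
        decay_factor(alpha, d1, n0, m) <= (n0 + 1) / (m + 1) + tol
        for m in range(n0, m_max + 1)
    )


ok = decay_bound_holds(alpha=0.5, d1=3.0, n0=10, m_max=500)
```

The condition $(1 - α) d_{1} > 1$ is what converts the polynomial factor ${(\frac{n_{0} + 1}{m + 1})}^{d_{1} (1 - α)}$ into the simpler $\frac{n_{0} + 1}{m + 1}$ , and hence produces the $n_{0} / m$ term in the final rate.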

Endnote

¹ In Chandak et al. (2022), TD(0) does not satisfy Assumption (6), which is required in the proof of Lemma 1 that shows almost-sure boundedness of the iterates. The authors thank Zaiwei Chen for pointing this out.

References

Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 70 (PMLR, New York), 263–272.
Bhandari J, Russo D, Singal R (2018) A finite time analysis of temporal difference learning with linear function approximation. Proc. 31st Conf. Learn. Theory (PMLR, New York), 1691–1692.
Borkar VS (1991) Topics in Controlled Markov Chains, Pitman Research Notes in Mathematics Series, vol. 240 (Longman Scientific & Technical, Harlow, Essex, UK).
Borkar VS (2002) On the lock-in probability of stochastic approximation. Combin. Probab. Comput. 11(1):11–20.
Borkar VS (2022) Corrigendum to “A concentration bound for contractive stochastic approximation” [Syst. Control Lett. 153 (2021) 104947]. Systems Control Lett. 153:104947.
Chandak S, Borkar VS, Dodhia P (2022) Concentration of contractive stochastic approximation and reinforcement learning. Stochastic Systems 12(4):411–430.
Chandak S, Borkar VS, Dolhare H (2023) A concentration bound for LSPE( $λ$ ). Systems Control Lett. 171:105418.
Chen Z, Maguluri ST, Zubeldia M (2025) Concentration of contractive stochastic approximation: Additive and multiplicative noise. Ann. Appl. Probab. 35(2):1298–1352.
Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2020) Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. NIPS’20: Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 8223–8234.
Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2021) A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants. Preprint, submitted February 2, https://arxiv.org/abs/2102.01567.
Chung F, Lu L (2006) Concentration inequalities and martingale inequalities: A survey. Internet Math. 3(1):79–127.
Dalal G, Szörényi B, Thoppe G, Mannor S (2018) Finite sample analyses for TD(0) with function approximation. Proc. AAAI Conf. Artificial Intelligence, vol. 32 (AAAI Press, Palo Alto, CA).
Even-Dar E, Mansour Y (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5:1–25.
Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, eds. NIPS’18: Proc. 32nd Internat. Conf. Neural Inform. Processing Systems, vol. 31 (Curran Associates, Inc., Red Hook, NY), 4868–4878.
Kamal S (2010) On the convergence, lock-in probability, and sample complexity of stochastic approximation. SIAM J. Control Optim. 48(8):5178–5192.
Li G, Cai C, Chen Y, Wei Y, Chi Y (2023) Is Q-learning minimax optimal? A tight sample complexity analysis. Oper. Res. 72(1):222–236.
Liu Q, Watbled F (2009) Exponential inequalities for martingales and asymptotic properties of the free energy of directed polymers in a random environment. Stochastic Processes Their Appl. 119(10):3101–3132.
Meerkov SM (1972) Simplified description of slow Markov walks. Part II. Automation Remote Control 33(5):761.
Patil G, Prashanth LA, Nagaraj D, Precup D (2023) Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation. Proc. 26th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 206 (PMLR, New York), 5438–5448.
Prashanth LA, Korda N, Munos R (2021) Concentration bounds for temporal difference learning with linear function approximation: The case of batch data and uniform sampling. Machine Learn. 110(3):559–618.
Qu G, Wierman A (2020) Finite-time analysis of asynchronous stochastic approximation and Q-learning. Proc. 33rd Conf. Learn. Theory (PMLR, New York), 3185–3205.
Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Proc. 32nd Conf. Learn. Theory (PMLR, New York), 2803–2830.
Tao T, Vu V (2015) Random matrices: Universality of local spectral statistics of non-Hermitian matrices. Ann. Probab. 43(2):782–874.
Thoppe G, Borkar V (2019) A concentration bound for stochastic approximation via Alekseev’s formula. Stochastic Systems 9(1):1–26.
Tsitsiklis J, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.
Yang L, Wang M (2019) Sample-optimal parametric Q-learning using linearly additive features. Proc. 36th Internat. Conf. Machine Learn. (PMLR, New York), 6995–7004.
Yang Z, Jin C, Wang Z, Wang M, Jordan M (2020) On function approximation in reinforcement learning: Optimism in the face of large state spaces. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. NIPS’20: Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 13903–13916.

Volume 16, Issue 1

March 2026

Pages 1-107

Article Information

Received: December 16, 2023
Accepted: October 24, 2025
Published Online: December 26, 2025

Cite as

Siddharth Chandak, Vivek S. Borkar (2025) A Concentration Bound for TD(0) with Function Approximation. Stochastic Systems 16(1):44-60.

https://doi.org/10.1287/stsy.2023.0055
