Subsampling Algorithms for Semidefinite Programming

We derive a stochastic gradient algorithm for semidefinite optimization using randomization techniques. The algorithm uses subsampling to reduce the computational cost of each iteration, and the subsampling ratio explicitly controls granularity, i.e. the tradeoff between the cost per iteration and the total number of iterations. Furthermore, the total computational cost is directly proportional to the complexity (i.e. rank) of the solution. We study numerical performance on some large-scale problems arising in statistical learning.


Introduction
Beyond classic combinatorial relaxations [GW95], semidefinite programming has recently found a new stream of applications in machine learning [LCB + 02], geometry [WS06], statistics [dBEG06] or graph theory [SBXD06]. All these problems have a common characteristic: they have relatively low precision targets but form very large semidefinite programs for which obtaining second order models is numerically hopeless, which means that Newton based interior point solvers typically fail before completing even a single iteration. Early efforts focused on exploiting structural properties of the problem (sparsity, block patterns, etc.), but this has proven particularly hard for semidefinite programs. For very large problem instances, first-order methods remain at this point the only credible alternative. This follows a more general trend in optimization which seeks to significantly reduce the granularity of solvers, i.e. reduce the per iteration complexity of optimization algorithms rather than their total computational cost, thus allowing at least some progress to be made on problems that are beyond the reach of current algorithms.
In this work, we focus on the following spectral norm minimization problem

minimize ‖Σ_{j=1}^p y_j A_j + C‖_2 − b^T y
subject to y ∈ Q, (1)

in the variable y ∈ R^p, with parameters A_j ∈ S_n, for j = 1, . . . , p, b ∈ R^p and C ∈ S_n, where Q is a compact convex set. Throughout the paper, we also implicitly assume that the set Q ⊂ R^p is simple enough so that the complexity of projecting y on Q is relatively low compared to the other steps in the algorithm. The idea behind this paper stems from a recent result by [JLNS09], who used a mirror descent stochastic approximation algorithm for solving bilinear matrix games (see [Nes09], [PJ92] or [NY83] for more background), where subsampling is used to perform matrix vector products and produce an approximate gradient. Strikingly, the algorithm has a total complexity of O(n log n/ε²) when the problem matrix is n × n, hence only requires access to a negligible proportion of the matrix coefficients as the dimension n tends to infinity.
In parallel, recent advances in large deviations and random matrix theory have produced a stream of new randomization results for high dimensional linear algebra (see [FKV04,DKM06,AM07,KV09] among many others), motivated by the need to perform these operations on very large scale, sometimes streaming, data sets in applications such as machine learning, signal processing, etc. Similar subsampling techniques have been successfully applied to support vector machine classification [KBH08] or Fourier decomposition. Randomization results were used in [AK07] to produce complexity bounds for certain semidefinite programs arising in combinatorial relaxations of graph problems. Randomization was also used in [BLO02] and [BLO05] to approximate subdifferentials of functions that are only differentiable almost everywhere.
Our contribution here is to further reduce the granularity of first-order semidefinite programming solvers by combining subsampling procedures with stochastic approximation algorithms, deriving stochastic gradient methods for spectral norm minimization with very low complexity per iteration. In practice, the significantly larger per iteration complexity and memory requirements of interior point techniques mean that they often fail to complete a single iteration on very large problem instances. The CPU clock also runs much faster than RAM, so operations small enough to be performed entirely in cache (which runs at full speed) are much faster than those requiring larger data sets. Solver performance on very large problem instances is then often more constrained by memory bandwidth than by clock speed, hence, everything else being equal, algorithms running many cheap iterations will be much faster than those requiring fewer, more complex ones. Here, subsampling techniques allow us to produce semidefinite optimization algorithms with very low cost per iteration, where all remaining O(n²) operations have a small constant and can be performed in a single pass over the data.
We also observe that the relative approximation error in computing the spectral norm (or trace norm) of a matrix using subsampling is directly proportional to the numerical rank of that matrix, hence another important consequence of using subsampling techniques to solve large-scale semidefinite programs is that the total complexity of running the algorithm becomes explicitly dependent on the complexity (i.e. rank) of its solution.
The paper is organized as follows. Section 2 surveys some key results on randomized linear algebra and spectral norm approximations. In Section 3 we then derive a stochastic approximation algorithm for spectral norm minimization with very low cost per iteration and discuss some extensions to statistical learning problems. Finally, we present some numerical experiments in Section 4.

Notation
We write S_n the set of symmetric matrices of dimension n. For a matrix X ∈ R^{m×n}, we write ‖X‖_F its Frobenius norm, ‖X‖_tr its trace norm, ‖X‖_2 its spectral norm, σ_i(X) its i-th largest singular value and let ‖X‖_∞ = max_{ij} |X_{ij}|, while X^{(i)} is the i-th column of the matrix X and X_{(i)} its i-th row. We write vec(X) the vector of R^{mn} obtained by stacking up the columns of the matrix X and NumRank(X) the numerical rank of the matrix X, where NumRank(X) = ‖X‖_F²/‖X‖_2². Finally, when x ∈ R^n is a vector, we write ‖x‖_2 its Euclidean norm, while ‖·‖ is a general norm on R^m and ‖·‖_* its dual norm.

Randomized linear algebra
In this section, we survey several results by [DKM06] which, after a single pass on the data, sample columns to approximate matrix products and produce low rank matrix approximations with a complexity of O(sn) where s is the sampling rate.

Randomized matrix multiplication
Algorithm 1 Matrix multiplication
Input: A ∈ R^{m×n}, B ∈ R^{n×p} and s such that 1 ≤ s ≤ n.
1: Define a probability vector q ∈ R^n such that q_i = ‖A^{(i)}‖_2 ‖B_{(i)}‖_2 / Σ_{l=1}^n ‖A^{(l)}‖_2 ‖B_{(l)}‖_2, for i = 1, . . . , n.
2: Define subsampled matrices C ∈ R^{m×s} and R ∈ R^{s×p} as follows.
3: for i = 1 to s do
4: Pick an index j ∈ [1, n] with P(j = l) = q_l.
5: Set C^{(i)} = A^{(j)}/√(s q_j) and R_{(i)} = B_{(j)}/√(s q_j).
6: end for
Output: Matrix product CR approximating AB.
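The sampling and rescaling steps above can be sketched in a few lines of NumPy (a hedged illustration, not the implementation used in the experiments; the function name and interface are ours):

```python
import numpy as np

def subsampled_product(A, B, s, rng=None):
    """Sketch of algorithm 1: approximate A @ B from s sampled column/row pairs.

    Column i of A (and the matching row i of B) is drawn with probability
    proportional to ||A^(i)||_2 ||B_(i)||_2, then rescaled by 1/sqrt(s q_i)
    so that E[C @ R] = A @ B.
    """
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    q = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    q = q / q.sum()
    idx = rng.choice(n, size=s, p=q)
    scale = np.sqrt(s * q[idx])
    C = A[:, idx] / scale            # m x s
    R = B[idx, :] / scale[:, None]   # s x p
    return C @ R
```

By construction E[C @ R] = A @ B, and the expected Frobenius error decreases as 1/√s, matching Lemma 1 below.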
By construction, we have E[CR] = AB, and the following randomization result from [DKM07] controls the precision of the approximations in algorithm 1.
Lemma 1 Let A ∈ R^{m×n} and B ∈ R^{n×p}. Given a subsampling rate s such that 1 ≤ s ≤ n, suppose that C ∈ R^{m×s} and R ∈ R^{s×p} are computed according to algorithm 1 above, then
E[‖AB − CR‖_F] ≤ ‖A‖_F ‖B‖_F /√s
and
‖AB − CR‖_F ≤ η ‖A‖_F ‖B‖_F /√s, where η = 1 + √(8 log(1/β)),
with probability at least 1 − β.
Note that using the adaptive probabilities q i is crucial here. The error bounds increase by a factor n when q i = 1/n for example.

Randomized low-rank approximation
Algorithm 2 below computes the leading singular vectors of a smaller matrix S, which is a subsampled and rescaled version of X. Here, the computational savings come from the fact that we only need to compute singular values of a matrix of dimension m × s with s ≤ n.

Algorithm 2 Low-rank approximation
Input: X ∈ R^{m×n} and k, s such that 1 ≤ k ≤ s < n.
1: Define a probability vector q ∈ R^n such that q_i = ‖X^{(i)}‖_2²/‖X‖_F², for i = 1, . . . , n.
2: Define a subsampled matrix S ∈ R^{m×s} as follows.
3: for i = 1 to s do
4: Pick an index j ∈ [1, n] with P(j = l) = q_l.
5: Set S^{(i)} = X^{(j)}/√(s q_j).
6: end for
Output: Matrix H ∈ R^{m×k} whose columns are the k leading left singular vectors of S.

Recall that computing k leading eigenvectors of a symmetric matrix of dimension s only requires matrix vector products, hence can be performed in O(ks² log s) operations using iterative algorithms such as the power method or the Lanczos method (see the appendix for details; as usual we omit the precision target in linear algebra operations, implicitly assuming that it is much finer than ε), so that the cost of computing k leading singular vectors of a matrix of size m × s is O(ksm log m).
This means that, given the probabilities q i , the total cost of obtaining k approximate singular vectors using algorithm 2 is O(ksm log m) instead of O(knm log m) for exact singular vectors. Of course, computing q i requires mn operations, but can be done very efficiently in a single pass over the data. We now recall the following result from [DKM06] which controls the precision of the approximations in algorithm 2.
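Algorithm 2 can be sketched as follows (an illustrative NumPy version under the same conventions as above; the function name and interface are ours, and a dense SVD stands in for the iterative Lanczos computation):

```python
import numpy as np

def subsampled_svd(X, k, s, rng=None):
    """Sketch of algorithm 2: approximate the k leading left singular vectors
    of X from an m x s column sample.

    Column j is drawn with probability q_j = ||X^(j)||_2^2 / ||X||_F^2 and
    rescaled by 1/sqrt(s q_j); the (cheap) SVD is computed on the sketch S.
    """
    rng = np.random.default_rng(rng)
    q = np.linalg.norm(X, axis=0) ** 2 / np.linalg.norm(X) ** 2
    idx = rng.choice(X.shape[1], size=s, p=q)
    S = X[:, idx] / np.sqrt(s * q[idx])
    H = np.linalg.svd(S, full_matrices=False)[0]
    return H[:, :k]
```

In a large-scale implementation the dense SVD above would of course be replaced by a Lanczos-type iterative method applied to the m × s sketch.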
Lemma 2 Let X ∈ R^{m×n} and 1 ≤ k ≤ s < n. Given a precision target ε > 0, if s ≥ 4/ε² and H ∈ R^{m×k} is computed as in algorithm 2, we have
E[‖X − HH^T X‖_2²] ≤ ‖X − X_k‖_2² + ε‖X‖_F²,
and if in addition s > 4η²/ε², where η = 1 + √(8 log(1/β)) for β ∈ [0, 1], then with probability at least 1 − β,
‖X − HH^T X‖_2² ≤ ‖X − X_k‖_2² + ε‖X‖_F²,
where X_k is the best rank k approximation of X.
An identical precision bound holds in the Frobenius norm when s ≥ 4k/ε². We now adapt these results to our setting in the following lemma, which shows how to approximate the spectral radius of a symmetric matrix X using algorithm 2.

Lemma 3 Let X ∈ S_n, β ∈ [0, 1] and ε > 0. Suppose the matrix S ∈ R^{n×s} is computed from X according to algorithm 2 with sampling rate
s = η² (‖X‖_2²/ε²) NumRank(X)², (2)
where η = 1 + √(8 log(1/β)), then |σ_max(S) − σ_max(X)| ≤ ε with probability at least 1 − β.

Proof. Using the Hoffman-Wielandt inequality (see [SS90, Th. 3.1] or the proof of [DKM06, Th. 2] for example) we get
|σ_max(S) − σ_max(X)| ≤ ‖SS^T − XX^T‖_F / ‖X‖_2,
and Jensen's inequality together with the matrix multiplication result in Lemma 1 yields
‖SS^T − XX^T‖_F ≤ η ‖X‖_F²/√s
with probability at least 1 − β. Combining these two inequalities with the sampling rate in (2) yields the desired result.
The subsampling rate required to achieve a precision target ε has a natural interpretation. Indeed, s = η² (‖X‖_2²/ε²) NumRank(X)² is simply the squared ratio of the numerical rank of the matrix X over the relative precision target ε/‖X‖_2, times a factor η² controlling the confidence level. The numerical rank NumRank(X) always satisfies 1 ≤ NumRank(X) = ‖X‖_F²/‖X‖_2² ≤ Rank(X) and can be seen as a stable relaxation of the rank of the matrix X (see [RV07] for a discussion). Note also that, by construction, the subsampled matrix always has lower rank than the matrix X. The expectation bound is still valid if we drop the factor η.

Stochastic approximation algorithm
Below, we will use a stochastic approximation algorithm to solve problem (1) when the gradient is approximated using the subsampling algorithms detailed above. We focus on the following stochastic approximation of problem (1)

minimize E[‖π^{(s)}(Σ_{j=1}^p y_j A_j + C)‖_2] − b^T y
subject to y ∈ Q, (3)

written in the variable y ∈ R^p, with parameters A_j ∈ S_n, for j = 1, . . . , p, b ∈ R^p and C ∈ S_n, with 1 ≤ s ≤ n controlling the sampling rate, where the function ‖π^{(s)}(Σ_{j=1}^p y_j A_j + C)‖_2 and a subgradient with respect to y are computed using algorithms 1 and 2. For X ∈ S_n, we have written π^{(s)}(X) the subsampling/scaling operation used in algorithms 1 and 2, i.e. S = π^{(s)}(X) ∈ R^{n×s} is the random matrix defined in algorithm 2 whose columns are a scaled sample of the columns of X, where 0 < s < n controls the sampling rate; when subsampling matrix products as in algorithm 1, π^{(s)}(X) similarly denotes the matrix obtained by subsampling and rescaling rows of X. We also define A ∈ R^{n²×p} as the matrix whose columns are given by A^{(j)} = vec(A_j), j = 1, . . . , p.

Stochastic approximation algorithm
We show the following lemma, approximating the gradient of the function ‖π^{(s)}(Σ_{j=1}^p y_j A_j + C)‖_2 with respect to y and bounding its quadratic variation.
Lemma 4 Given A_j ∈ S_n for j = 1, . . . , p, with A ∈ R^{n²×p} defined as above, b ∈ R^p, C ∈ S_n and sampling rates s_1 and s_2, a (stochastic) subgradient of the function ‖π^{(s_1)}(Σ_{j=1}^p y_j A_j + C)‖_2 − b^T y with respect to y is given by the vector w ∈ R^p with
w = A^T vec(vv^T) − b,
where v ∈ R^n is a leading singular vector of the subsampled matrix S = π^{(s_1)}(X) formed in algorithm 2. Furthermore, the product A^T vec(vv^T) can be approximated using algorithm 1 to form an approximate gradient
g = π^{(s_2)}(A^T) π^{(s_2)}(vec(vv^T)) − b,
which satisfies the quadratic variation bound
E[‖g‖_2²] ≤ M_*². (5)

Proof. Iterated expectations give E[g] = E[w] ∈ ∂f(y). The sampling probabilities q_i used in approximating the matrix vector product A^T vec(vv^T) following algorithm 1 are defined as
q_i = ‖A_{(i)}‖_2 |vec(vv^T)_i| / Σ_{l=1}^{n²} ‖A_{(l)}‖_2 |vec(vv^T)_l|, i = 1, . . . , n².
As in [DKM07, Lemma 3], the quadratic variation of the approximate product π^{(s_2)}(A^T) π^{(s_2)}(vec(vv^T)) is then given by
E[‖π^{(s_2)}(A^T) π^{(s_2)}(vec(vv^T)) − A^T vec(vv^T)‖_2²] = (1/s_2)(Σ_{i=1}^{n²} ‖A_{(i)}‖_2 |vec(vv^T)_i|)² − (1/s_2)‖A^T vec(vv^T)‖_2².
With q_i defined as above, the Cauchy-Schwarz inequality and ‖vec(vv^T)‖_2 = 1 bound this quantity by ‖A‖_F²/s_2, and since ‖w‖_2 ≤ ‖A‖_F + ‖b‖_2, we get
E[‖g‖_2²] ≤ (‖A‖_F + ‖b‖_2)² + ‖A‖_F²/s_2 = M_*².

We now use this result to produce an explicit bound on the complexity of solving problems (3) and (1) by subsampling using a stochastic approximation algorithm. In this section, we let ‖·‖ be a general norm on R^p, we write ‖·‖_* its dual norm and define δ_*(p) as the smallest number such that ‖y‖_2 ≤ δ_*(p)‖y‖_* for all y ∈ R^p. Following the notation in [JLNS09, §2.3], we let ω(x) be a distance generating function, i.e. a function such that
Q^o = {x ∈ Q : ∂ω(x) ≠ ∅}
is a convex set. We assume that ω(x) is strongly convex on Q^o with modulus α with respect to the norm ‖·‖, which means
(x − y)^T(∇ω(x) − ∇ω(y)) ≥ α‖x − y‖², for all x, y ∈ Q^o.
We then define a prox-function V(x, y) on Q^o × Q as follows:
V(x, y) = ω(y) − ω(x) − ∇ω(x)^T(y − x),
which is nonnegative and strongly convex with modulus α with respect to the norm ‖·‖. The prox-mapping associated to V is then defined as
P_x(y) = argmin_{z∈Q} { y^T(z − x) + V(x, z) }.
Finally, we define the ω diameter of the set Q as
D_{ω,Q} = max_{z∈Q} ω(z) − min_{z∈Q} ω(z),
and we let γ_l for l = 0, . . . , N be a step size strategy.
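In the simplest Euclidean setting, ω(x) = ½‖x‖_2² gives α = 1 and V(x, z) = ½‖z − x‖_2², so the prox-mapping reduces to a projected subgradient step. A minimal sketch, assuming for illustration only that Q is a Euclidean ball of radius B (function names are ours):

```python
import numpy as np

# Euclidean setup: omega(x) = 0.5 ||x||_2^2 is strongly convex with modulus
# alpha = 1 and V(x, z) = 0.5 ||z - x||_2^2, so the prox-mapping is a plain
# projected subgradient step. Q is taken to be a Euclidean ball of radius B,
# an assumption made purely for illustration.

def project_ball(y, B=1.0):
    """Euclidean projection of y onto the ball {z : ||z||_2 <= B}."""
    nrm = np.linalg.norm(y)
    return y if nrm <= B else y * (B / nrm)

def prox_step(x, g, gamma, B=1.0):
    """P_x(gamma g) = argmin_{z in Q} { gamma g^T (z - x) + 0.5 ||z - x||^2 }."""
    return project_ball(x - gamma * g, B)
```

With other geometries of Q (e.g. the simplex with the entropy distance generating function), only these two functions would change.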
Algorithm 3 Spectral norm minimization using subsampling
Input: Matrices A_j ∈ S_n, for j = 1, . . . , p, b ∈ R^p and C ∈ S_n, sampling rates s_1 and s_2.
1: Pick initial y_0 ∈ Q
2: for l = 0 to N do
3: Compute v ∈ R^n, the leading singular vector of the matrix π^{(s_1)}(Σ_{j=1}^p y_{l,j} A_j + C), subsampled according to algorithm 2 with k = 1 and s = s_1.
4: Compute the approximate subgradient g_l = π^{(s_2)}(A^T) π^{(s_2)}(vec(vv^T)) − b, subsampling the matrix product according to algorithm 1 with s = s_2.
5: Compute the projection y_{l+1} = P_{y_l}(γ_l g_l).
6: Update the running average ỹ_N = Σ_{l=0}^N γ_l y_l / Σ_{l=0}^N γ_l.
7: end for
Output: An approximate solution ỹ_N ∈ R^p of problem (3) with high probability.
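For concreteness, the loop of algorithm 3 can be sketched end to end in NumPy (a simplified illustration: fixed step size, Euclidean prox with Q a ball of radius Q_radius, and dense linear algebra standing in for Lanczos; all names are ours):

```python
import numpy as np

def spectral_sa(A_list, b, C, Q_radius, s1, s2, N, gamma, rng=0):
    """Sketch of algorithm 3 for
        min_{||y||_2 <= Q_radius}  || sum_j y_j A_j + C ||_2 - b^T y.

    s1 subsamples the singular vector computation (algorithm 2, k = 1) and
    s2 subsamples the product A^T vec(v v^T) (algorithm 1).
    """
    rng = np.random.default_rng(rng)
    n, p = C.shape[0], len(A_list)
    A = np.column_stack([Aj.reshape(-1) for Aj in A_list])  # n^2 x p
    y = np.zeros(p)
    y_avg = np.zeros(p)
    for l in range(N):
        X = sum(yj * Aj for yj, Aj in zip(y, A_list)) + C
        # line 3: leading singular vector of a column-subsampled sketch
        q = np.linalg.norm(X, axis=0) ** 2 / np.linalg.norm(X) ** 2
        idx = rng.choice(n, size=s1, p=q)
        S = X[:, idx] / np.sqrt(s1 * q[idx])
        v = np.linalg.svd(S, full_matrices=False)[0][:, 0]
        # line 4: subsample the matrix vector product A^T vec(v v^T)
        w = np.outer(v, v).reshape(-1)
        qr = np.linalg.norm(A, axis=1) * np.abs(w)
        qr = qr / qr.sum()
        jdx = rng.choice(n * n, size=s2, p=qr)
        g = (A[jdx, :] * (w[jdx] / (s2 * qr[jdx]))[:, None]).sum(axis=0) - b
        # line 5: Euclidean prox step (projection on the ball Q)
        y = y - gamma * g
        nrm = np.linalg.norm(y)
        if nrm > Q_radius:
            y *= Q_radius / nrm
        y_avg += y
    return y_avg / N   # running average of the iterates
```

This is only meant to show how the pieces fit together; a practical implementation would precompute the row norms of A and use an iterative eigensolver on the sketch.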
The following results control the convergence of the robust stochastic approximation algorithm 3 (see [JLNS09], [Nes09], [PJ92] or [NY83] for further details). We call ȳ the optimal solution of problem (3); the lemma below characterizes the convergence speed in expectation.
Lemma 5 Given N > 0, let M_* be defined as in (5); using a fixed step size strategy γ_l = γ, l = 0, . . . , N, chosen as in [JLNS09, §2.3], the iterates of algorithm 3 satisfy
E[f(ỹ_N)] − f(ȳ) ≤ O(1) δ_*(p) M_* (D_{ω,Q}/(αN))^{1/2}.

By Markov's inequality, Lemma 5 means that we need at most
N = O(1) D_{ω,Q} δ_*(p)² M_*² / (α ε² β²) (9)
iterations to get an ε solution to problem (3) with confidence at least 1 − β. Typically, the prox function ω and the norm ‖·‖ are chosen according to the geometry of Q, to minimize N. The choice of norm also affects δ_*(p), and obtaining better bounds on M_* in (5) for generic norms would further tighten this complexity estimate. We now call y* the solution to the original (deterministic) spectral norm minimization problem (1) and bound the suboptimality of ỹ_N in the (true) problem (1) with high probability.
Theorem 1 Let y* be the solution to problem (1), write Y* = Σ_{j=1}^p y*_j A_j + C, and let f(y) be the objective function of the stochastic problem (3). If the sampling rate s_1 is set to
s_1 = η² (‖Y*‖_2²/ε²) NumRank(Y*)², (8)
then the iterate ỹ_N of algorithm 3 satisfies
f(ỹ_N) ≤ ‖Y*‖_2 − b^T y* + 2ε
with probability at least 1 − β.
Proof. Recall that we have written y* the solution to the original (deterministic) problem (1), ȳ the solution to the approximate (stochastic) problem (3) and ỹ_N the N-th iterate of algorithm 3 above. Lemma 5 on the convergence of ỹ_N to the solution of the stochastic problem in (3) means
f(ỹ_N) ≤ f(ȳ) + ε (10)
with probability at least 1 − β. By definition, ȳ minimizes the stochastic problem, so in particular
f(ȳ) ≤ f(y*).
Now, with s_1 defined as above, Lemma 3 on the quality of the subsampling approximation to ‖·‖_2 shows that if the sampling rate is set as in (8) then
|‖π^{(s_1)}(Y*)‖_2 − ‖Y*‖_2| ≤ ε
with probability at least 1 − β, and Jensen's inequality yields
|f(y*) − (‖Y*‖_2 − b^T y*)| ≤ ε,
which bounds the difference between the minimum of the (true) problem in (1) and the value f(y*) of its stochastic approximation in (3). Combining this with inequality (10), we finally get
f(ỹ_N) ≤ ‖Y*‖_2 − b^T y* + 2ε
with probability at least 1 − β, which is the desired result.
This result allows us to bound the oracle complexity of solving (1) by subsampling. In practice of course, both the spectral norm and the numerical rank of the solution matrix Y* = Σ_{j=1}^p y*_j A_j + C are unknown. However, assuming we have a stopping criterion, i.e. a function which efficiently certifies that a given y ∈ R^p is optimal, we can search for the minimum sampling rate in (8) by starting from a low target and doubling the sampling rate until we obtain an optimal solution. The simple lemma below explicitly summarizes the complexity of this procedure.
Lemma 6 Suppose we start from a sampling rate s = 1 and run algorithm 3 repeatedly, doubling the sampling rate until the stopping criterion certifies the solution is optimal. Then, with probability at least 1 − β, algorithm 3 needs to be run at most ⌈log₂(s₁)⌉ times, where s₁ is given in (8), before it finds an optimal solution to problem (1).
Proof. Starting from s = 1, we simply need to double the sampling rate at most ⌈log₂(s₁)⌉ times before it becomes larger than s₁. At the sampling rate s = s₁, algorithm 3 will produce an optimal solution with probability 1 − β.
In fact, we will see below that the complexity of each iteration is dominated by a term O(sn log n), where s is the sampling rate, and because Σ_{i=0}^{⌈log₂ s₁⌉} 2^i ≤ 4s₁, we then observe that searching for the minimal sampling rate by repeatedly solving (3) for increasing sampling rates will be less than four times as expensive as solving the problem in the oracle case.
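The doubling search of Lemma 6 is a few lines of code; `solve` and `is_optimal` below are hypothetical callables standing in for algorithm 3 and the stopping criterion of Section 3.3:

```python
def minimal_sampling_rate(solve, is_optimal, s_max):
    """Doubling search of Lemma 6: run the solver with sampling rates
    s = 1, 2, 4, ... until the stopping criterion certifies optimality.

    `solve(s)` returns a candidate solution computed at sampling rate s;
    `is_optimal(y)` is the certification oracle. Both are placeholders.
    """
    s = 1
    while s <= s_max:
        y = solve(s)
        if is_optimal(y):
            return s, y
        s *= 2
    raise RuntimeError("no certified solution up to s_max")
```

Since the rates tried sum to at most 4 s₁, the total work is within a factor four of the oracle case, as noted above.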
Typically, producing a stopping oracle means computing a duality gap, and we will show in §3.3 how this can be done efficiently here. Note that in the absence of such a stopping criterion, the minimum sampling rate in (8) has to be enforced over all matrices X = Σ_{j=1}^p y_j A_j + C, which considerably increases overall complexity. The next section provides a detailed analysis of the complexity of algorithm 3 as a function of ε, s₁ and s₂.

Complexity
We now study in detail the complexity of algorithm 3. Suppose we are given a precision target ε and fix the sampling rate s₂ arbitrarily between 1 and n², with the sampling rate s₁ set as in Theorem 1. The cost of each iteration in algorithm 3 breaks down as follows.
• On line 3: Computing the leading singular vector v, using algorithm 2 with k = 1. This means first computing the probabilities q_i at a cost of O(n²) operations. Forming the matrix S = π^{(s₁)}(Σ_{j=1}^p y_{l,j} A_j + C) costs O(ns₁) operations. It remains to compute the leading singular vector of S using the Lanczos method at a cost of O(s₁ n log n) (cf. the appendix for details). The total numerical cost of this step is then bounded by c₁n² + c₂s₁n log n, where c₁ and c₂ are absolute constants. Here, c₁ is always less than ten, while c₂ is the number of iterations required by the Lanczos method to reach a fixed precision target (typically 1e-8 or better here), hence we have c₁ ≪ c₂.
• On line 4: Computing the approximate subgradient g_l = π^{(s₂)}(A^T) π^{(s₂)}(vec(vv^T)) − b, by subsampling the matrix product using algorithm 1. This means again forming the vector q at a cost of O(n²) (the row norms of A can be precomputed). Computing the subsampled matrix vector product then costs O(ps₂). Both of these complexity bounds have low constants.
• On line 5: Computing the projection y_{l+1} = P_{y_l}(γ_l g_l), whose numerical cost will be denoted by c(p).
Let us remark in particular that all O(n²) operations above only require one pass over the data, which means that the entire data set does not need to fit in memory. Using the bound on the quadratic variation of the gradient computed in Lemma 4, we can then bound the number of iterations required by algorithm 3 to produce an ε-solution to problem (1) with probability at least 1 − β. Let us call Y* = Σ_{j=1}^p y*_j A_j + C, and recall that η = 1 + √(8 log(1/β)). Table 1 summarizes these complexity bounds and compares them with complexity bounds for a stochastic approximation algorithm without subsampling.

Table 1: Complexity bounds with and without subsampling.

Complexity    Stoch. Approx.      Stoch. Approx. with Subsampling
Per Iter.     c₄n²p + c(p)        c₁n² + c₃ps₂ + c₂(η²‖Y*‖₂²/ε²) NumRank(Y*)² n log n + c(p)
Num. Iter.    given by (9)        given by (9)

We observe that subsampling affects the complexity of solving problem (1) in two ways. Decreasing the (matrix product) subsampling rate s₂ ∈ [1, n²] decreases the cost of each iteration but increases the number of iterations in the same proportion, hence has no explicit effect on the total complexity bound. In practice of course, because of higher cache memory speed and better bandwidth on smaller problems, cheaper iterations tend to run more efficiently than more complex ones. The impact of the (singular vector) subsampling rate s₁ ∈ [1, n] is much more important however, since computing the leading eigenvector of the current iterate is the most complex step in the algorithm when solving problem (1) using stochastic approximation. Because c₁, c₃ ≪ c₂, the per iteration complexity of solving large-scale problems essentially follows
c₂ (η²‖Y*‖₂²/ε²) NumRank(Y*)² n log n,
hence explicitly depends on both the numerical rank of the solution matrix Y* = Σ_{j=1}^p y*_j A_j + C and on the relative precision target ε/‖Y*‖₂. This means that problems with simpler solutions will be solved more efficiently than problems whose solutions have a high rank.
The choice of norm · and distance generating function also has a direct impact on complexity through c(p) and δ * (p)M * . Unfortunately here, subsampling error bounds are only available in the Frobenius and spectral norms hence part of the benefit of choosing optimal norm/distance generating function combinations is sometimes lost in the norm ratio bound δ * (p). However, choosing a norm/prox function combination according to the geometry of Q can still improve the complexity bound compared to a purely Euclidean setting.
Finally, subsampling can have a more subtle effect on complexity. By construction, solutions to problem (1) tend to have multiple leading singular values which coalesce near the optimum. Introducing noise by subsampling can potentially break this degeneracy and increase the gap between leading eigenvalues. Since the complexity of the algorithm depends in great part on the complexity of computing a leading singular vector using iterative methods such as the power method or the Lanczos method (cf. Appendix), and the complexity of these methods decreases as the gap between the two leading singular values increases, subsampling can also improve the efficiency of iterative singular value computations.

Surrogate Duality Gap
In practice, we often have no a priori knowledge of NumRank(Y*)², and if the sampling rate s is set too low, it is possible for the algorithm to terminate at a suboptimal point Y where the subsampling error is less than ε (if the error at the true optimal point Y* is much larger than ε). In order to search for the optimal sampling rate s as in Lemma 6, we first need to check for optimality in (1), and we now show how to track convergence in algorithm 3 by computing a surrogate duality gap, at a cost roughly equivalent to that of computing a subgradient. The dual of problem (1) is written

maximize Tr(CX) − S_Q(w)
subject to w_j = b_j − Tr(A_j X), j = 1, . . . , p, (11)
‖X‖_tr ≤ 1,

in the variables X ∈ S_n and w ∈ R^p, where S_Q(v) is the support function of the set Q, defined as
S_Q(v) = sup_{y∈Q} v^T y.
For instance, when Q is a Euclidean ball of radius B, problem (11) becomes

maximize Tr(CX) − B‖w‖_2
subject to w_j = b_j − Tr(A_j X), j = 1, . . . , p,
‖X‖_tr ≤ 1,

in the variables X ∈ S_n and w ∈ R^p. The leading singular vector v in algorithm 3 always satisfies ‖vv^T‖_tr ≤ 1, hence we can track convergence in solving (1) by computing the following surrogate duality gap
gap = ‖Σ_{j=1}^p y_j A_j + C‖_2 − b^T y − Tr(C vv^T) + S_Q(w), where w_j = b_j − Tr(A_j vv^T), j = 1, . . . , p.
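When Q is a Euclidean ball of radius B, the surrogate gap can be evaluated directly from the current point y and the singular vector v; a sketch (function name and interface are ours):

```python
import numpy as np

def surrogate_gap(A_list, b, C, y, v, B):
    """Surrogate duality gap for problem (1) when Q is a Euclidean ball of
    radius B. The dual point X = v v^T, built from the current leading
    singular vector v, satisfies ||X||_tr <= 1 by construction."""
    X = sum(yj * Aj for yj, Aj in zip(y, A_list)) + C
    primal = np.linalg.norm(X, 2) - b @ y          # spectral norm objective
    Xd = np.outer(v, v)                            # dual feasible point
    w = np.array([bj - np.trace(Aj @ Xd) for bj, Aj in zip(b, A_list)])
    dual = np.trace(C @ Xd) - B * np.linalg.norm(w)
    return primal - dual
```

The cost is dominated by the traces Tr(A_j vv^T), i.e. roughly the cost of one exact subgradient evaluation.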

Minimizing the sum of the k largest singular values
Motivated by applications in statistical learning, we now discuss direct extensions of the results above to the problem of minimizing the sum of the k largest singular values of an affine combination of matrices, written

minimize Σ_{i=1}^k σ_i(Σ_{j=1}^p y_j A_j + C) − b^T y
subject to y ∈ Q, (14)

in the variable y ∈ R^p, with parameters A_j ∈ S_n, for j = 1, . . . , p, b ∈ R^p and C ∈ S_n. As in the previous section, we also form its stochastic approximation

minimize E[Σ_{i=1}^k σ_i(π^{(s)}(Σ_{j=1}^p y_j A_j + C))] − b^T y
subject to y ∈ Q,

in the variable y ∈ R^p, with 1 ≤ s ≤ n controlling the sampling rate. We now prove an analog of Lemma 3 for this new objective function.

Lemma 7 Let X ∈ S_n, β ∈ [0, 1] and ε > 0. Suppose the matrix S ∈ R^{n×s} is computed from X according to algorithm 2 with sampling rate
s = η² ‖X‖_F⁴ Rank(X) / (ε² σ_r(X)²), (16)
where r = min{k, Rank(X)} and η = 1 + √(8 log(1/β)), then |Σ_{i=1}^k σ_i(S) − Σ_{i=1}^k σ_i(X)| ≤ ε with probability at least 1 − β.

Proof. Because Rank(SS^T) ≤ Rank(XX^T) by construction, we always have
Σ_{i=1}^k σ_i(S) = Σ_{i=1}^r σ_i(S) and Σ_{i=1}^k σ_i(X) = Σ_{i=1}^r σ_i(X),
where r = min{k, Rank(X)}. Because the sum of the k largest singular values is a unitarily invariant norm on S_n (see [HJ91, §3.4]), Mirsky's theorem (see [SS90, Th. 4.11] for example) shows that
Σ_{i=1}^r |σ_i(SS^T) − σ_i(XX^T)| ≤ √r ‖SS^T − XX^T‖_F,
and because, by construction, the range of SS^T is included in the range of XX^T, we must have
|Σ_{i=1}^r σ_i(S) − Σ_{i=1}^r σ_i(X)| ≤ √r ‖SS^T − XX^T‖_F / σ_r(X).
Jensen's inequality together with the matrix multiplication result in Lemma 1 yield
‖SS^T − XX^T‖_F ≤ η ‖X‖_F²/√s
with probability at least 1 − β. Combining these inequalities with the sampling rate in (16) yields the desired result.
Once again, the subsampling rate in the above lemma has a clear interpretation: writing it as
s = η² (‖X‖_2²/ε²) NumRank(X)² κ(X)² Rank(X), with κ(X) = ‖X‖_2/σ_r(X),
it is the product of a term representing relative precision, a term reflecting the rank of X and a term in κ(X) representing its (pseudo) condition number. Note that the bound can be further refined when σ_r ≤ ε. Lemma 7 allows us to compute the gradient by subsampling when using algorithm 3 to solve problem (14). The remaining steps in the algorithm are identical, except that the matrix vv^T is replaced by a combination of matrices formed using the k leading singular vectors.

Applications & numerical results
In this section, we first detail a few instances of problem (1) arising in statistical learning. We then study the numerical performance of the methods detailed here on large scale problems.

Spectral norm minimization
For a given matrix A ∈ S_n, we begin by studying a simple instance of problem (1), written

minimize ‖A + U‖_2
subject to ‖U‖_∞ ≤ ρ, (17)

in the matrix U ∈ S_n. This problem is closely related to a relaxation for sparse PCA (see [dEGJL07]) and we use it in the next section to test the numerical performance of algorithm 3. The complexity of the main step in the algorithm (i.e. computing the gradient) is controlled by the sampling rate in Lemma 3, which is written
s = η² (‖A + U*‖_2²/ε²) NumRank(A + U*)²,
where U* ∈ S_n is the optimal solution to problem (17).

Matrix factorization and collaborative filtering
Matrix factorization methods have been heavily used to solve collaborative filtering problems (e.g. the Netflix problem) and we refer the reader to [Sre04], [Bac07], [RFP07] or [CR08] for details. All these references solve particular instances of problem (14), written in the variable y ∈ R^p, where Q is a low dimensional norm ball for example and the matrices A_j have a block format with only a few nonzero coefficients. Here, the trace norm can be understood as a convex lower bound on the rank function (as in [FHB01]) but sometimes also has a direct interpretation in terms of learning (see [Sre04]). In this particular case, the complexity of the main step in the algorithm (i.e. computing the gradient) is controlled by the sampling rate in Lemma 7, and the bound can be further refined when σ_r ≤ ε. In practice, the complexity of solving problem (18) can often be further reduced using the simple observation that an optimal solution of (14) will also be optimal in (18) whenever Rank(Y*_k) < k, where Y*_k is the optimal solution to (14) here. Once again, the sampling rate s has a natural interpretation as the product of a relative precision term, a term reflecting the condition number of the solution and the rank of the optimal solution. It means in particular that problems whose solutions have a lower rank are explicitly easier to solve than problems with more complex solutions.

LASSO
Consider the following particular instance of problem (14)

minimize ‖y‖_1
subject to ‖Ay − b‖_2 ≤ σ, (19)

in the variable y ∈ R^n, with A ∈ R^{m×n}, b ∈ R^m and σ > 0. This is a (somewhat trivial) version of problem (14), where the matrices are diagonal and Q is an ellipsoid. This problem is directly related to the LASSO, i.e. ℓ₁-penalized regression (see [Tib96]). In the diagonal case, the low rank matrix approximation produced by algorithm 2 simply picks s coefficients of y with probability proportional to their squared magnitude. Here, computing the gradient is trivial, but computational savings come from the fact that the bound on the quadratic variation of the gradient in (5) is now equal to the sampling rate, so M_* = s in the complexity estimate (9). With s chosen as above, i.e.
s = η² (y*²_{[1]}/ε²) κ(y*)² Card(y*), where κ(y*) = y*_{[1]}/y*_{[r]} and r = Card(y*),
this means that the complexity of solving problem (19) is (explicitly) proportional to the cardinality of the solution Card(y*). Of course, this algorithm is not competitive with specialized algorithms for solving (19), but this subsampling bound provides some theoretical support for the empirical observations made in [DT06] using homotopy methods.
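In this diagonal case, the subsampling step of algorithm 2 reduces to sampling coordinates of y; a small sketch and sanity check (function name is ours):

```python
import numpy as np

def subsample_coordinates(y, s, rng=None):
    """Diagonal specialization of algorithm 2: sample s coordinates of y with
    probability q_i = y_i^2 / ||y||_2^2, rescaled by 1/sqrt(s q_i)."""
    rng = np.random.default_rng(rng)
    q = y ** 2 / np.dot(y, y)
    idx = rng.choice(len(y), size=s, p=q)
    return idx, y[idx] / np.sqrt(s * q[idx])
```

Accumulating the squared rescaled values per coordinate recovers an estimate of the spectral norm of diag(y), i.e. max_i |y_i|, and coordinates with larger magnitude are kept far more often, which is the mechanism behind the Card(y*) dependence above.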

Fastest mixing Markov chain on a graph
As in [BDX04], suppose we are given a connected graph with vertex set V = {1, . . . , n} and edge set E ⊆ V × V, with (i, j) ∈ E ⇔ (j, i) ∈ E, where all vertices have a self-loop. We define a Markov chain on this graph with transition probability matrix P ∈ R^{n×n}, where
P_{ij} = Prob(X_{t+1} = j | X_t = i), i, j = 1, . . . , n.
This matrix satisfies P = P^T and P1 = 1, which means that the equilibrium distribution of this Markov chain is uniform and that the largest singular value of P is equal to one. The asymptotic rate of convergence of this Markov chain to its equilibrium distribution is controlled by the second singular value of P, with smaller values of σ₂(P) producing faster convergence. [BDX04] exploited this property to show that the fastest mixing Markov chain on the graph (V, E) could be computed by minimizing σ₂(P) over all possible transition matrices on the graph, i.e. by solving

minimize σ₂(P)
subject to P ≥ 0, P1 = 1, P = P^T,

in the variable P ∈ R^{n×n}. The optimal mixing rate is often significantly faster than the rate provided by classical chains such as maximum degree or Metropolis-Hastings. Because σ₁(P) = 1 here, this is a particular instance of problem (14) where k = 2 and the matrices A_j are sparse. Projections on Q can be handled as in [BDX04]. Once again, the complexity of the main step in the algorithm (i.e. computing the gradient) is controlled by the sampling rate in Lemma 7, which simplifies to
s = (η²/(ε² σ₂(P*)²)) NumRank(P*)² Rank(P*),
where P* is the transition matrix of the fastest mixing Markov chain. Once again, simpler transition matrices mean the algorithm converges faster.

Numerical experiments
In this section, we test the quality of the subsampling approximations detailed in Section 2 on various matrices. We also evaluate the performance of the algorithms detailed above on large scale problem instances. Numerical code reproducing these experiments is available from the author's webpage.
Randomized low-rank approximations. Here, we first measure the quality of the randomized low-rank matrix approximation on both randomly generated matrices and on covariance matrices formed using gene expression data. Because the spectrum of naive large scale random matrices is very structured, these examples are too simple to appropriately benchmark numerical error in algorithm 2. Fortunately, as we will see below, generating random symmetric matrices with a given spectral measure is straightforward.
Suppose X ∈ S n is a matrix with normally distributed coefficients, X ij ∼ N (0, 1), i, j = 1, . . . , n. If we write its QR decomposition, X = QR with Q, R ∈ R n×n , then the orthogonal matrix Q is Haar distributed on the orthogonal group O n (see [Dia03] for example). This means that to generate a random matrix with given spectrum µ ∈ R n , we generate a normally distributed matrix X, compute its QR decomposition and the matrix Q diag(µ)Q T will be uniformly distributed on the set of symmetric matrices with spectrum µ. Because the spectral measure of "natural" covariance matrices often follows a power law (Tracy-Widom in the Gaussian case, see [Joh01] and [EK07] for a discussion), we sample the spectrum µ from a beta distribution with various exponents to get realistic random matrices with a broad range of numerical ranks. We also use a covariance matrix formed using the gene expression data set in [ABN + 99].
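This recipe for sampling a symmetric matrix with prescribed spectrum is a few lines of NumPy (including the standard sign correction that makes the QR factor exactly Haar distributed; the function name is ours):

```python
import numpy as np

def random_symmetric_with_spectrum(mu, rng=None):
    """Draw a symmetric matrix uniformly among those with spectrum mu:
    conjugate diag(mu) by a Haar-distributed orthogonal matrix obtained
    from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(rng)
    n = len(mu)
    G = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(G)
    Q = Q * np.sign(np.diag(R))  # sign fix: make Q Haar distributed
    return Q @ np.diag(mu) @ Q.T
```

Sampling mu from a beta distribution as described above then produces test matrices with a controlled range of numerical ranks.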
In Figure 1, we plot relative error ε/‖X‖_2 versus numerical rank NumRank(X), in loglog scale, with 20% subsampling and n = 500, on random matrices generated as above and on the gene expression covariance from [ABN + 99]. We notice that, on these experiments, the relative error grows at most linearly with the numerical rank of the matrix, as predicted by Lemma 3. We then plot the histogram, in semilog scale, of the ratio of the relative error ε/‖X‖_2 over the theoretical bound η NumRank(X)/√s for random matrices with n = 500. In Figure 2, we plot relative error ε/‖X‖_2 versus sampling rate s, in loglog scale, for a gene expression covariance with n = 500. Once again, the error decreases as 1/√s, as predicted by Lemma 3. We also plot the median speedup factor (over ten runs) in computing largest magnitude eigenvalues using ARPACK with and without subsampling on a gene expression covariance matrix with n = 2000, for various values of the sampling ratio s/n. Note that both exact and subsampled eigenvalues are computed using direct MEX calls to ARPACK by [LSY98], as eigs (MATLAB's interface to ARPACK) carries a massive overhead. In all the experiments above, the confidence level used in computing η was set to 99%.
Stochastic approximation with subsampling. In Figure 3, we generate a sample ratings matrix X = V V^T for the collaborative filtering problem in §4.2, where V is a discrete feature matrix V ∈ {0, 1, 2}^{100×3}. We "observe" only 30% of the ratings and solve problem (14) with k = 4 to approximately reconstruct the full ratings matrix. We plot objective value versus CPU time in seconds for this sample matrix factorization problem, using a stochastic approximation algorithm with deterministic gradient or the subsampled gradient algorithm 3 with subsampling ratio s_1/n set at 20%. We also plot surrogate duality gap versus CPU time on the same example. We notice that while the subsampled algorithm converges much faster than the deterministic one, the quality of the surrogate dual points and duality gap produced using subsampled gradients as in §3.3 is worse than in the deterministic case.

Figure 1: Left: Loglog plot of relative error ǫ/‖X‖_2 versus numerical rank NumRank(X) with 20% subsampling and n = 500 on random matrices (blue dots) and gene expression covariance (red square). The dashed line has slope one in loglog scale. Right: Histogram plot in semilog scale of relative error ǫ/‖X‖_2 over theoretical bound η NumRank(X)/√s for random matrices with n = 500.
In Table 2, using the same 20% sampling rate, we compare CPU time versus problem dimension n for subsampled and deterministic algorithms when solving the following instance of problem (1)

minimize ‖C + X‖_2
subject to ‖X‖_∞ ≤ ρ

in the variable X ∈ S n , where C is a covariance matrix constructed using a subset of size n of the variables in [ABN + 99], for various values of n. Finally, we generate and solve sample collaborative filtering problems as in (14) for ratings matrices of various dimensions n. We report median CPU time over ten sample problems in Table 3. Here, subsampling speeds up the algorithm by an order of magnitude; however, the stochastic approximation algorithm is still not competitive with (non convex) local minimization techniques over low rank matrices.
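For intuition, the deterministic baseline on this test problem can be sketched as a plain projected subgradient method. This is not the paper's subsampled algorithm, just a minimal reference implementation: a subgradient of the spectral norm at C + X is ±u_1 u_1^T for a leading-magnitude eigenpair, and projecting onto {‖X‖_∞ ≤ ρ} is entrywise clipping.

```python
import numpy as np

def proj_subgradient(C, rho, iters=200):
    """Projected subgradient method for: minimize ||C + X||_2  s.t.  |X_ij| <= rho."""
    n = C.shape[0]
    X = np.zeros((n, n))
    best = np.inf
    for t in range(1, iters + 1):
        w, V = np.linalg.eigh(C + X)
        i = np.argmax(np.abs(w))                        # eigenvalue of largest magnitude
        best = min(best, np.abs(w[i]))                  # current objective value
        g = np.sign(w[i]) * np.outer(V[:, i], V[:, i])  # subgradient of the spectral norm
        X = np.clip(X - g / np.sqrt(t), -rho, rho)      # step, then project onto the box
    return X, best

rng = np.random.default_rng(1)
G = rng.standard_normal((30, 30))
C = (G + G.T) / 2
X_hat, val = proj_subgradient(C, rho=0.1)
```

Since X = 0 is feasible, the best objective value found is never worse than ‖C‖_2.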

Appendix
The complexity results detailed above heavily rely on the fact that extracting one leading eigenvector of a symmetric matrix X ∈ S n can be done by computing a few matrix vector products. While this simple fact is easy to prove using the power method when the eigenvalues of X are well separated, the problem becomes significantly more delicate when the spectrum of X is clustered. The section that follows briefly summarizes how modern numerical methods solve this problem in practice.
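As a point of reference, the power method mentioned above can be written in a few lines; this is an illustrative sketch, not one of the solvers used in the paper. Its convergence rate degrades as |λ_2/λ_1| approaches one, which is precisely the clustered-spectrum difficulty discussed below.

```python
import numpy as np

def power_method(X, iters=500, tol=1e-10, rng=np.random.default_rng(0)):
    """Power iteration: approximate leading (largest magnitude) eigenpair of symmetric X."""
    u = rng.standard_normal(X.shape[0])
    u /= np.linalg.norm(u)
    lam = 0.0
    for _ in range(iters):
        v = X @ u                   # one matrix-vector product per iteration
        nv = np.linalg.norm(v)
        if nv == 0:
            break
        v /= nv
        lam_new = v @ X @ v         # Rayleigh quotient estimate of the eigenvalue
        if abs(lam_new - lam) < tol:
            lam, u = lam_new, v
            break
        lam, u = lam_new, v
    return lam, u

# Well-separated spectrum: convergence is fast.
Xd = np.diag(np.array([5.0, 1.0, -2.0]))
lam, u = power_method(Xd)
```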

Computing one leading eigenvector of a symmetric matrix
We start by recalling how packages such as LAPACK [ABB + 99] form a full eigenvalue (or Schur) decomposition of a symmetric matrix X ∈ S n . The algorithm is strikingly stable and, despite its O(n^3) complexity, often competitive with more advanced techniques when the matrix X is small. We then discuss the problem of approximating one leading eigenpair of X using Krylov subspace methods, with complexity growing as O(n^2 log n) with the dimension (or less when the matrix is structured).
Full eigenvalue decomposition. Full eigenvalue decompositions are computed by first reducing the matrix X to symmetric tridiagonal form using Householder transformations, then diagonalizing the tridiagonal factor using iterative techniques such as the QR or divide and conquer methods for example (see [Ste01, Chap. 3] for an overview). The classical QR algorithm (see [GVL90, §8.3]) implicitly relied on power iterations to compute the eigenvalues and eigenvectors of a symmetric tridiagonal matrix with complexity O(n^3), however more recent methods such as the MRRR algorithm by [DP03] solve this problem with complexity O(n^2). Starting with the third version of LAPACK, the MRRR method is part of the default routine for diagonalizing a symmetric matrix and is implemented in the STEGR driver (see [DPV06]). Overall, the complexity of forming a full Schur decomposition of a symmetric matrix X ∈ S n is then 4n^3/3 flops for the Householder tridiagonalization, followed by O(n^2) flops for the Schur decomposition of the tridiagonal matrix using the MRRR algorithm.
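This pipeline is directly accessible from scipy, whose `eigh` wraps the LAPACK symmetric eigensolvers; `driver='evr'` selects the `*syevr` routine built on MRRR. A minimal sketch (assuming scipy is available):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
G = rng.standard_normal((200, 200))
X = (G + G.T) / 2  # random symmetric test matrix

# driver='evr' selects LAPACK's *syevr: Householder tridiagonalization (O(n^3))
# followed by the MRRR algorithm on the tridiagonal factor (O(n^2)).
w, V = eigh(X, driver='evr')
```

The eigenvalues are returned in ascending order and V holds the corresponding orthonormal eigenvectors, so V diag(w) V^T reconstructs X.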
Computing one leading eigenpair. We now give a brief overview of the complexity of computing leading eigenpairs using Krylov subspace methods, and refer the reader to [Ste01, §4.3], [GVL90, §8.3, §9.1.1] or [Saa92] for a more complete discussion. Let u ∈ R n be a given vector; we form the Krylov sequence u, Xu, X^2 u, . . . , X^k u by computing k matrix vector products. Let K_k(X, u) denote the subspace generated by these vectors and write X = Σ_{i=1}^n λ_i x_i x_i^T a spectral decomposition of X. Assuming, for now, that λ_1 > λ_2 ≥ . . . ≥ λ_n and setting γ = (λ_1 − λ_2)/(λ_2 − λ_n), one can show using Chebyshev polynomials (see e.g. [Ste01, §4.3.2] for details) that

tan ∠(x_1, K_k(X, u)) ≤ tan ∠(x_1, u) / c_{k−1}(1 + 2γ),

where c_j is the Chebyshev polynomial of degree j; in other words, Krylov subspaces contain excellent approximations of leading eigenpairs of X. This result is exploited by the Lanczos procedure to extract approximate eigenpairs of X called Ritz pairs (see [GVL90, Chap. 9] or [Ste01, §5.1.2] for a complete discussion). In practice, the matrix formed by the Krylov sequence is very ill-conditioned (as X^k u gets increasingly close to the leading eigenvector) so the Lanczos algorithm simultaneously updates an orthonormal basis for K_k(X, u) and a partial tridiagonalization of X. The Lanczos procedure is described in Algorithm 4 and requires k matrix vector products and an additional 4nk flops. Note that the only way in which the data in X is accessed is through the matrix vector products Xu_j.
Algorithm 4 Lanczos procedure.
Input: A matrix X ∈ S n , an initial vector u_1 ∈ R n with ‖u_1‖_2 = 1, and a number of iterations k.
1: Set u_0 = 0 and β_0 = 0.
2: for j = 1 to k do
3: Compute v = Xu_j.
4: Set α_j = u_j^T v.
5: Set v = v − α_j u_j − β_{j−1} u_{j−1}.
6: Set β_j = ‖v‖_2.
7: Set u_{j+1} = v/β_j.
8: end for
Output: A Lanczos decomposition XU_k = U_k T_k + β_k u_{k+1} e_k^T, where U_k = [u_1, . . . , u_k] ∈ R n×k is orthogonal and T_k ∈ S k is symmetric tridiagonal with diagonal α and off-diagonal β.
In theory, one could then diagonalize the matrix T_k (which costs O(k^2) as we have seen above) to produce Ritz vectors. In practice, key numerical difficulties often arise. First, finite precision arithmetic causes a significant loss of orthogonality in U_k. This is remedied by various reorthogonalization strategies (cf. [Ste01, §5.3.1]). A more serious problem is clustered or multiple eigenvalues in the spectrum periphery. In fact, it is easy to see that Krylov subspace methods cannot isolate multiple eigenvalues. Assume for example that the leading eigenvalue has multiplicity two; we then have

X^j u = λ_1^j (x_1 x_1^T + x_2 x_2^T) u + Σ_{i=3}^n λ_i^j (x_i^T u) x_i,    j = 1, . . . , k,

so every vector in the Krylov sequence contains the same single direction (x_1 x_1^T + x_2 x_2^T)u inside the leading eigenspace, and everything happens as if the eigenvalue λ_1 were simple and the matrix X had a larger nullspace. This is not a problem in our case, since we need only one eigenvector in the leading invariant subspace, not the entire eigenspace.
Clustered eigenvalues (i.e. a small gap between the leading eigenvalue and the next one, not counting multiplicities) are much more problematic. The convergence of Ritz vectors cannot be established by the classical Chebyshev bounds described above, and various references provide a more refined analysis of this scenario (see [PSS82], [VdSVdV87], [KW92] among others). Successful termination of a deterministic Lanczos method can never be guaranteed anyway, since in the extreme case where the starting vector is orthogonal to the leading eigenspace, the Krylov subspace contains no information about leading eigenpairs. In practice, Lanczos solvers use random initial points. In particular, [KW92, Th.4.2] show that, for any matrix X ∈ S n (including matrices with clustered spectrum), starting the algorithm at a random u_1 picked uniformly over the sphere means the Lanczos decomposition will produce a leading Ritz pair with relative precision ǫ in

k_Lan ≤ log(n/δ^2) / (4√ǫ)

iterations, with probability at least 1 − δ. This is of course a highly conservative bound and in particular, the worst case matrices used to prove it vary with k. This means that computing one leading eigenpair of the matrix X requires computing at most k_Lan matrix vector products (we can always restart the code in case of failure) plus 4nk_Lan flops. When the matrix is dense, each matrix vector product costs n^2 flops, hence the total cost of computing one leading eigenpair of X is

O( n^2 log(n/δ^2) / √ǫ )

flops. When the matrix is sparse, the cost of each matrix vector product is O(s) instead of O(n^2), where s is the number of nonzero coefficients in X. The same holds when the matrix X has rank r < n and an explicit factorization is known (which is the case in the algorithms detailed in the previous section): each matrix vector product then costs O(nr), i.e. the cost of two n by r matrix vector products, and the complexity of the Lanczos procedure decreases accordingly.
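The low-rank case can be illustrated with scipy's ARPACK interface, which accepts a `LinearOperator` so that the solver only ever touches X through a matvec callback. In this sketch (assuming scipy is available) X = V V^T with V of size n × r, so each Lanczos matrix-vector product costs O(nr) via two thin products instead of O(n^2):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
n, r = 2000, 5
V = rng.standard_normal((n, r))

# X = V V^T is never formed: each matvec is two n-by-r products, O(nr) flops.
Xop = LinearOperator((n, n), matvec=lambda u: V @ (V.T @ u))

# ARPACK's implicitly restarted Lanczos, via scipy's eigsh, accesses X only
# through the matvec callback above.
vals, vecs = eigsh(Xop, k=1, which='LA')
```

The nonzero eigenvalues of V V^T coincide with those of the small r × r matrix V^T V, which gives a cheap consistency check.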
The numerical package ARPACK by [LSY98] implements the Lanczos procedure with a reverse communication interface allowing the user to efficiently compute the matrix vector product Xu j . However, it uses the implicitly shifted QR method instead of the more efficient MRRR algorithm to compute the Ritz pairs of the matrix T k ∈ S k .

Other sampling techniques
For completeness, we recall below another subsampling procedure in [AM07]. More recent "volume sampling" techniques produce improved (sometimes multiplicative) error bounds, but the corresponding optimal sampling probabilities are much harder to compute; we refer the reader to [KV09] for more details. The key idea behind this result is that, as the matrix dimension n grows and given a fixed, scale invariant precision target ‖X‖_F /ǫ, the norm ‖X‖_∞ of individual coefficients in X typically becomes negligible and we can randomly discard the majority of them while keeping important spectral features of X mostly intact.
Lemma 8 Given X ∈ S n and p ∈ (0, 1], we define a subsampled matrix S whose coefficients are independently distributed as: S_ij = X_ij /p with probability p and S_ij = 0 otherwise, when i ≥ j, and S_ij = S_ji otherwise. Assume that 1 ≥ p ≥ (8 log n)^4 /n; then ‖X − S‖_2 ≤ 4‖X‖_∞ √(n/p) with high probability.
At first sight here, bounding the approximation error means letting the probability p grow relatively fast as n tends to infinity. However, because ‖X‖_∞ /ǫ is typically much smaller than ‖X‖_F /ǫ, this subsampling ratio p can often be controlled. Adaptive subsampling, i.e. letting p vary with the magnitude of the coefficients in X, can further improve these results (see [AM07, §4] for details). The average number of nonzero coefficients in the subsampled matrix can be bounded using the structure of X. Note that the constants in this result are all very large (in particular, 1 ≥ p ≥ (8 log n)^4 /n implies n ≥ 10^9) so despite its good empirical performance in low dimensions, the result presented above has to be understood in an asymptotic sense.
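The sampling scheme of Lemma 8 is simple to sketch in numpy: each entry of the upper triangle is kept (and rescaled by 1/p) with probability p, then mirrored so that S is symmetric and E[S] = X. This is an illustrative check, not the paper's implementation; consistent with the remark above about large constants, the bound typically holds with plenty of slack at moderate n.

```python
import numpy as np

def subsample(X, p, rng=np.random.default_rng(0)):
    """Entrywise subsampling as in Lemma 8 ([AM07]): keep X_ij/p with
    probability p (zero otherwise) for i <= j, then symmetrize."""
    n = X.shape[0]
    mask = rng.random((n, n)) < p
    mask = np.triu(mask)          # sample the upper triangle (incl. diagonal)
    mask = mask | mask.T          # mirror so that S = S^T
    return np.where(mask, X / p, 0.0)

rng = np.random.default_rng(1)
G = rng.standard_normal((400, 400))
X = (G + G.T) / 2
S = subsample(X, p=0.2)

err = np.linalg.norm(X - S, 2)                       # spectral approximation error
bound = 4 * np.abs(X).max() * np.sqrt(400 / 0.2)     # 4 ||X||_inf sqrt(n/p)
```

Only about 20% of the entries of S are nonzero, yet its spectral distance to X stays well below the bound of the lemma.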