DETERMINISTIC AND STOCHASTIC PRIMAL-DUAL SUBGRADIENT ALGORITHMS FOR UNIFORMLY CONVEX MINIMIZATION

We discuss non-Euclidean deterministic and stochastic algorithms for optimization problems with strongly and uniformly convex objectives. We provide accuracy bounds for the performance of these algorithms and design methods which are adaptive with respect to the parameters of strong or uniform convexity of the objective: in the case when the total number of iterations N is fixed, their accuracy coincides, up to a factor logarithmic in N, with the accuracy of optimal algorithms.

1. Introduction. Let E be a (primal) finite-dimensional real vector space. In this paper we consider the optimization problem

(1.1) min {f(x) : x ∈ Q},

where Q is a closed convex set in E and the function f with domain Q is convex and Lipschitz-continuous on Q. Let ‖·‖ be a norm on E. Recall that a function f is called uniformly convex on Q ⊂ E with convexity parameters ρ = ρ(f) ≥ 2 and µ = µ(f, ρ) ≥ 0 if for all x, y ∈ Q and any α ∈ [0, 1] we have

(1.2) f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − (µ/2) α(1 − α)[α^{ρ−1} + (1 − α)^{ρ−1}] ‖x − y‖^ρ.

A function f which is uniformly convex with ρ = 2 is called strongly convex with respect to the norm ‖·‖. Uniform convexity with 2 ≤ ρ ≤ ∞ and µ ≥ 0 implies usual convexity.
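To make the ρ = 2 case concrete, here is a small numerical check, a sketch under assumptions not taken from the text: the Euclidean norm and the model function f(x) = ‖x‖²/2, which is strongly convex with µ = 1. At ρ = 2 the modulus term of the definition reduces to (µ/2)α(1 − α)‖x − y‖², and for this particular f the defining inequality holds with equality.

```python
import numpy as np

# Sanity check of the rho = 2 (strong convexity) case of the definition
# for the assumed model function f(x) = ||x||^2 / 2, which has mu = 1
# with respect to the Euclidean norm; here the inequality is an identity.
rng = np.random.default_rng(0)

def f(x):
    return 0.5 * float(np.dot(x, x))

mu = 1.0
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    a = rng.uniform()
    lhs = f(a * x + (1.0 - a) * y)
    # rho = 2: the modulus term reduces to (mu/2) a (1-a) ||x - y||^2
    rhs = a * f(x) + (1.0 - a) * f(y) \
        - 0.5 * mu * a * (1.0 - a) * float(np.dot(x - y, x - y))
    assert lhs <= rhs + 1e-9
print("rho = 2 uniform convexity inequality verified on random samples")
```

For larger ρ the bracketed factor α^{ρ−1} + (1 − α)^{ρ−1} weakens the modulus near the endpoints α ∈ {0, 1}, which is why uniformly convex functions with ρ > 2 are "flatter" near the minimizer than strongly convex ones.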
In this paper we discuss deterministic and stochastic first order algorithms for (large scale) non-Euclidean uniformly convex objectives, thus extending non-Euclidean first order methods (see, e.g. [9,13] and references therein) to uniformly convex optimization.
Uniformly convex functions were introduced to optimization in [17] and have been extensively studied (cf. [2,3] and [20]). The worst-case complexity bounds for the problem (1.1) with exact and stochastic first order oracles are available for the case of a strongly convex objective (see, e.g., [18,1] and references therein). Specifically, for any method tuned to the absolute accuracy ǫ for the problem (1.1) with a strongly convex (with parameter µ) and Lipschitz-continuous (with unit Lipschitz constant) objective and a deterministic first order oracle, the number of calls to the oracle is not less than O(µ^{−1} ǫ^{−1}), which is much better than the corresponding bound O(ǫ^{−2}) for the larger class of Lipschitz-continuous convex functions. The corresponding bound for uniformly convex problems with convexity parameters ρ and µ is O(µ^{−2/ρ} ǫ^{−2(ρ−1)/ρ}) (for the sake of completeness we provide in appendix A the corresponding bound for the case of the Euclidean norm ‖·‖). Note that in the case of the stochastic oracle these bounds hold also for problems with smooth objective.
Note that smooth uniformly convex deterministic optimization is "covered" within the Euclidean framework: it appears that the optimal deterministic first order algorithms of Euclidean smooth uniformly convex optimization developed in [8, chapter 7] and [10, chapter 2] retain their optimality in the non-Euclidean framework. Indeed, let us consider the problem (1.1) where f is a strongly convex quadratic form f(x) = (1/2) x^T Ax − b^T x, the set Q = {x ∈ R^n : ‖x‖_1 ≤ 1}, and A is a symmetric n×n positive-definite matrix. Recall that the complexity estimate for optimal algorithms of strongly convex smooth optimization is O(√λ log ǫ^{−1}), where λ = L(f)/µ(f) is the conditioning of the objective (the ratio of the Lipschitz constant L(f) of the gradient of the objective and the parameter µ(f) of strong convexity) and ǫ is the desired absolute accuracy. Note that the Lipschitz constant L_1(f) of the gradient of f with respect to the norm ‖·‖_1 satisfies L_1(f) = ‖A‖_{1,∞} = max_{1≤i,j≤n} |A_ij|. On the other hand, one may easily verify that the corresponding parameter µ(f) of strong convexity of f is bounded from above by L_1(f) n^{−1}, resulting in a condition number λ ≥ n.¹ Now recall that the Lipschitz constant of the gradient of f, when measured with respect to the Euclidean norm, is L_2(f) = ‖A‖_{2,2} = λ_max(A), the spectral norm of A, and L_2(f) ≤ n L_1(f). In other words, in this case, when passing from the Euclidean to the non-Euclidean setup we gain nothing: the degradation of the strong convexity parameter

¹ Here is the proof of this claim: let ξ = (ξ_1, ..., ξ_n)^T be a random vector with i.i.d. components such that P(ξ_i = 1/n) = P(ξ_i = −1/n) = 1/2. Then ‖ξ‖_1 = 1 and E[ξ^T Aξ] = n^{−2} Σ_i A_ii ≤ L_1(f)/n, so there is a realization of ξ with ξ^T Aξ ≤ L_1(f)/n. Observe that the bound 1/n is attained for the identity matrix A.
in the ‖·‖_1-setup outweighs the potential improvement of the conditioning due to the reduced Lipschitz constant in the ‖·‖_1-setup. On the other hand, although optimal algorithms for optimization with strongly convex Lipschitz-continuous objectives in the Euclidean framework are readily available (see, e.g., [15,18]), they cannot be directly transposed to the non-Euclidean framework.
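The footnote's argument is easy to check numerically. The sketch below takes A = I, the case where the bound 1/n is attained, and exhibits a direction ξ with ‖ξ‖_1 = 1 along which the curvature of f(x) = ½x^T Ax equals L_1(f)/n, so the strong convexity parameter of f with respect to ‖·‖_1 cannot exceed L_1(f)/n.

```python
import numpy as np

n = 100
A = np.eye(n)                    # identity matrix: the case where 1/n is attained
L1 = np.abs(A).max()             # L_1(f) = max_{i,j} |A_ij| = 1

# a realization of the random vector from the footnote: all entries +-1/n
xi = np.full(n, 1.0 / n)
assert np.isclose(np.abs(xi).sum(), 1.0)   # ||xi||_1 = 1

curvature = float(xi @ A @ xi)   # quadratic form along xi
# mu_{l1}(f) <= xi^T A xi / ||xi||_1^2 = L1(f)/n, so the condition number
# lambda = L1(f)/mu(f) is at least n in the l_1 setup
print(curvature, L1 / n)
```

Since the curvature along this unit-ℓ1-norm direction matches L_1(f)/n exactly, no strong convexity parameter larger than L_1(f)/n is possible in the ℓ1 geometry.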
The results presented in this paper are not very new, as they were developed by the authors in 2004–2005. However, because of the lack of immediate applications and, more importantly, because of the new first order methods based on smoothing of structured problems, with better complexity characteristics, which were developed in [11,12] at that time, the authors got the impression that new non-Euclidean algorithms of black-box (non-structured) uniformly convex optimization were of very limited interest. However, certain developments of recent years have clearly demonstrated that in some situations black-box methods are irreplaceable. Indeed, exact first order oracles are often unavailable, or the structure of a problem may simply be too complex for applying a smoothing technique. In particular, deterministic and stochastic non-Euclidean first order methods of convex optimization have attracted much attention lately, in relation, in particular, with very large scale applications arising in statistics and learning. For instance, some new applications involving large scale strongly convex optimization have recently been reported (see, e.g., [7,19,6]). These considerations encouraged the authors to publish the above mentioned results on subgradient methods for uniformly convex problems.
In this paper we develop minimax optimal primal-dual minimization schemes in the spirit of [13] for uniformly convex problems as in (1.1) with Lipschitz-continuous objective. We also study the performance of multistage dual averaging procedures when applied to uniformly convex stochastic minimization problems. In particular, we show that such procedures attain the minimax rates of convergence on the considered problem class. We also provide confidence sets for approximate solutions of stochastic uniformly convex problems.
It is well known that the performance of "classical" optimization routines for strongly (and uniformly) convex problems can become very poor when the parameters of strong (uniform) convexity are not known a priori (see, e.g., section 2.1 in [9]). For both deterministic and stochastic optimization we develop adaptive minimization procedures for the case when the total number N of method iterations is fixed. The accuracy of these procedures (which do not require a priori knowledge of the parameters of uniform convexity) coincides, up to a factor logarithmic in N, with the accuracy of optimal algorithms (which "know" the exact parameters). It is worth noting that we do not know whether it is possible to construct adaptive optimization procedures tuned to a fixed accuracy with analogous properties.
The paper is organized as follows: in section 2 we define the basic ingredients of the minimization problem in question. Then we study the properties of the primal-dual subgradient algorithms for the problem with an exact deterministic oracle in section 3 and show how the dual solutions can be produced in section 4. In section 5 we develop optimal algorithms for stochastic uniformly convex optimization and show how confidence sets for approximate solutions can be constructed. Section 6 contains some details of computational aspects of the proposed routines. Finally, in appendix A we present the lower complexity bound for a class of optimization problems with uniformly convex and Lipschitz-continuous objectives; appendix B contains the proofs of the statements of the paper.

2. Problem statement and basic assumptions.
2.1. Notations and generalities. Let E* be the dual of E. We denote the value of a linear function s ∈ E* at x ∈ E by ⟨s, x⟩. For measuring distances in E, let us fix some (primal) norm ‖·‖. This norm defines a primal unit ball B = {x ∈ E : ‖x‖ ≤ 1}. The dual norm ‖·‖* on E* is introduced, as usual, by ‖s‖* = max {⟨s, x⟩ : x ∈ B}. For other balls in E we adopt the notation B_R(x) = {y ∈ E : ‖y − x‖ ≤ R}. If a function f which is uniformly convex on Q is subdifferentiable at x ∈ Q, then for any y ∈ Q

(2.2) f(y) ≥ f(x) + ⟨f′(x), y − x⟩ + (µ/2)‖y − x‖^ρ.²

² Note that the relationship (2.2) is sometimes used as the definition of a uniformly convex function (see, e.g., [16]). However, (2.2) does not imply (2.1) and (1.2) but, instead of (2.1), it leads, for instance, to a different value of the modulus of uniform convexity. Of course, in the strongly convex case we have ρ = 2 and both definitions lead to the same value of the modulus of strong convexity.
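As an illustration of the dual norm, take the primal norm to be ‖·‖_1: the linear function ⟨s, ·⟩ is maximized over the l_1 unit ball at one of its vertices ±e_i, so ‖s‖* = max_i |s_i|, the l_∞ norm. The sketch below checks this against a crude sampling of the l_1 sphere (the specific vector s is of course just an example).

```python
import numpy as np

# For the primal norm ||.||_1 the dual norm ||s||_* = max{<s,x> : ||x||_1 <= 1}
# is the l_inf norm: a linear function attains its maximum over the l_1 unit
# ball at a vertex +-e_i.
s = np.array([3.0, -7.0, 2.0])
dual_norm = np.abs(s).max()            # best value of <s, +-e_i>

# no point of the l_1 sphere can do better than the best vertex
rng = np.random.default_rng(0)
z = rng.normal(size=(10000, s.size))
ball_pts = z / np.abs(z).sum(axis=1, keepdims=True)   # points with ||x||_1 = 1
assert (ball_pts @ s).max() <= dual_norm + 1e-12
print(dual_norm)  # 7.0
```

The same vertex argument is what makes the l_1 geometry attractive for the sparse setups of Example 2 below.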

2.2. Problem statement.
We consider the optimization problem (1.1) with a uniformly convex function f with convexity parameters ρ(f) and µ(f). The basic assumption we make about the objective, which is supposed to hold throughout the paper, is that f is Lipschitz-continuous on Q:

|f(x) − f(y)| ≤ L‖x − y‖ for all x, y ∈ Q.

We assume that f is subdifferentiable at any x ∈ Q; moreover, ‖f′(x)‖* ≤ L for all x ∈ Q. We are to study the performance of iterative minimization schemes, and we consider two settings which differ with respect to the information available to the method at each iteration:
- deterministic setting: let x_k be the search point at iteration k, k = 0, 1, .... We suppose that the exact subgradient observation g_k = f′(x_k) and the exact objective value f(x_k) are available;
- stochastic setting: the observation g_k of the subgradient f′(x_k), requested by the method at the k-th iteration, is supplied by a stochastic oracle, i.e. g_k is a random vector.
To be more precise, suppose that we are given a probability space (Ω, F, P) and a filtration (F_k), k = −1, 0, 1, ... (a non-decreasing family of σ-algebras satisfying the "usual" conditions). Let (ς_k) be a sequence of random vectors such that:
• ς_k is F_k-measurable;
• x_k is the k-th search point generated by the method. We suppose that x_k is F_{k−1}-measurable (indeed, x_k is a measurable function of x_0 and the observations g_1, ..., g_{k−1} at iterations 1, ..., k − 1).
We also consider the following assumptions, specific to the stochastic problem:
Assumption 2.2. The oracle is unbiased. Namely, E_{k−1} g_k = f′(x_k), so that the oracle error ξ_k = g_k − f′(x_k) satisfies E_{k−1} ξ_k = 0 and

(2.3) E_{k−1} ‖ξ_k‖*² ≤ σ².

Here E_k stands for the expectation conditioned on F_k (so that E = E_{−1} is the "full" expectation).
We will also use a stronger bound on the tails of the distribution of (ξ_k):

(2.4) E_{k−1} exp(‖ξ_k‖*²/σ²) ≤ exp(1).

Note that by the Jensen inequality (2.4) implies (2.3).

2.3. Prox-function of the unit ball. Assume that we know a prox-function d(x) of the ball B. This means that d is continuous and strongly convex on B, in terms of (1.2), with some convexity parameter µ(d) > 0. Moreover, we assume that min_{x∈B} d(x) = 0; we denote by x_d a corresponding minimizer. Hence, in view of (2.1), we have d(x) ≥ (µ(d)/2)‖x − x_d‖² for all x ∈ B. An important characteristic of the prox-function is its maximal value on the unit ball, A(d) = max_{x∈B} d(x). If the function d grows quadratically, another important characteristic is its constant of quadratic growth C(d), which we define as the smallest C such that

(2.7) d(x) ≤ C‖x − x_d‖² for all x ∈ B.

Example 1. Let E = R^n and let B be the unit Euclidean ball in R^n. We choose the norm ‖·‖ to be the Euclidean norm on R^n, so that the function d(x) = ‖x‖²_2/2 is strongly convex with µ(d) = 1 and C(d) = A(d) = 1/2.

Example 2. Let again E = R^n and let B be the standard hyperoctahedron in R^n, i.e. the unit l_1-ball B = {x ∈ R^n : ‖x‖_1 ≤ 1}. We take ‖x‖ = ‖x‖_1 and consider for p > 1 the function d(x) = ‖x‖²_p/2. The function d is strongly convex with µ(d) = O(1)(p − 1)n^{−2(p−1)/p}, and for p = 1 + 1/ln n we have µ(d) = O(1)(ln n)^{−1} (see, e.g., [8]). Further, we clearly have A(d) = C(d) = 1/2. Note that norm-type prox-functions are not the only ones possible in the hyperoctahedron setting. Another example of a prox-function of the l_1 unit ball B, which is very interesting from the computational point of view, is the function (2.8). In order to show that this function is strongly convex on the standard hyperoctahedron B = {x ∈ R^n : ‖x‖_1 ≤ 1}, we need the following general result.
Lemma 2.1. Let Q be a bounded closed convex set in E containing the origin. If a function f(x) is strongly convex on Q with parameter µ ≥ 0, then its symmetrization f_0 is strongly convex as well.

Note that d does not satisfy the quadratic growth condition (2.7). For z ∈ Q, consider the set Q_R(z) = {x ∈ Q : ‖x − z‖ ≤ R}. This set can be equipped with the prox-function d_{z,R}(x) = d((x − z)/R); thus, the prox-center of the set Q_R(z) is z. In what follows we need two objects: the function

(2.9) V_{z,R,β}(s) = max {⟨s, x − z⟩ − β d_{z,R}(x) : x ∈ Q_R(z)}, β > 0,

and the prox-mapping π_{z,R,β}(s), the maximizer in (2.9). Note that dom V_{z,R,β} = E*. Let us mention some properties of the function V_{z,R,β} (cf. Lemma 1 of [13]):
• the function V_{z,R,β} is convex and differentiable on E*; moreover, its gradient is Lipschitz-continuous with the constant R²/(βµ(d));
• for any s, δ ∈ E*, V_{z,R,β}(s + δ) ≤ V_{z,R,β}(s) + ⟨δ, ∇V_{z,R,β}(s)⟩ + R²‖δ‖*²/(2βµ(d)).

3. Deterministic methods for uniformly convex functions. We start with the description of the basic tool, the dual averaging procedure, which originates in [13].
3.1. Method of dual averaging. At each stage the dual averaging (DA) method will be applied to the following auxiliary problem:

(3.1) min {f(x) : x ∈ Q_R(x̄)}.

Its feasible set is endowed with the prox-function d_{x̄,R}(x) = d((x − x̄)/R). Consider now the generic scheme of dual averaging as applied to the problem (3.1).
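Before stating the guarantees, the generic DA iteration can be sketched in the simplest special case: the Euclidean setup of Example 1, where d(x) = ‖x‖²/2 and the prox-mapping reduces to a projection. The step-control choice β_k = LR√(2(k + 1)) below is an assumption of the right order, yielding O(LR/√N) accuracy for the averaged point, and is not the paper's exact tuning.

```python
import numpy as np

def proj_ball(y, center, R):
    # Euclidean projection onto the ball B_R(center)
    d = y - center
    nrm = np.linalg.norm(d)
    return y if nrm <= R else center + (R / nrm) * d

def dual_averaging(subgrad, x0, R, L, N):
    # Generic DA step: accumulate subgradients, then apply the prox-mapping
    # to the accumulated sum.  With d(x) = ||x||^2/2 the prox-mapping is a
    # projection; beta_k = L*R*sqrt(2(k+1)) is an assumed tuning.
    s = np.zeros_like(x0)
    x = x0.copy()
    avg = np.zeros_like(x0)
    for k in range(N):
        s = s + subgrad(x)
        beta = L * R * np.sqrt(2.0 * (k + 1.0))
        x = proj_ball(x0 - (R ** 2 / beta) * s, x0, R)
        avg += x
    return avg / N          # uniform averaging: one standard resulting point

# demo: minimize the Lipschitz (L = 1) function f(x) = ||x - c||_2 over B_2(0)
c = np.array([0.5, -0.3])

def subgrad(x):
    r = x - c
    n = np.linalg.norm(r)
    return r / n if n > 1e-12 else np.zeros_like(x)

xN = dual_averaging(subgrad, np.zeros(2), R=2.0, L=1.0, N=2000)
gap = np.linalg.norm(xN - c)   # equals f(xN) - f*, of order L R / sqrt(N)
print(gap)
```

The same two-line iteration, with the Euclidean projection replaced by the prox-mapping π of (2.9), is the engine of all the multistage schemes below.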
The process is terminated after N iterations. The resulting point x̄_N(x̄, R) is defined as a weighted average of the search points. The result below underlies the subsequent developments (cf. Theorem 1 of [13]); in view of (3.2) we have the following lemma. Under the premises of the lemma we can establish the following immediate bounds:
Corollary 3.1. Let x* be an optimal solution of (3.1). Then for the choice of parameters indicated above we have the estimates:

3.2. Multi-step algorithms. Now we are ready to analyze multistage procedures for uniformly convex functions. In this section we assume that the constants L, µ(f), ρ and R_0 ≥ ‖x* − x_0‖ are known. Let us fix ǫ > 0 and let x_0 be an arbitrary element of Q.
Note that the parameters of the algorithm satisfy the relations (3.6).
Theorem 3.1. The points {y_k}, k = 1, ..., m, generated by Algorithm 3.2 satisfy the conditions (3.7) and (3.8). (Here ⌊a⌋ stands for the largest integer strictly smaller than a.)
Moreover, f(x̂_ǫ(y_0, R_0)) − f* ≤ ǫ, and the total number N(ǫ) of iterations of the scheme does not exceed the bound (3.9). An important particular case of Theorem 3.1 is the case of a strongly convex objective f. In the latter case τ = 1, and the analytical complexity of Algorithm 3.2 does not exceed O(L²/(µ(f)ǫ)). The method can easily be rewritten for the case when the total number N of calls to the oracle is fixed a priori.
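For the strongly convex case just mentioned, the restart logic of the multistage scheme can be sketched as follows, again in the simplest Euclidean setup. The stage lengths N_k = ⌈(12L/(µR_k))²⌉ and the halving of the localization radius are assumptions patterned on Algorithm 3.2 with ρ = 2, not the paper's exact parameter choices.

```python
import numpy as np

def proj_ball(y, center, R):
    d = y - center
    nrm = np.linalg.norm(d)
    return y if nrm <= R else center + (R / nrm) * d

def da_stage(subgrad, center, R, L, N):
    # one stage of Euclidean dual averaging restricted to B_R(center)
    s = np.zeros_like(center)
    x = center.copy()
    avg = np.zeros_like(center)
    for k in range(N):
        s = s + subgrad(x)
        beta = L * R * np.sqrt(2.0 * (k + 1.0))   # assumed O(LR/sqrt(N)) tuning
        x = proj_ball(center - (R ** 2 / beta) * s, center, R)
        avg += x
    return avg / N

def multistage(subgrad, x0, R0, L, mu, stages):
    # Restart scheme in the spirit of Algorithm 3.2 for rho = 2: the stage
    # length is chosen so that strong convexity lets us halve the radius
    # of the localization ball after each stage (the constant 12 is an
    # assumption, not the paper's choice).
    y, R = x0.copy(), R0
    for _ in range(stages):
        N = int(np.ceil((12.0 * L / (mu * R)) ** 2))
        y = da_stage(subgrad, y, R, L, N)
        R = R / 2.0
    return y

# demo: f(x) = 0.5 ||x - c||^2 is strongly convex with mu = 1, and on
# B_2(0) its gradient norm is bounded by L = 2.5
c = np.array([0.4, -0.2])
y = multistage(lambda x: x - c, np.zeros(2), R0=2.0, L=2.5, mu=1.0, stages=5)
err = np.linalg.norm(y - c)
print(err)  # shrinks geometrically with the number of stages
```

Each stage only needs enough iterations to halve the distance to the minimizer, which is what converts the O(1/√N) single-stage rate into the O(1/(µN)) multistage rate for ρ = 2.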
and output the approximate solution x̂ = x̄_N(x̄, R_0). If N ≥ N̄, use the following procedure.
Corollary 3.2. We have the bound (3.10).

3.3. Methods with quadratically growing prox-function. We propose here a slightly different version of the multi-stage procedures for the case when the prox-function satisfies the condition (2.7) of quadratic growth.
The result below is an immediate consequence of Proposition 3.1 (cf. Lemma 3.1 and Corollary 3.1):
Corollary 3.3. Let x* be an optimal solution of (3.1). Suppose that the prox-function d satisfies (2.7) and that ‖x̄ − x*‖ ≤ r ≤ R. Then the approximate solution x̄_N(x̄, R), provided by Algorithm 3.1 with the above choice of parameters, satisfies the bounds (3.11) and (3.12).
Indeed, to show (3.11) and (3.12) it suffices to use (3.3) and to observe that they follow due to (2.7). The following multi-stage scheme exploits the "scalability property" (2.7) of the prox-function d. It starts from an arbitrary x_0 ∈ Q. As in the previous section, we assume that the constants L, µ(f) and the diameter R_0 of Q are known.
Output: set the approximate solution x̂_ǫ = y_m.
We would like to stress the difference between Algorithms 3.2 and 3.4: in Algorithm 3.4 the dilation parameter R = R_0 of the prox-function d remains the same through all the stages of the method. Only the gain γ_k and the duration N_k of the stage depend on the stage index k. As a result, the prox-mapping π_{z,R,β} is easier to compute. Further, as we will see in section 5.1, it also allows a straightforward modification in the case of a stochastic oracle.
We have the following analogue of Theorem 3.1 in this case:
Theorem 3.2. The approximate solution x̄_N, provided by Algorithm 3.4, satisfies:
The method can be rewritten for the case when the total number N of calls to the oracle is fixed.
Termination: set the approximate solution x̄_N = y_{m(N)}.
The proof of the corollary is completely analogous to that of Corollary 3.2.
3.4. Adaptive algorithm. Consider the setting in which the total number N of calls to the oracle is fixed, and suppose that the convexity parameters ρ and µ(f) are unknown. We propose a multi-stage procedure which does not require knowledge of these parameters and attains, up to a factor logarithmic in N, the accuracy of the method which "knows" the convexity parameters. Following the terminology used in the statistics and control literature, we call such procedures adaptive (with respect to the unknown parameters). In what follows we suppose that the bounds L and R_0 are known a priori.
We analyze here the following adaptive version of Algorithm 3.3 (we leave the construction and analysis of an adaptive version of Algorithm 3.5 as an exercise for the reader). Here ⌊a⌋ stands for the largest integer less than or equal to a.

4. Generating dual solutions.
In order to speak about primal-dual solutions, we need to fix somehow the structure of the objective function in problem (1.1). Let us assume that

f(x) = max {Ψ(x, w) : w ∈ S},

where S is a closed convex set, and the function Ψ is convex in its first argument x ∈ Q and concave in its second argument w ∈ S. Let us assume that Ψ is subdifferentiable in x at any (x, w) ∈ Q × S; then we can take f′(x) ∈ ∂_x Ψ(x, w(x)), where w(x) is a maximizer above. Thus, we can define the dual function η(w) = min_{x∈Q} Ψ(x, w) and the dual maximization problem max {η(w) : w ∈ S}. For any w ∈ S, we assume that Ψ(·, w) is uniformly convex on Q with convexity parameters ρ = ρ(Ψ) and µ = µ(Ψ).
Clearly, Ψ(x, w) is convex in x and concave in w; furthermore, the corresponding relations hold when the objective f is strongly convex (ρ = 2).

5. Stochastic programming with uniformly convex objective. In order to rewrite the results of section 3 in the stochastic framework, we substitute the observation g_k = g(x_k, ς_k) for f′(x_k) in the iteration of Algorithm 3.1. The following statement is a stochastic counterpart of Proposition 3.1:
Proposition 5.1. Let x_k, k = 0, 1, ..., be the search points of Algorithm 3.1 with g_k substituted for f′(x_k). Then for any x ∈ Q ∩ B_R(x̄), the bound (5.1) holds.
In this section we propose two families of multi-stage methods for the uniformly convex stochastic programming problem described in section 2.2. The first one is based on the dual averaging scheme with a prox-function which satisfies the condition (2.7) of quadratic growth. As we have already mentioned, one can easily obtain bounds for the expected value of the objective at the approximate solution generated by the stochastic counterparts of Algorithms 3.4 and 3.5. On the other hand, the methods derived from those presented in section 3.2 better suit the case when confidence bounds on the error of the approximate solutions are required.
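The substitution of the noisy observation for the exact subgradient is mechanical. The sketch below shows it in the Euclidean setup, with an assumed unbiased Gaussian oracle g(x, ς) = f′(x) + σς and a step control of order √(L² + σ²)·R/√N, mirroring the replacement of L² by L² + σ² that appears in the expectation bounds below; none of the specific constants are the paper's.

```python
import numpy as np

def proj_ball(y, center, R):
    d = y - center
    nrm = np.linalg.norm(d)
    return y if nrm <= R else center + (R / nrm) * d

def stochastic_da(noisy_subgrad, x0, R, M, N, rng):
    # Stochastic counterpart of the DA stage: the exact subgradient is
    # replaced by an unbiased observation g_k; M plays the role of
    # sqrt(L^2 + sigma^2) in the (assumed) step control.
    s = np.zeros_like(x0)
    x = x0.copy()
    avg = np.zeros_like(x0)
    for k in range(N):
        s = s + noisy_subgrad(x, rng)
        beta = M * R * np.sqrt(2.0 * (k + 1.0))
        x = proj_ball(x0 - (R ** 2 / beta) * s, x0, R)
        avg += x
    return avg / N

rng = np.random.default_rng(1)
c = np.array([0.4, -0.2])
sigma = 0.5

def noisy_grad(x, rng):
    # unbiased oracle: exact gradient of 0.5 ||x - c||^2 plus Gaussian noise
    return (x - c) + sigma * rng.normal(size=x.size)

x_hat = stochastic_da(noisy_grad, np.zeros(2), R=2.0,
                      M=np.sqrt(2.5 ** 2 + sigma ** 2), N=20000, rng=rng)
print(np.linalg.norm(x_hat - c))
```

Because the oracle is unbiased, the noise averages out in the accumulated sum s, and the averaged point retains the O(√(L² + σ²) R/√N) accuracy in expectation.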

5.1. Expectation bounds for methods with prox-function of quadratic growth.
Taking the expectation with respect to the distribution of ξ_i, we obtain a simple counterpart of Lemma 3.1 and, with the corresponding choice of parameters, the following counterpart of Corollary 3.3 (Corollary 5.1). When comparing the above statement to the result of Corollary 3.3, we observe that the only difference between the two is that in Corollary 5.1 the quantity L² is replaced with L² + σ². When modifying the parameters of Algorithm 3.5 in the same way, we obtain a multistage procedure for the stochastic problem.
Assume that the parameters L, ρ, µ(f ) and the diameter R 0 of Q are known. The method starts from an arbitrary x 0 ∈ Q.
Output: set the approximate solution x̂_ǫ = y_m.
We have the following stochastic analogue of Theorem 3.2:
Theorem 5.1. The approximate solution x̄_N, provided by Algorithm 5.1, satisfies:
The proof of the theorem follows the lines of that of Theorem 3.2; it suffices to substitute the bounds (5.3) and (5.4) for (3.11) and (3.12). We leave this simple exercise to the reader.
The method can be rewritten for the case when the total number N of calls to the oracle is fixed.
Exactly as in the deterministic setting, we can provide an adaptive version of the method. To this end, the adaptive method of Algorithm 3.6 for the deterministic problem should be slightly modified: we have to change the way the approximate solution x̄_N is formed, as exact observations of the objective function are no longer available. Fortunately, we can take as the output of the algorithm the approximate solution y_m generated at the last stage.
Consider the following procedure.
Termination: set the approximate solution x̄_N = y_m.
Theorem 5.2. The approximate solution x̄_N, supplied by Algorithm 5.3, satisfies for N > 4:

5.2. Confidence sets for uniformly convex stochastic programs. In this section we establish confidence bounds for the approximate solutions delivered by the multistage stochastic algorithms. Consider the dual averaging Algorithm 3.1 in which we substitute for the exact subgradient the observation g_k = h(x_k) + ξ_k, where h(x_k) = E_{k−1} g_k ∈ ∂f(x_k). Let δ_N(x̄, R) be the gap value defined in (3.4).
Then for all α ≥ 0, the approximate solution x̄_N(x̄, R) of Algorithm 3.1 satisfies the bound of Corollary 5.3. Corollary 5.3 allows us to compute confidence sets for the approximate solutions provided by stochastic analogues of Algorithms 3.2 and 3.3, exactly in the same way as it was done in section 3.2. For the sake of conciseness we present here only the result for the setting when the total number N of subgradient observations is fixed and the convexity parameters of the objective are unknown.
where ǫ(N, α) is the corresponding accuracy bound.

6. Computational issues. The practical interest of the proposed algorithmic schemes depends on our ability to compute efficiently the optimal solution π_{z,R,β}(s) of the optimization problem (2.9). We present here two important examples in which the problem (2.9) can be solved quite efficiently: the standard simplex and the hyperoctahedron settings.
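Before detailing the two setups, it may help to see the simplest instance of such a computation. With an entropy prox-function on the full standard simplex, an assumed stand-in for the function (2.8) rather than the paper's exact construction, the minimizer has a softmax closed form; the truncated sets Q_R(z) treated below require the dual decomposition instead.

```python
import numpy as np

def simplex_prox_entropy(s, beta):
    # argmin over the standard simplex of <s, x> + beta * sum_i x_i ln x_i.
    # The entropy is an assumed stand-in for the prox-function (2.8).
    # Stationarity s_i + beta (ln x_i + 1) = const gives x_i ~ exp(-s_i / beta),
    # i.e. a softmax.
    w = -np.asarray(s, dtype=float) / beta
    w -= w.max()                 # stabilize the exponentials numerically
    e = np.exp(w)
    return e / e.sum()

s = np.array([1.0, 0.0, -1.0])
x = simplex_prox_entropy(s, beta=0.7)
assert np.isclose(x.sum(), 1.0) and (x >= 0).all()
print(x.argmax())  # 2: the mass concentrates where s_i is smallest
```

The closed form is what makes entropy-type prox-functions computationally attractive; the constraint ‖x − z‖_1 ≤ R in Q_R(z) destroys it, which is exactly why the dualization below is needed.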
Let us measure distances in E = R^n in the l_1-norm.

6.1. Simplex setup. Let n ≥ 2 and let Q = {x ∈ R^n : x ≥ 0, Σ_{i=1}^n x_i = 1} be the standard simplex. We are to show how the problem (2.9) can be solved in this case. The problem (2.9) on Q_R(z) for the function d as in (2.8) can be written out explicitly. Eliminating the "x" variable and dualizing the coupling constraints, we obtain the equivalent problem max_{λ,µ} L(λ, µ). The dual problem (6.2) can be solved using a conventional method of convex optimization (ellipsoid or level method), given the solution of the inner problem. Note that the latter problem can be decomposed into n two-dimensional problems. One way to compute the minimizer is to compute the solution (ū, v̄) of the problem without the inequality constraint and to check the corresponding subgradient condition. If it holds, we take ū, v̄ as the minimizers; if not, the inequality constraint is not active at the optimal solution of (6.2), and we take the alternative solution.

6.2. Hyperoctahedron setup. Let now Q be the standard hyperoctahedron Q = {x ∈ R^n : ‖x‖_1 ≤ 1}. Let us see how the solution to (2.9) can be computed in this case.
The problem (2.9) on Q_R(z) can be rewritten in a similar form. Dualizing the coupling constraints, we come to max_{λ,µ} L(λ, µ) ≡ min_{u,v,w,y} L(u, v, w, y, λ, µ). The computation of the dual function L(λ, µ) boils down to evaluating the solutions of n subproblems of the form (6.3). It is obvious that either w or y vanishes, and to find the solution to (6.3) it suffices to compare the optimal values of two problems of the same form as (6.2) in the previous section.
Acknowledgements. The authors would like to acknowledge insightful and motivating comments of Prof. Peter Glynn, which were extremely helpful to them upon completion of this paper.

APPENDIX A: LOWER COMPLEXITY BOUND FOR UNIFORMLY CONVEX OPTIMIZATION
For the sake of simplicity we consider here the minimization problem over a domain Q which is a Euclidean ball, Q = {x ∈ R^n : ‖x‖_2 ≤ R}. The lower bound below can be reproduced for domains of different geometry by following the construction in [8, chapter 3]. Let F_R(L, ρ) be the class of Lipschitz-continuous and uniformly convex functions on Q, with Lipschitz constant L and parameters of uniform convexity ρ and µ(ρ) = 1, when measured with respect to the Euclidean norm. Note that each problem (A.1) from the class is solvable; we denote by f* the corresponding optimal value.
We equip F_R(L, ρ) with a first order oracle and define the analytical complexity A(ǫ) of the class in the standard way, where the (analytical) complexity A(ǫ, M) of a method M is the minimal number of oracle calls (steps of M) required by M to solve any problem of the class F_R(L, ρ) to absolute accuracy ǫ, i.e. to find an approximate solution x̂ such that f(x̂) − f* ≤ ǫ.
Then the analytical complexity A(ǫ) of the class F R (L, ρ) admits the lower bound: (here ⌊·⌋ stands for the integer part).
Proof. The proof of the lower bound reproduces the standard reasoning of [8, chapter 3]. It suffices to prove that if ǫ ∈ (0, 1) is such that the condition (A.2) holds, then the complexity A(ǫ) is at least M. Assume that this is not the case, so that there exists a method M which solves all problems from the family in question in no more than M − 1 steps. We may assume that M makes exactly M steps on any problem and that the result is always the last search point. Let us define δ accordingly, so that δ > 0 by the definition of M. Now for λ > 0 consider the family F_0 comprised of functions parameterized by ξ_i ∈ {±1} and 0 < d_i < δ, i = 1, ..., M. Note that all functions of the family are well-defined, since M ≤ n. Furthermore, by (A.2) f is Lipschitz-continuous with Lipschitz constant ≤ L, and by Lemma 4 of [16] the function 2^{ρ−3}‖x‖_2^ρ is uniformly convex with parameters ρ and µ = 1; thus the functions f are uniformly convex with parameters ρ and µ(f) = 1.
Let us consider the following construction. Let x_1 be the first search point generated by M; this point is instance-independent. Let i_1 be the index of the largest in absolute value coordinate of x_1. We set ξ*_{i_1} to be the sign of this coordinate and put d*_{i_1} = δ/2. It is clear that all the functions of the family F_1 possess the same local behavior at x_1 and are positive at this point. Now, at step k + 1, let i_{k+1} be the index of the largest in absolute value coordinate of x_{k+1} among the indices different from i_1, ..., i_k. We define ξ*_{i_{k+1}} as the sign of this coordinate, put d*_{i_{k+1}} = 2^{−(k+1)}δ, and define F_{k+1} as the set of those functions from F_k for which ξ_{i_{k+1}} = ξ*_{i_{k+1}} and d_{i_{k+1}} = d*_{i_{k+1}}. It is immediately seen that the family F_{k+1} satisfies the predicate
P_k: the first k + 1 points x_1, ..., x_{k+1} of the trajectory of M as applied to a function from the family do not depend on the function, and all the functions from the family coincide with each other in a certain neighborhood of the (k + 1)-point set {x_1, ..., x_{k+1}} and are positive on this set.
Observe that after M steps we end up with the family F_M, which consists of exactly one function f such that f is positive along the sequence x_1, ..., x_M of search points generated by M as applied to this function. Consider now the corresponding comparison point (here e_i stands for the i-th basis orth of R^n, and the inequality follows from (A.3)). In the case λ = R/√M we have the analogous bound. Thus, in both cases f(x_M) − f* > ǫ. Since, by construction, x_M is the result obtained by M as applied to f, we conclude that M does not solve the problem f to accuracy ǫ, which is the desired contradiction with the definition of M.

APPENDIX B: PROOFS
B.1. Proof of Lemma 2.1. Consider two points x_i ∈ Q_0, i = 1, 2, with the corresponding representations, and let us choose an arbitrary α ∈ [0, 1]. Note that u_i = α_i ū_i and v_i = (1 − α_i) v̄_i for some ū_i and v̄_i from Q, i = 1, 2. Therefore, with the corresponding choice of γ, we obtain, with some ū_3 and v̄_3 from Q, that u_3 = γ ū_3 ∈ γQ and v_3 = (1 − γ) v̄_3 ∈ (1 − γ)Q. Consequently, by the definition of the function f_0 and using the inclusions u_i, v_i ∈ Q, i = 1, 2, we obtain the required inequality, and the claim follows.

B.2. Proof of Lemma 3.1. In view of the conditions of the lemma, x* ∈ Q_R(x̄). From the assumptions on the function f we conclude the corresponding chain of inequalities. It remains to note that d_{x̄,R}(x) ≤ A(d) for any x ∈ Q_R(x̄) and to use the inequality (3.3).
B.3. Proof of Theorem 3.1. Indeed, for k = 0, (3.7) is valid. Assume it is valid for some k ≥ 0. Then, in view of Proposition 3.1 and Corollary 3.1, we obtain (3.8) for the next value of the iteration counter; further, this yields (3.7) for k + 1. Finally, at the end of the m-th stage, in view of Lemma 3.1 and (3.8), we obtain the required bound. To conclude (3.9) it suffices to notice that 2^{m+1} can be bounded using (3.6).

B.4. Proof of Corollary 3.2.
In the case N ≤ N̄ the corollary follows from the bound of Corollary 3.1 for the one-stage method. When N ≥ N̄, following the steps of the proof of Theorem 3.1, we conclude that the corresponding recursion holds. Now it suffices to notice that the number m(N) of stages of the algorithm can easily be bounded, and the bound (3.10) follows.
B.5. Proof of Theorem 3.2. As in the proof of Theorem 3.1, the result of the theorem follows immediately from the relations (B.1) and (B.2). Indeed, using the relations above, we obtain the required chain of inequalities. Let us verify the bounds (B.1) and (B.2). Assume that (B.1) is valid for some k ≥ 0.
Therefore, in view of Corollary 5.1, we obtain the desired bound.

B.6. Proof of Theorem 3.3. Note that by (2.6) m satisfies the corresponding relation, which implies the statement of the theorem in this case. Next, let us set µ_k = 2^{(ρ−1)k} µ_0, k = 1, ..., m. Observe that from the available information we can derive an upper bound on the unknown parameter µ(f). Suppose now that the true µ(f) satisfies µ_0 ≤ µ(f) ≤ µ_m. We need the following auxiliary result.
The points {y_k}, k = 1, ..., m, generated by Algorithm 3.6 satisfy the relations (B.5) and (B.6).
Proof. Let us prove first (B.5) and (B.6). Indeed, for k = 1, (B.5) is valid. Assume it is valid for some k ≥ 1. Then we obtain (B.6); moreover, this gives (B.5) for the next index value. Further, as in (B.8), for k > k* we have the corresponding bound. This proves the lemma.
Now we can finish the proof of the theorem. Recall that µ_0 ≤ µ(f) ≤ µ_m. At the end of the k*-th stage we obtain the required bound.

B.7. Proof of Theorem 4.1. The following result is quite standard (cf. Lemma 3 of [14]).
Proof. Since Ψ is convex in the first argument, for any x ∈ Q we have the corresponding inequality. Let us now prove several auxiliary results. Let l(x) be an affine function on E, let us fix a point ȳ ∈ Q, and consider the function ψ(r). Note that ψ(r) is an increasing concave function of r and ψ(r) ≥ ψ(0) = l(ȳ).
Let us fix some r̄ > 0 and choose an arbitrary x̄ ∈ Q_r̄(ȳ). For some µ > 0 define λ*_µ(x̄); we need to bound the value λ*_µ(x̄) from above.
Then for every Λ ≥ 0 the corresponding inequality holds; for the proof of the lemma see, e.g., section 4.2 of [5]. Let us return to the proof of the proposition. From (5.2) and Assumption 2.4 we conclude the required bound; substituting β_i = γ√(N + 1), we conclude (5.5) from (5.1).
Lemma B.6. Let k* satisfy µ_{k*} ≤ µ(f) ≤ 2^{ρ−1} µ_{k*}. Then for any 1 ≤ k ≤ k* there exists a set A_k ⊂ Ω of probability at least 1 − kᾱ such that for ω ∈ A_k the points {y_k}^m_{k=1} generated by Algorithm 5.4 satisfy (B.18) and (B.19). Further, for k > k* there is a set C_k ⊂ Ω of probability at least 1 − (k − k*)ᾱ such that on C_k

(B.20) f(y_k) ≤ f(y_{k*}) + µ_{k*} R^ρ_{k*}.

Proof. Note that for k = 1, (B.18) is valid. Assume it is valid for some k ≥ 1. By (5.6) of Corollary 5.3 there exists a random set, let us call it B_k, such that Prob[B_k] ≥ 1 − ᾱ and on B_k the corresponding bound holds. On the other hand, by our inductive hypothesis, ‖y_{k−1} − x*‖ ≤ R_{k−1} on A_{k−1}. Let A_k = A_{k−1} ∩ B_k. Then on A_k we obtain the required bounds, which give (B.19) and (B.18) for k + 1.