Open Access

Estimating a Function and Its Derivatives Under a Smoothness Condition

Eunji Lim
[email protected]
https://orcid.org/0000-0003-1008-7050
Decision Sciences and Marketing, Adelphi University, Garden City, New York 11530-0701

Published Online: 2 May 2024, https://doi.org/10.1287/moor.2020.0161

Abstract

We consider the problem of estimating an unknown function $f_{*} : R^{d} \to R$ and its partial derivatives from a noisy data set of n observations, where we make no assumptions about $f_{*}$ except that it is smooth in the sense that it has square integrable partial derivatives of order m. A natural candidate for the estimator of $f_{*}$ in such a case is the best fit to the data set that satisfies a certain smoothness condition. This estimator can be seen as a least squares estimator subject to an upper bound on some measure of smoothness. Another useful estimator is the one that minimizes the degree of smoothness subject to an upper bound on the average of squared errors. We prove that these two estimators are computable as solutions to convex programs, establish the consistency of these estimators and their partial derivatives, and study the convergence rate as $n \to \infty$. The effectiveness of the estimators is illustrated numerically in a setting where the value of a stock option and its second derivative are estimated as functions of the underlying stock price.

1. Introduction

We are concerned with the problem of estimating an unknown function $f_{*} : R^{d} \to R$ and its partial derivatives over a domain ${[a, b]}^{d}$ of interest when there is no closed-form formula for $f_{*}$ ; hence, simulation must be conducted to estimate $f_{*} (x)$ at $x \in R^{d}$ , or only noisy data on $f_{*}$ are available. We make no assumptions about $f_{*}$ except that it has square integrable partial derivatives of order m.

To place the problem in a more rigorous setting, we consider the situation where we wish to estimate the unknown function $f_{*} : R^{d} \to R$ and its partial derivatives by taking observations $(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})$ satisfying $Y_{i} = f_{*} (X_{i}) + ε_{i}$ for $i = 1, 2, \dots, n$, where $((X_{i}, ε_{i}) : 1 \leq i \leq n)$ is a sequence of ${[a, b]}^{d} \times R$-valued independent and identically distributed (iid) random vectors satisfying $E (ε_{i} | X_{i}) = 0$ and $E (ε_{i}^{2} | X_{i}) = σ^{2} < \infty$ for $i = 1, 2, \dots, n$.

Under the assumption that $f_{*}$ has square integrable partial derivatives of order m with $m \geq 1$ , our goal is to estimate $f_{*}$ and its partial derivatives of order up to $m - 1$ as functions over ${[a, b]}^{d}$ from the data set $(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})$ .
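The observation model can be made concrete with a small simulation. The sketch below is illustrative only: the test function `f_star`, the uniform design on ${[a, b]}^{d}$, the Gaussian noise, and the helper name `simulate_data` are all assumptions chosen for the example, not part of the setting above, which only requires conditional mean zero and conditional variance $σ^{2}$.

```python
import numpy as np

def simulate_data(f_star, n, a=0.0, b=1.0, d=1, sigma=0.1, seed=0):
    """Draw (X_i, Y_i), i = 1..n, with Y_i = f_star(X_i) + eps_i."""
    rng = np.random.default_rng(seed)
    # Assumption: X_i uniform on [a, b]^d; the model only needs iid X_i in [a, b]^d.
    X = rng.uniform(a, b, size=(n, d))
    # Assumption: Gaussian noise; the model only needs E[eps|X] = 0, E[eps^2|X] = sigma^2.
    eps = rng.normal(0.0, sigma, size=n)
    return X, f_star(X) + eps

# Hypothetical smooth regression function used only for illustration.
f_star = lambda X: np.sin(2.0 * np.pi * X[:, 0])

X, Y = simulate_data(f_star, n=5000, sigma=0.1)
noise = Y - f_star(X)            # recovers eps_i because f_star is known here
noise_mean = float(np.mean(noise))
noise_var = float(np.var(noise))
```

With a known `f_star`, the sample mean and variance of the recovered noise should be close to $0$ and $σ^{2}$, which is a quick sanity check on the simulated data set.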

When the only information available on the unknown regression function $f_{*}$ is the fact that it is “smooth” so that it has square integrable partial derivatives of order m, one way to estimate $f_{*}$ is to fit a “smooth” function to the data set, that is, find the solution to the following problem:

$$\text{Minimize } E_{n} (f) ≜ \frac{1}{n} \sum_{i = 1}^{n} {(Y_{i} - f (X_{i}))}^{2} \tag{1}$$

over $f \in F$ for some appropriately chosen set $F$ of functions. The next question is “Which set is appropriate for $F$?” One natural candidate for $F$ is the set $F^{'}$ of functions $f : R^{d} \to R$ whose partial derivatives of order m are square integrable, that is, $F^{'} = \{f : R^{d} \to R : J (f) < \infty\}$, where

$$J (f) ≜ \sum_{| α | = m} \int_{R^{d}} {(D^{α} f (x))}^{2} \, d x,$$

$α = (α_{1}, \dots, α_{d})$ is a d-tuple of nonnegative integers, $| α |$ is the order of $α$ defined by $| α | = α_{1} + \dots + α_{d}$, and $D^{α}$ denotes the partial differential operator defined by

$$D^{α} = \frac{\partial^{α_{1} + \dots + α_{d}}}{\partial x_{1}^{α_{1}} \dots \partial x_{d}^{α_{d}}} .$$
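As a worked instance of the definition of $J (f)$, consider $m = 2$. For $d = 1$ the only multi-index of order 2 is $α = (2)$, and for $d = 2$ the multi-indices of order 2 are $(2, 0)$, $(1, 1)$, and $(0, 2)$. The expansion below simply follows the unweighted sum over $| α | = m$ as written above; no additional weights are assumed.

```latex
% d = 1, m = 2
J(f) = \int_{R} \left( f''(x) \right)^{2} dx

% d = 2, m = 2
J(f) = \int_{R^{2}} \left[
    \left( \frac{\partial^{2} f}{\partial x_{1}^{2}} \right)^{2}
  + \left( \frac{\partial^{2} f}{\partial x_{1} \partial x_{2}} \right)^{2}
  + \left( \frac{\partial^{2} f}{\partial x_{2}^{2}} \right)^{2}
\right] dx
```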

However, a question arises as to whether there exists a solution to (1). To guarantee the existence of a solution to (1), one of the properties that $F$ must have is the completeness with respect to a suitably chosen metric (or a pseudometric). It turns out that $F$ must be larger than $F^{'}$ to be complete. However, if we let $F$ be the set of “generalized functions” whose “weak” derivatives of order m are square integrable, then $F$ contains $F^{'}$ and is large enough to be a semi-Hilbert space in the $‖ \cdot ‖_{m}$ seminorm, where $‖ f ‖_{m} ≜ \sqrt{J (f)}$ ; see, for example, Meinguet [51, theorem 1 on p. 130]. (A weak derivative of a function f is a generalized version of the classical derivative in the sense that whenever the classical derivative $D^{α} f$ exists, it is also a weak derivative of f.) Hence, throughout this paper, we will assume $F = F_{m}$ , where $F_{m}$ is the space of generalized functions whose weak partial derivatives of order m are square integrable; see Section 2 of this paper and Oden and Reddy [54] for the precise definitions of $F_{m}$ , generalized functions, and weak derivatives. We will further restrict our attention to the functions f whose smoothness $J (f)$ is bounded by a certain constant. Thus, a natural candidate for the estimator of $f_{*}$ is the solution ${\hat{f}}_{n}$ to the following problem:

$$\text{Problem (A)}: \quad \text{Minimize}_{f \in F_{m}} \; E_{n} (f) \quad \text{subject to } J (f) \leq U_{n}$$

for some upper bound $U_{n}$ on $J (f)$. As Proposition 1 of this paper suggests, the solution to Problem (A) exists under the assumption that $2 m > d$ and $\{X_{1}, \dots, X_{n}\}$ is $P_{m - 1}$-unisolvent. For definitions, see Assumption 1 in Section 3.

The question now becomes how to compute the solution to Problem (A) numerically. As Proposition 6 of this paper states, the solution to Problem (A) can be found by solving a convex programming problem. However, it is not clear how to compute or estimate $U_{n}$ from the data set. This necessitates another approach that can estimate $f_{*}$ using a parameter that can be directly estimated from the data set. As an alternative to Problem (A), we consider the solution ${\hat{g}}_{n}$ to the following problem:

$$\text{Problem (B)}: \quad \text{Minimize}_{f \in F_{m}} \; J (f) \quad \text{subject to } E_{n} (f) \leq S_{n}$$

for some upper bound $S_{n}$ on the average of squared errors. Problem (B) is appealing from a computational point of view; it is intuitively acceptable that $S_{n}$ should be close to $σ^{2}$, and $σ^{2}$ can be readily estimated from the data set.
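One way to obtain a data-driven value for $S_{n}$ when $d = 1$ is a difference-based variance estimate. The sketch below uses the classical successive-difference estimator, which is a swapped-in technique chosen for illustration and is not claimed to be the estimator used in this paper; the function name `difference_based_variance` and the simulated data are assumptions.

```python
import numpy as np

def difference_based_variance(x, y):
    """Estimate sigma^2 for d = 1 from successive differences of y sorted by x.

    For smooth f and closely spaced x, y[i+1] - y[i] is approximately
    eps[i+1] - eps[i], whose variance is 2 * sigma^2.
    """
    order = np.argsort(x)
    diffs = np.diff(y[order])
    return float(np.sum(diffs ** 2) / (2.0 * (len(y) - 1)))

# Illustrative data: hypothetical smooth function plus Gaussian noise.
rng = np.random.default_rng(1)
n, sigma = 2000, 0.5
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, sigma, size=n)

S_n = difference_based_variance(x, y)   # candidate bound on the average squared error
```

The estimate relies on neighboring design points being close relative to how fast the regression function varies, so it is best read as a rough starting value for $S_{n}$ rather than a tuned choice.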

In the context of nonparametric regression techniques in the statistics literature, Problems (A) and (B) are often contrasted with the following formulation:

$$\text{Problem (C)}: \quad \text{Minimize}_{f \in F_{m}} \; E_{n} (f) + λ J (f),$$

where $λ$ is a constant called the smoothing parameter. The solution to Problem (C) is often referred to as the “penalized least squares estimator”; see Györfi et al. [27], Wahba [71], and the references therein. One weakness of the penalized least squares estimator is that its performance is highly sensitive to the choice of $λ$, and the question of how to choose the smoothing parameter $λ$ has been a challenging one. Cross validation (Craven and Wahba [15]) has been a popular method for choosing the smoothing parameter, but it does not guarantee the consistency of the estimators of the derivatives as $n \to \infty$. Thus, when the primary purpose of a modeler is to estimate a derivative of $f_{*}$ and he or she does not want to use any parameter that cannot be directly estimated from the data set, Problem (B) is preferred from the computational point of view.

This paper is motivated by a stock trader whose goal is to estimate “gamma,” which is the second derivative of the value, $f_{*}$ , of a stock option as a function of the underlying stock price. The only information that he has is the fact that gamma is a smooth function of the stock price. The trader can use Problem (C) to estimate $f_{*}$ and its second derivative, but he needs to estimate the smoothing parameter $λ$ . Cross validation does not guarantee the consistency of the estimator of the second derivative of $f_{*}$ as $n \to \infty$ . Moreover, the trader does not want to use any parameter that cannot be directly estimated from a data set. In this situation, Problem (B) better serves his purpose because its only parameter $S_{n}$ can be directly estimated from the data set. In fact, Problem (B) has been used widely within the numerical analysis community; see, for example, Cox [13, p. 530].

Despite its popularity in the numerical analysis community, little is known about how to compute ${\hat{f}}_{n}$ and ${\hat{g}}_{n}$ numerically and what statistical properties ${\hat{f}}_{n}$ and ${\hat{g}}_{n}$ possess. The contributions of this paper are showing that ${\hat{f}}_{n}$ and ${\hat{g}}_{n}$ are computable as solutions to convex programs, establishing the consistency of ${\hat{f}}_{n}$ and ${\hat{g}}_{n}$ as estimators of $f_{*}$ , establishing the consistency of $D^{α} {\hat{f}}_{n}$ and $D^{α} {\hat{g}}_{n}$ as estimators of $D^{α} f_{*}$ for $| α | = 1, \dots, m - 1$ , and computing the convergence rate of ${\hat{f}}_{n}$ when $n \to \infty$ . The key contributions of this paper can be summarized as follows.

We identify the relationship between Problems (A) and (B). We prove that for $U_{n}$ within a certain interval, the solution ${\hat{f}}_{n}$ to Problem (A) exists uniquely and becomes a unique solution to Problem (B) with $S_{n} = E_{n} ({\hat{f}}_{n})$. Conversely, for $S_{n}$ within a certain interval, the solution ${\hat{g}}_{n}$ to Problem (B) exists uniquely and becomes a unique solution to Problem (A) with $U_{n} = J ({\hat{g}}_{n})$.
We discuss the computational aspects of Problems (A) and (B). We show that the solutions to Problems (A) and (B) can be found by solving convex programming problems. There exist numerous efficient algorithms that can successfully find the solutions to convex programs (see, for example, Boyd and Vandenberghe [6], Zangwill [76]), so this enables us to easily compute the solutions to Problems (A) and (B).
We establish the consistency of ${\hat{f}}_{n}$ and ${\hat{g}}_{n}$ as estimators of $f_{*}$ and the consistency of $D^{α} {\hat{f}}_{n}$ and $D^{α} {\hat{g}}_{n}$ as estimators of $D^{α} f_{*}$ for $| α | = 1, \dots, m - 1$ as $n \to \infty$ . Our main results for Problem (A) state that if $U_{n} \geq J (f_{*})$ for n sufficiently large and ${lim sup}_{n \to \infty} U_{n} < \infty$ , we have
$\begin{array}{l} \sup_{x \in {[a, b]}^{d}} | {\hat{f}}_{n} (x) - f_{*} (x) | \to 0 \text{ almost surely (a.s.)}, \end{array}$
and we have
$\begin{array}{l} E {({\hat{f}}_{n} (X) - f_{*} (X))}^{2} \to 0 \text{ and } E {(D^{α} {\hat{f}}_{n} (X) - D^{α} f_{*} (X))}^{2} \to 0 \end{array}$
for $| α | = 1, \dots, m - 1$ as $n \to \infty$ under modest assumptions. On the other hand, our main results for Problem (B) state that under some conditions on ${\hat{g}}_{n}$ and $S_{n}$, we have
$\begin{array}{l} \sup_{x \in {[a, b]}^{d}} | {\hat{g}}_{n} (x) - f_{*} (x) | \to 0 \text{ a.s.}, \end{array}$
and we have
$\begin{array}{l} E {({\hat{g}}_{n} (X) - f_{*} (X))}^{2} \to 0 \text{ and } E {(D^{α} {\hat{g}}_{n} (X) - D^{α} f_{*} (X))}^{2} \to 0 \end{array}$
for $| α | = 1, \dots, m - 1$ as $n \to \infty$ .
We compute the convergence rate of ${\hat{f}}_{n}$ as $n \to \infty$ . Specifically, when $U_{n} \geq J (f_{*})$ for n sufficiently large and ${lim sup}_{n \to \infty} U_{n} < \infty$ , we have
$\begin{array}{l} {\left( \frac{1}{n} \sum_{i = 1}^{n} {({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \right)}^{1 / 2} = O_{p} (n^{- m / (2 m + d)}) \end{array}$
under the assumption that the $ε_{i}$ ’s are uniformly sub-Gaussian.
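To get a feel for the rate $n^{- m / (2 m + d)}$, the exponent can be tabulated for a few values of $m$ and $d$. The helper name `rate_exponent` below is hypothetical, and the check $2 m > d$ reflects the standing condition used elsewhere in this paper rather than anything specific to the rate formula.

```python
from fractions import Fraction

def rate_exponent(m, d):
    """Exponent r in the rate n**(-r), with r = m / (2m + d)."""
    if 2 * m <= d:
        raise ValueError("the results in this paper are stated for 2m > d")
    return Fraction(m, 2 * m + d)

# A few illustrative (m, d) pairs.
exponents = {(m, d): rate_exponent(m, d) for (m, d) in [(2, 1), (2, 2), (3, 2)]}
for (m, d), r in exponents.items():
    print(f"m={m}, d={d}: rate n^(-{r})")
```

The exponent increases with the smoothness order $m$ and decreases with the dimension $d$, which matches the usual pattern for nonparametric rates.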

1.1. Literature Review

This paper is a contribution to the literature of nonparametric regression for smooth functions. Problem (C) has been studied extensively in the literature, whereas relatively less attention has been paid to Problems (A) and (B). For Problem (C), the existence of a solution is established by Duchon [19], and the convergence rates are derived by Cox [14], Rice and Rosenblatt [57], Utreras [67], and van de Geer [69]. Craven and Wahba [15] and Utreras [66] discuss how to choose the smoothing parameter through cross validation. Duchon [19] has shown that the solution to Problem (C) has an explicit representation as a sum of basis functions. However, this approach often requires solving a system of linear equations that involves a dense and highly ill-conditioned matrix (see Dierckx [17, p. 145]). Numerically more efficient algorithms are developed by Hutchinson and de Hoog [33] and Luo and Wahba [47]. Schoenberg [60] has shown that a solution to Problem (A) is also a solution to Problem (C) for some $λ$ appropriately chosen, whereas Kersey [35] and Reinsch [56] studied the relationships between Problems (C) and (B). Green et al. [24] computed convergence rates of the Laplacian smoothing estimator, which can be viewed as a discrete approximation of Problem (C). For further references, see, for example, de Boor [16], Dierckx [17], Green and Silverman [23], Györfi et al. [27], Wahba [71], and Wegman and Wright [74]. For Problem (A), when $d = 1$ , the existence and uniqueness of the solution are established by Schoenberg [60], and the convergence rates are derived by van de Geer [68]. When $d \geq 2$ , we could not find any work concerning how to compute the solution to Problem (A) numerically or what statistical properties the solution to Problem (A) possesses. For Problem (B), when $d = 1$ , the existence and uniqueness of the solution are established by Reinsch [56]. 
However, we could not find any work concerning how to compute the solution to Problem (B) and what statistical properties the solution to Problem (B) possesses either when $d = 1$ or when $d \geq 2$ . Therefore, the main contribution of this paper is to establish a theoretical foundation of Problems (A) and (B) for the case when $d \geq 1$ by suggesting a convex programming formulation, establishing consistency, and computing the convergence rate.

This paper is also a contribution to the literature on shape constrained estimation. Many researchers have studied the problem of estimating the unknown regression function $f_{*}$ when $f_{*}$ is known to possess shape properties, such as monotonicity or convexity. One of the most popular approaches is least squares estimation under shape restrictions. For instance, when $f_{*}$ is known to be monotone, one can fit a monotone function to the data set by solving the following quadratic program:

$$\text{Minimize}_{f} \; E_{n} (f) \quad \text{subject to } f (X_{i}) \leq f (X_{j}) \text{ if } X_{i} \leq X_{j} \text{ for } 1 \leq i, j \leq n, \tag{2}$$

where $X_{i} \leq X_{j}$ is understood coordinate-wise. The solution to (2) is referred to as the isotonic regression estimator and has well-established theory. Brunk [7] and Brunk [8] introduced isotonic regression, established consistency, and computed convergence rates when $d = 1$. Mammen [48] considered estimation of a smooth monotone function when $d = 1$. Lim [44] established consistency of isotonic regression and computed its convergence rate when $d \geq 1$ and $f_{*}$ is possibly misspecified. On the other hand, when $f_{*}$ is known to be convex, the constraint of (2) can be replaced by the convexity condition, and the solution to the corresponding quadratic program is referred to as the convex regression estimator. Hildreth [30] first introduced convex regression when $d = 1$; Hanson and Pledger [28] established consistency when $d = 1$; Groeneboom et al. [26] computed the rate of convergence when $d = 1$; Kuosmanen [40] formulated the convex regression estimator as the solution to a quadratic program when $d \geq 1$; Lim and Glynn [46] and Seijo and Sen [61] established consistency of the multivariate convex regression estimator; Lim [44] established consistency of convex regression and computed its convergence rates when $d \geq 1$ and $f_{*}$ is possibly misspecified; Bertsimas and Mundru [4] and Lee et al. [43] proposed computationally efficient algorithms for estimating a concave monotone function and a convex function, respectively; Lim [44] and Yagi et al. [75] proposed hypothesis tests for detecting monotonicity and convexity; Kuosmanen and Johnson [42] discussed an interesting application of production function estimation and proposed a shape constrained estimator for such an application; and Kuosmanen and Johnson [41] established a connection between data envelopment analysis and least squares estimation under shape restrictions. Two drawbacks of the convex regression estimator are that (1) it becomes computationally expensive when n increases and that (2) it tends to overfit the data near the boundary of the domain; see, for example, Lim [45, p. 70]. To overcome these shortcomings, Yagi et al. [75] proposed a local polynomial kernel estimator with shape constraints, and Keshvari [36] fitted a convex piecewise linear function to data where the number of linear segments is prespecified. An advantage of Problem (B) is that one can readily compute an estimate of $S_{n}$ using sample variances. In the context of optimizing behavior in economics, Varian [70] similarly used sample variances to set bounds for acceptable deviations from the fitted function. For a comprehensive survey, see Johnson and Jiang [34] and the references therein. Although the previous literature has focused on monotonicity and convexity constraints, this paper demonstrates that one can work with the smoothness condition only. In particular, Problem (A) replaces the constraint of (2) by a condition on the degree of smoothness (i.e., $J (f) \leq U_{n}$). Therefore, this paper can be viewed as an addition to the literature on nonparametric regression subject to shape constraints.

This paper can be viewed as a contribution to the literature on metamodel estimation or response surface estimation, which has been an important subject of research in the simulation community. A common goal in this setting is to fit a metamodel to a simulation output, where the simulation runs are time consuming. Therefore, most work has focused on a setting where n is relatively small. One of the earlier works is the response surface methodology that is based on the idea of fitting a polynomial function to the data set locally; see, for example, Myers and Montgomery [52]. When the $Y_{i}$'s are possibly correlated, the kriging-based methods serve as good alternatives; see, for example, Ankenman et al. [3]. The kriging-based methods, in general, do not assume that the regression function $f_{*}$ is m times differentiable, where $m \geq 2$. Hence, they do not produce any estimator of the derivatives of $f_{*}$. Rather, the kriging estimator of $f_{*} (x)$ at any $x \in {[a, b]}^{d}$ is expressed as the weighted sum of the $Y_{i}$'s. The weights depend on $x$ and are calculated in a way in which the mean squared deviations are minimized. In Chen et al. [9], under the assumption that the regression function is differentiable, the gradient estimates are observed along with the $Y_{i}$'s, and they are used to obtain an estimator of $f_{*}$ that shows superior performance to the stochastic kriging estimator (see Ankenman et al. [3]). In the radial basis function estimators, the regression function is approximated by a linear combination of radially symmetric functions, where the coefficients and other parameters are estimated from the data set; see, for example, Franke [21]. The orthogonal series estimators or wavelet estimators represent the regression function by its Fourier series and estimate the coefficients from the data set; see, for example, Donoho and Johnstone [18].
Nonparametric regression methods, such as the kernel regression method (Nadaraya [53], Watson [73]), the local polynomial regression estimators (Cleveland [12]), the smoothing splines (Reinsch [55]), the weighted least squares regression estimators (Ruppert and Wand [58], Salemi et al. [59]), and the nearest neighbor estimators (Shapiro [62], Stone [64]), have received considerable attention in the statistics literature; see, for example, Eubank [20], Green and Silverman [23], Härdle [29], and Wand and Jones [72] for comprehensive surveys. Most of these estimators invariably require the estimation of smoothing parameters that cannot be estimated directly from the data set. In this regard, Problem (B) has the advantage of estimating only one parameter, $S_{n}$ , which has a direct meaning and can be estimated from the sample variances computed from the data set.

1.2. Organization of This Paper

This paper is organized as follows. Section 2 introduces some definitions and preliminaries. We precisely describe the relationship between Problems (A) and (B) in Section 3, whereas Section 4 is concerned with the convex programming formulations of Problems (A) and (B). Section 5 states the main results regarding consistency and convergence rates, and the numerical results can be found in Section 6. Section 7 provides discussions on future research topics. The proofs of all theorems, corollaries, lemmas, and propositions are provided in the Appendix.

2. Definitions and Preliminaries

For $x = (x_{1}, \dots, x_{d}) \in R^{d}$, its norm is given by $‖ x ‖ = {\left( \sum_{j = 1}^{d} x_{j}^{2} \right)}^{1 / 2}$. For a sequence of random variables $(Z_{n} : n \geq 1)$ and a sequence of positive real numbers $(α_{n} : n \geq 1)$, we say $Z_{n} = O_{p} (α_{n})$ as $n \to \infty$ if for any $ϵ > 0$, there exist constants C and N such that $P (| Z_{n} / α_{n} | > C) < ϵ$ for all $n \geq N$.

When $2 m > d$, we define $F_{m}$ as the space of generalized functions whose weak partial derivatives of order m are square integrable, that is,

$$F_{m} = \left\{ f \in D^{'} (R^{d}) : \int_{R^{d}} {(D^{α} f (x))}^{2} \, d x < \infty \text{ for all } α \text{ such that } | α | = m \right\},$$

where $D^{'} (R^{d})$ is the space of Schwartz distributions. A test function on $R^{d}$ is a real-valued, infinitely differentiable function with compact support. Let $D (R^{d})$ be the set of all test functions on $R^{d}$. A functional on $D (R^{d})$ is a mapping that assigns to each $ϕ \in D (R^{d})$ a real number. A (Schwartz) distribution on $R^{d}$ is a continuous linear functional on $D (R^{d})$. See Adams and Fournier [2, p. 20] or Oden and Reddy [54, p. 14] for detailed definitions. Let ${(\cdot, \cdot)}_{m}$ denote the semi-inner product on $F_{m}$, defined by

$${(f, g)}_{m} ≜ \sum_{| α | = m} \int_{R^{d}} D^{α} f (x) \, D^{α} g (x) \, d x$$

for $f, g \in F_{m}$. The associated seminorm $‖ \cdot ‖_{m}$ is defined by $‖ f ‖_{m} ≜ {(f, f)}_{m}^{1 / 2}$. It should be noted that the kernel of this seminorm is the vector space $P_{m - 1}$ of all polynomials of total degree less than or equal to $m - 1$ defined on $R^{d}$, the dimension of $P_{m - 1}$ is

$$M ≜ \binom{m + d - 1}{d}, \tag{3}$$

and $P_{m - 1}$ is spanned by the M monomials of total degree less than or equal to $m - 1$. We will denote the M monomials of total degree less than or equal to $m - 1$ by $p_{1}, \dots, p_{M}$.

It should be also noted that $‖ \cdot ‖_{m}$ is a seminorm on $F_{m}$ , but a norm on the equivalence classes in $F_{m} / P_{m - 1}$ , and $F_{m} / P_{m - 1}$ is a Hilbert space under ${(\cdot, \cdot)}_{m}$ ; see Meinguet [51, theorem 1, p. 130] for details.
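The monomial basis $p_{1}, \dots, p_{M}$ and the count in (3) can be checked by direct enumeration. The sketch below lists the exponent vectors of all monomials of total degree at most $m - 1$ and evaluates them at given points; the function names and the ordering of the monomials are assumptions made for illustration, since any fixed ordering works.

```python
from itertools import product
from math import comb
import numpy as np

def monomial_exponents(m, d):
    """All multi-indices alpha in N^d with alpha_1 + ... + alpha_d <= m - 1."""
    return [a for a in product(range(m), repeat=d) if sum(a) <= m - 1]

def evaluate_monomials(X, exponents):
    """Matrix T with T[j, k] = p_k(X_j) = prod_l X[j, l] ** alpha_k[l]."""
    X = np.asarray(X, dtype=float)          # shape (n, d)
    return np.stack([np.prod(X ** np.array(a), axis=1) for a in exponents], axis=1)

m, d = 3, 2
exps = monomial_exponents(m, d)             # exponent vectors of p_1, ..., p_M
M = comb(m + d - 1, d)                      # dimension of P_{m-1} from (3)
T = evaluate_monomials([[2.0, 3.0]], exps)  # monomials evaluated at x = (2, 3)
```

For $m = 3$ and $d = 2$ this produces the six monomials $1, x_{2}, x_{2}^{2}, x_{1}, x_{1} x_{2}, x_{1}^{2}$, in agreement with $M = 6$.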

For any open bounded subset $Ω \subset R^{d}$, we define $L_{2} (Ω)$ as the set of functions f defined on $Ω$ satisfying $\int_{Ω} {f (x)}^{2} d x < \infty$. Its $L_{2} (Ω)$ norm is defined as follows:

$$‖ f ‖_{L_{2} (Ω)} = {\left( \int_{Ω} {f (x)}^{2} \, d x \right)}^{1 / 2}$$

for $f \in L_{2} (Ω)$. We also define $W_{m} (Ω)$ as the set of functions whose weak partial derivatives up to order m are in $L_{2} (Ω)$, that is,

$$W_{m} (Ω) = \{ f \in L_{2} (Ω) : D^{α} f \in L_{2} (Ω) \text{ for all } α \text{ such that } | α | \leq m \} .$$

Its $W_{m} (Ω)$ norm is defined as follows:

$$‖ f ‖_{W_{m} (Ω)} = {\left( \int_{Ω} \sum_{| α | \leq m} {(D^{α} f (x))}^{2} \, d x \right)}^{1 / 2}$$

for $f \in W_{m} (Ω)$. $W_{m} (Ω)$ is referred to as a Sobolev space of order m on $Ω$.

3. Relationships Between Problems (A) and (B)

In this section, we study the relationship between Problems (A) and (B). When $d = 1$ , Schoenberg [60] established some results similar to Lemma 1, Proposition 3, Proposition 4, and Proposition 5 of this paper. Thus, some of our results in this section can be viewed as extensions of those in Schoenberg [60] to the multidimensional case. Throughout this section, we assume that $(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})$ are given and that the following assumption holds.

Assumption 1.

m is a positive integer satisfying $2 m > d$, and $\{X_{1}, \dots, X_{n}\}$ is a set of mutually distinct points containing a $P_{m - 1}$-unisolvent set. A set of points $\{x_{1}, \dots, x_{L}\} \subset R^{d}$ is called $P_{m - 1}$-unisolvent if for any $p \in P_{m - 1}$, the condition $p (x_{j}) = 0$ for all $1 \leq j \leq L$ implies $p (x) = 0$ for all $x \in R^{d}$. In particular, when $d = 1$, $\{x_{1}, \dots, x_{L}\}$ is $P_{m - 1}$-unisolvent if $L \geq m$ and $x_{1}, \dots, x_{L}$ are mutually distinct.
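Unisolvence can be checked numerically through a rank condition. Writing $p = \sum_{k} c_{k} p_{k}$ in the monomial basis, $p$ vanishes at every point exactly when $V c = 0$ for the matrix $V$ with entries $p_{k} (x_{j})$, so the point set is $P_{m - 1}$-unisolvent exactly when $V$ has full column rank. The sketch below is one possible check; the function name `is_unisolvent` is an assumption, and the rank is computed in floating point, so nearly degenerate configurations may be misclassified.

```python
from itertools import product
import numpy as np

def is_unisolvent(points, m):
    """Check P_{m-1}-unisolvence of `points` (array-like of shape (L, d))."""
    P = np.asarray(points, dtype=float)
    d = P.shape[1]
    # Exponent vectors of all monomials of total degree <= m - 1.
    exps = [a for a in product(range(m), repeat=d) if sum(a) <= m - 1]
    # V[j, k] = p_k(x_j); unisolvent iff V c = 0 forces c = 0.
    V = np.stack([np.prod(P ** np.array(a), axis=1) for a in exps], axis=1)
    return bool(np.linalg.matrix_rank(V) == len(exps))

# d = 2, m = 2: P_1 is the space of affine functions.
triangle = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # not collinear
collinear = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # all on the line x_1 = x_2
print(is_unisolvent(triangle, 2), is_unisolvent(collinear, 2))
```

The collinear example fails because the nonzero affine function $x_{1} - x_{2}$ vanishes at all three points.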

Propositions 1 and 2 establish the existence of the solutions to Problems (A) and (B) under Assumption 1.

Proposition 1.

Let $U_{n} \geq 0$ be given. Under Assumption 1, there exists a solution to Problem (A). Furthermore, if ${\hat{f}}_{n}$ and ${\tilde{f}}_{n}$ are two solutions to Problem (A), then ${\hat{f}}_{n} (X_{i}) = {\tilde{f}}_{n} (X_{i})$ for $1 \leq i \leq n$ .

Proposition 2.

Let $S_{n} \geq 0$ be given. Under Assumption 1, there exists a solution to Problem (B). Furthermore, if ${\hat{g}}_{n}$ and ${\tilde{g}}_{n}$ are two solutions to Problem (B), then they differ by a polynomial of total degree less than or equal to $m - 1$ (i.e., ${\hat{g}}_{n} - {\tilde{g}}_{n} \in P_{m - 1}$ ).

Our next propositions, Propositions 3, 4, and 5, are concerned with the uniqueness of the solutions to Problems (A) and (B). For our analysis, we need to define the following function: $Ψ_{n} : [0, \infty) \to R$ . For any nonnegative real number u, let $Ψ_{n} (u) = E_{n} (f^{u})$ , where $f^{u}$ is a solution to Problem (A) with $U_{n} = u$ . The following lemma characterizes the shape of $Ψ_{n}$ .

Lemma 1.

There exists a minimizer, say $f_{P}$ , of $E_{n} (f)$ over $f \in P_{m - 1}$ . Under Assumption 1, there exists a unique solution, say $f_{I}$ , to the following problem:

$$\text{Minimize}_{f \in F_{m}} \; J (f) \quad \text{subject to } f (X_{i}) = Y_{i}, \quad 1 \leq i \leq n .$$

Under Assumption 1, $Ψ_{n}$ is nonnegative, convex, and strictly decreasing over $[0, J (f_{I})]$ with $Ψ_{n} (0) = E_{n} (f_{P})$ and $Ψ_{n} (u) = 0$ for all $u \geq J (f_{I})$.

The following propositions discuss the uniqueness of the solutions to Problems (A) and (B) and the relationship between them.

Proposition 3.

Assume Assumption 1. Let $0 \leq U_{n} < J (f_{I})$ be given. Then, there exists a unique solution ${\hat{f}}_{n}$ to Problem (A), and ${\hat{f}}_{n}$ satisfies $J ({\hat{f}}_{n}) = U_{n}$ . In other words, if ${\hat{f}}_{n}$ and ${\tilde{f}}_{n}$ are two solutions to Problem (A), then ${\hat{f}}_{n} (x) = {\tilde{f}}_{n} (x)$ for all $x \in R^{d}$ . Furthermore, ${\hat{f}}_{n}$ is also a solution to Problem (B) with $S_{n} = E_{n} ({\hat{f}}_{n})$ .

Proposition 4.

Assume Assumption 1. Let $0 < S_{n} \leq E_{n} (f_{P})$ be given. Then, there exists a unique solution ${\hat{g}}_{n}$ to Problem (B), and ${\hat{g}}_{n}$ satisfies $E_{n} ({\hat{g}}_{n}) = S_{n}$ . In other words, if ${\hat{g}}_{n}$ and ${\tilde{g}}_{n}$ are two solutions to Problem (B), then ${\hat{g}}_{n} (x) = {\tilde{g}}_{n} (x)$ for all $x \in R^{d}$ . Furthermore, ${\hat{g}}_{n}$ is also a solution to Problem (A) with $U_{n} = J ({\hat{g}}_{n})$ .

Combining Propositions 3 and 4 yields the following proposition.

Proposition 5.

Assume Assumption 1.

Let $0 \leq U_{n} \leq J (f_{I})$ be given. Then, there exists a unique solution ${\hat{f}}_{n}$ to Problem (A), and ${\hat{f}}_{n}$ satisfies $J ({\hat{f}}_{n}) = U_{n}$ . Furthermore, ${\hat{f}}_{n}$ is also a unique solution to Problem (B) with $S_{n} = E_{n} ({\hat{f}}_{n})$ .
Let $0 < S_{n} \leq E_{n} (f_{P})$ be given. Then, there exists a unique solution ${\hat{g}}_{n}$ to Problem (B), and ${\hat{g}}_{n}$ satisfies $E_{n} ({\hat{g}}_{n}) = S_{n}$ . Furthermore, ${\hat{g}}_{n}$ is also a unique solution to Problem (A) with $U_{n} = J ({\hat{g}}_{n})$ .

The proofs of Lemma 1 and Propositions 1–5 are provided in Appendix A.

4. Convex Programming Formulations for Problems (A) and (B)

In this section, we discuss the question of how to compute the solutions to Problems (A) and (B) numerically. The following propositions, Propositions 6 and 7, reveal that we can find the solutions to Problems (A) and (B) by solving convex programs. Their proofs are provided in Appendix A.

We start with some definitions. Let $s_{1}, \dots, s_{M}$ be any $P_{m - 1}$ -unisolvent set of M fixed points in $R^{d}$ . (M is given in (3).) Let $q_{1}, \dots, q_{M}$ be the unique polynomials of total degree less than m satisfying $q_{i} (s_{j}) = 1$ if $i = j$ and $q_{i} (s_{j}) = 0$ if $i \neq j$ , that is,

$$q_{i} (x) = \begin{bmatrix} p_{1} (x) & \dots & p_{M} (x) \end{bmatrix} {\begin{bmatrix} p_{1} (s_{1}) & \dots & p_{M} (s_{1}) \\ \vdots & & \vdots \\ p_{1} (s_{M}) & \dots & p_{M} (s_{M}) \end{bmatrix}}^{- 1} e_{i}$$

for $x \in R^{d}$ and $1 \leq i \leq M$, where $e_{i}$ is the $M \times 1$ column vector whose ith entry is one and zero elsewhere. Let $R : R^{d} \times R^{d} \to R$ be defined by

$$R (s, t) = {(- 1)}^{m} \left[ K_{m} (s - t) - \sum_{i = 1}^{M} q_{i} (t) K_{m} (s - s_{i}) - \sum_{j = 1}^{M} q_{j} (s) K_{m} (s_{j} - t) + \sum_{i = 1}^{M} \sum_{j = 1}^{M} q_{i} (s) q_{j} (t) K_{m} (s_{i} - s_{j}) \right]$$

for $s, t \in R^{d}$, where $K_{m} : R^{d} \to R$ is given by

$$K_{m} (z) = \begin{cases} θ_{m, d} ‖ z ‖^{2 m - d} \ln ‖ z ‖, & \text{if } 2 m - d \text{ is even} \\ θ_{m, d} ‖ z ‖^{2 m - d}, & \text{otherwise} \end{cases}$$

for $z \in R^{d}$, where

$$θ_{m, d} = \begin{cases} \dfrac{{(- 1)}^{d / 2 + 1}}{2^{2 m - 1} π^{d / 2} (m - 1)! \, (m - d / 2)!}, & \text{if } 2 m - d \text{ is even} \\ \dfrac{{(- 1)}^{m} Γ (d / 2 - m)}{2^{2 m} π^{d / 2} (m - 1)!}, & \text{otherwise} \end{cases}$$

and $Γ (\cdot)$ denotes the gamma function.
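The definitions of $θ_{m, d}$, $K_{m}$, $q_{i}$, and $R$ translate directly into code. The sketch below is a plain transcription under stated assumptions: the function names are hypothetical, $K_{m} (0)$ is set to its limiting value $0$ (valid because $2 m > d$), the monomial ordering is arbitrary, and the number of nodes is assumed to equal $M$ so that the matrix of monomial values is square and invertible.

```python
from itertools import product
from math import factorial, gamma, pi
import numpy as np

def theta(m, d):
    """Constant theta_{m,d}; 2m - d is even exactly when d is even."""
    if (2 * m - d) % 2 == 0:
        return (-1.0) ** (d // 2 + 1) / (
            2 ** (2 * m - 1) * pi ** (d / 2) * factorial(m - 1) * factorial(m - d // 2))
    return (-1.0) ** m * gamma(d / 2 - m) / (2 ** (2 * m) * pi ** (d / 2) * factorial(m - 1))

def K(z, m, d):
    """K_m(z), with K_m(0) taken as the limit 0 (assumes 2m > d)."""
    r = float(np.linalg.norm(np.atleast_1d(z)))
    if r == 0.0:
        return 0.0
    if (2 * m - d) % 2 == 0:
        return theta(m, d) * r ** (2 * m - d) * np.log(r)
    return theta(m, d) * r ** (2 * m - d)

def make_R(nodes, m):
    """Build R(s, t) from a P_{m-1}-unisolvent set of M nodes, shape (M, d)."""
    S = np.asarray(nodes, dtype=float)
    d = S.shape[1]
    exps = [a for a in product(range(m), repeat=d) if sum(a) <= m - 1]
    mono = lambda x: np.array([np.prod(np.asarray(x, float) ** np.array(a)) for a in exps])
    A_inv = np.linalg.inv(np.array([mono(s) for s in S]))   # inverse of [p_k(s_j)]
    q = lambda x: mono(x) @ A_inv                           # (q_1(x), ..., q_M(x))
    Kss = np.array([[K(si - sj, m, d) for sj in S] for si in S])

    def R(s, t):
        s, t = np.asarray(s, float), np.asarray(t, float)
        ks = np.array([K(s - si, m, d) for si in S])        # K_m(s - s_i)
        kt = np.array([K(sj - t, m, d) for sj in S])        # K_m(s_j - t)
        val = K(s - t, m, d) - q(t) @ ks - q(s) @ kt + q(s) @ Kss @ q(t)
        return float((-1.0) ** m * val)

    return R

# d = 1, m = 2 with nodes s_1 = 0 and s_2 = 1 (an illustrative choice).
R = make_R([[0.0], [1.0]], m=2)
value = R([0.5], [0.5])
```

For $d = 1$ and $m = 2$ the constant reduces to $θ_{2, 1} = 1 / 12$, so $K_{2} (z) = | z |^{3} / 12$. Two properties that follow from the formula are symmetry, $R (s, t) = R (t, s)$, and $R (s_{k}, t) = 0$ at every node $s_{k}$.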

Proposition 6.

Let $U_{n} \geq 0$ be given. Under Assumption 1, a solution to Problem (A) exists by Proposition 1. Furthermore, for any solution ${\tilde{f}}_{n}$ to Problem (A), there exists a solution ${\hat{f}}_{n}$ to Problem (A), satisfying ${\tilde{f}}_{n} (X_{i}) = {\hat{f}}_{n} (X_{i})$ for $1 \leq i \leq n$ , which can be represented by

$${\hat{f}}_{n} (x) = \sum_{i = 1}^{M} {\hat{c}}_{i} p_{i} (x) + \sum_{i = 1}^{n} {\hat{d}}_{i} R (x, X_{i}) \tag{4}$$

for $x \in R^{d}$, where $({\hat{c}}_{1}, \dots, {\hat{c}}_{M}, {\hat{d}}_{1}, \dots, {\hat{d}}_{n}, {\hat{f}}_{n} (X_{1}), \dots, {\hat{f}}_{n} (X_{n}))$ is a solution to the following convex program in the decision variables $c_{1}, \dots, c_{M}, d_{1}, \dots, d_{n}, y_{1}, \dots, y_{n} \in R$:

$$\begin{array}{ll} \text{Minimize} & \frac{1}{n} \sum_{i = 1}^{n} {(Y_{i} - y_{i})}^{2} \\ \text{subject to} & \sum_{i = 1}^{n} \sum_{j = 1}^{n} R (X_{i}, X_{j}) d_{i} d_{j} \leq U_{n} \\ & \sum_{i = 1}^{M} c_{i} p_{i} (X_{j}) + \sum_{i = 1}^{n} d_{i} R (X_{j}, X_{i}) = y_{j}, \quad 1 \leq j \leq n . \end{array} \tag{5}$$

Conversely, any function of form (4), where $({\hat{c}}_{1}, \dots, {\hat{c}}_{M}, {\hat{d}}_{1}, \dots, {\hat{d}}_{n}, {\hat{f}}_{n} (X_{1}), \dots, {\hat{f}}_{n} (X_{n}))$ is any solution to (5), is a solution to Problem (A).

Proposition 7.

Let $S_{n} \geq 0$ be given. Under Assumption 1, a solution to Problem (B) exists by Proposition 2. Furthermore, ${\hat{g}}_{n}$ is a solution to Problem (B) if and only if ${\hat{g}}_{n}$ is represented by

{\hat{g}}_{n} (x) = \sum_{i = 1}^{M} {\hat{c}}_{i} p_{i} (x) + \sum_{i = 1}^{n} {\hat{d}}_{i} R (x, X_{i})

(6)

for $x \in R^{d}$, where $({\hat{c}}_{1}, \dots, {\hat{c}}_{M}, {\hat{d}}_{1}, \dots, {\hat{d}}_{n}, {\hat{g}}_{n} (X_{1}), \dots, {\hat{g}}_{n} (X_{n}))$ is a solution to the following convex program in the decision variables $c_{1}, \dots, c_{M}, d_{1}, \dots, d_{n}, y_{1}, \dots, y_{n} \in R$:

\begin{array}{ll} \text{Minimize} & \sum_{i = 1}^{n} \sum_{j = 1}^{n} R (X_{i}, X_{j}) d_{i} d_{j} \\ \text{subject to} & \frac{1}{n} \sum_{i = 1}^{n} {(Y_{i} - y_{i})}^{2} \leq S_{n} \\ & \sum_{i = 1}^{M} c_{i} p_{i} (X_{j}) + \sum_{i = 1}^{n} d_{i} R (X_{j}, X_{i}) = y_{j}, \quad 1 \leq j \leq n . \end{array}

Remark 1.

It should be noted that when $d = 1$ , the representation in (4) coincides with the form of a natural spline; see Greville [25, equation 4.1 on p. 3] for the details.

Remark 2.

Proposition 6 reveals that for any solution ${\tilde{f}}_{n}$ to Problem (A), there exists a solution ${\hat{f}}_{n}$ to Problem (A) satisfying ${\tilde{f}}_{n} (X_{i}) = {\hat{f}}_{n} (X_{i})$ for $1 \leq i \leq n$ , which is a linear combination of the $p_{i}$ ’s and $R (\cdot, X_{i})$ ’s. In addition, Proposition 7 implies that a solution to Problem (B) is always a linear combination of the $p_{i}$ ’s and $R (\cdot, X_{i})$ ’s.

5. Consistency and Rates of Convergence

In this section, we focus on the consistency and convergence rates of the solutions to Problems (A) and (B). Toward this goal, we need the following assumptions.

Assumption 2.

(i) $2 m > d$.
(ii) $X, X_{1}, X_{2}, \dots$ are iid ${[a, b]}^{d}$-valued random vectors having a common positive continuous density function $τ : {[a, b]}^{d} \to R$. This implies that there exist $τ_{1}, τ_{2} > 0$ such that $τ_{1} \leq τ (x) \leq τ_{2}$ for all $x \in {[a, b]}^{d}$.

Assumption 3.

(i) $f_{*} \in F_{m}$.
(ii) $(X, Y), (X_{1}, Y_{1}), (X_{2}, Y_{2}), \dots$ is a sequence of iid ${[a, b]}^{d} \times R$-valued random vectors satisfying $Y = f_{*} (X) + ε$ and $Y_{i} = f_{*} (X_{i}) + ε_{i}$ for $i = 1, 2, \dots$.
(iii) $E (ε | X) = E (ε_{i} | X_{i}) = 0$ and $E (ε^{2} | X) = E (ε_{i}^{2} | X_{i}) = σ^{2} < \infty$ for $i = 1, 2, \dots$.

Assumption 4.

$ε, ε_{1}, ε_{2}, \dots$ are uniformly sub-Gaussian random variables (i.e., there exist positive real numbers A and B such that $E (\exp (t ε)) \leq A \exp (B t^{2})$ for any $t \in R$ ).

Assumption 5.

(i) $E (f_{*} {(X)}^{2}) < \infty$.
(ii) $f_{*}$, restricted to ${[a, b]}^{d}$, is not a polynomial of degree less than or equal to $m - 1$.
(iii) $f_{*}$ is continuous on ${[a, b]}^{d}$.

We need Assumption 2(i) for the following reason. When $2 m > d$ is not assumed, a function in $F_{m}$ is not well defined on a set of measure 0 because the functions in $F_{m}$ are determined only outside a set of measure 0. Under the assumption $2 m > d$ , it is known that the functions in $F_{m}$ are continuous, so the point evaluation is well defined; see Meinguet [51, theorem 1 on p. 130].

The next lemma shows that Assumption 2 guarantees a very important property of the $X_{i}$ ’s, the so-called $P_{m - 1}$ -unisolvency. The proof of Lemma 2 is provided in Appendix A.

Lemma 2.

Under Assumption 2, $\{X_{1}, \dots, X_{n}\}$ is a set of mutually distinct points containing a $P_{m - 1}$-unisolvent set for n sufficiently large a.s. In other words, Assumption 2 implies Assumption 1 for n sufficiently large a.s.

Lemma 2 ensures that the solutions to Problems (A) and (B) exist for n sufficiently large a.s. under Assumption 2.

We are now ready to present our main results. Theorems 1 and 2 reveal that ${\hat{f}}_{n}$ and ${\hat{g}}_{n}$ are consistent estimators of $f_{*}$ and that $D^{α} {\hat{f}}_{n}$ and $D^{α} {\hat{g}}_{n}$ are consistent estimators of $D^{α} f_{*}$ for $| α | = 1, \dots, m - 1$. Theorem 3 computes a lower bound on the rate of convergence of ${\hat{f}}_{n}$. We discuss the rate of convergence of ${\hat{g}}_{n}$ in Remark 4.

Theorems 1 and 2 use the following assumptions on ${\hat{f}}_{n}$ and ${\hat{g}}_{n}$ .

Assumption 6.

There exists a constant $c_{A}$ such that ${lim sup}_{n \to \infty} J ({\hat{f}}_{n}) \leq c_{A}$ for all n sufficiently large a.s.

Assumption 7.

There exists a constant $c_{B}$ such that ${lim sup}_{n \to \infty} J ({\hat{g}}_{n}) \leq c_{B}$ for all n sufficiently large a.s.

Assumption 8.

\frac{1}{n} \sum_{i = 1}^{n} ({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i}))^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i})) + A_{n}

for a sequence of random variables $(A_{n} : n \geq 1)$ satisfying ${lim sup}_{n \to \infty} A_{n} \leq 0$ a.s.

Assumption 9.

\frac{1}{n} \sum_{i = 1}^{n} {({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i})) + B_{n}

for a sequence of random variables $(B_{n} : n \geq 1)$ satisfying ${lim sup}_{n \to \infty} B_{n} \leq 0$ a.s.

Although Assumptions 6–9 are somewhat hard to verify, we can establish some of them by imposing conditions on $U_{n}$ and $S_{n}$. For example, Assumptions 6, 8, and 9 are implied by Assumptions 10, 11, and 12, respectively.

Assumption 10.

${lim sup}_{n \to \infty} U_{n} < \infty$ .

Assumption 11.

$J (f_{*}) \leq U_{n}$ for all n sufficiently large a.s.

Assumption 12.

${lim sup}_{n \to \infty} S_{n} \leq σ^{2}$ .

We are now ready to present Theorem 1, Theorem 2, Corollary 1, and Corollary 2.

Theorem 1.

Under Assumptions 2, 3, 6, and 8,

\sup_{x \in {[a, b]}^{d}} | {\hat{f}}_{n} (x) - f_{*} (x) | \to 0 \quad \text{a.s.},

(7)

E {({\hat{f}}_{n} (X) - f_{*} (X))}^{2} \to 0 \quad \text{and} \quad E {(D^{α} {\hat{f}}_{n} (X) - D^{α} f_{*} (X))}^{2} \to 0

(8)

for $| α | = 1, \dots, m - 1$ as $n \to \infty$.

Theorem 2.

Under Assumptions 2, 3, 7, and 9,

\sup_{x \in {[a, b]}^{d}} | {\hat{g}}_{n} (x) - f_{*} (x) | \to 0 \quad \text{a.s.},

(9)

E {({\hat{g}}_{n} (X) - f_{*} (X))}^{2} \to 0 \quad \text{and} \quad E {(D^{α} {\hat{g}}_{n} (X) - D^{α} f_{*} (X))}^{2} \to 0

(10)

for $| α | = 1, \dots, m - 1$ as $n \to \infty$.

Corollary 1.

Under Assumptions 2, 3, 10, and 11, (7) and (8) hold.

Corollary 2.

Under Assumptions 2, 3, 7, and 12, (9) and (10) hold.

We now turn to the convergence rate of ${\hat{f}}_{n}$ . We need Assumption 13, which is implied by Assumption 11.

Assumption 13.

\frac{1}{n} \sum_{i = 1}^{n} ({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i}))^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i}))

for all n sufficiently large a.s.

Under Assumption 13, the convergence rate of ${\hat{f}}_{n}$ is established in Theorem 3.

Theorem 3.

Under Assumptions 2, 3, 4, 6, and 13,

\left\{ \frac{1}{n} \sum_{i = 1}^{n} {({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \right\}^{1 / 2} = O_{p} (n^{- m / (2 m + d)})

(11)

as $n \to \infty$.

Corollary 3.

Under Assumptions 2, 3, 4, 10, and 11, (11) holds.

Remark 3.

It should be noted that the rate exponent $- m / (2 m + d)$ established in Theorem 3 is identical to the optimal rate of convergence for the estimation of a function of smoothness of order m, which was established by Stone [65]. However, as Corollary 3 states, this rate is achieved under Assumptions 10 and 11, which may be hard to verify if sufficient information on $J (f_{*})$ is not available. One should also note that there are many estimators that achieve this optimal rate. See Kohler et al. [38] for a kernel estimator; Cox [14] for a smoothing spline estimator; Györfi et al. [27, theorems 4.3, 5.2, and 6.2] for partitioning, kernel, and nearest neighbor estimators, respectively; and Kohler et al. [37] for partitioning and nearest neighbor estimators. Furthermore, Stone [65] established that the optimal rate of convergence for the estimation of the rth derivative of a function of smoothness of order m is $- (m - r) / (2 m + d)$ for $r = 0, 1, \dots, m - 1$, and this rate is achieved by the smoothing spline estimator proposed by Cox [14].
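To make the exponents in Remark 3 concrete, the short sketch below evaluates $- (m - r) / (2 m + d)$ exactly. The function name `rate_exponent` is hypothetical, and the values $m = 4$, $d = 1$ are chosen only because they match the one-dimensional examples of Section 6.

```python
from fractions import Fraction

def rate_exponent(m: int, d: int, r: int = 0) -> Fraction:
    """Exponent -(m - r) / (2m + d) for the r-th derivative, r = 0, ..., m - 1."""
    if not 0 <= r <= m - 1:
        raise ValueError("r must satisfy 0 <= r <= m - 1")
    return Fraction(-(m - r), 2 * m + d)

# With m = 4 and d = 1: the function itself (r = 0) and its second derivative (r = 2).
exponent_function = rate_exponent(4, 1)
exponent_second_derivative = rate_exponent(4, 1, r=2)
```

Under these values, the exponent is $- 4 / 9$ for the function and $- 2 / 9$ for its second derivative, which illustrates that derivatives are estimated at a slower rate.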

We next discuss a situation where the solution to Problem (B) exists uniquely.

Theorem 4.

Consider the problem:

Minimize E {(c_{1} p_{1} (X) + \dots + c_{M} p_{M} (X) - f_{*} (X))}^{2}

(12)

over $c_{1}, \dots, c_{M} \in R$. Under Assumption 5, (12) has a solution, and the minimizing value $ν^{*}$ of (12) is positive. Furthermore, if ${lim sup}_{n \to \infty} S_{n} < σ^{2} + ν^{*}$ a.s., then under Assumption 2, Assumption 3(ii), Assumption 3(iii), and Assumption 5, the solution to Problem (B) exists uniquely for n sufficiently large a.s.

The proofs of Theorems 1–4 and Corollaries 1–3 are provided in Appendix B.

Remark 4.

The same convergence rate as in Theorem 3 can be obtained for ${\hat{g}}_{n}$ under the assumption

\frac{1}{n} \sum_{i = 1}^{n} {({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))

(13)

for all n sufficiently large a.s. (Arguments similar to those in the proof of Theorem 3 can be applied.) However, we were not able to find a suitable condition on $S_{n}$ under which (13) holds.

Remark 5.

In the case where the error $ε_{1}$ is dependent on $X_{1}$, our results in Theorems 1, 2, 3, and 4 hold with slight modifications in the proofs.

Remark 6.

In many practical situations, the variance of the $Y_{i}$’s is heterogeneous (i.e., $Var (ε_{1} | X_{1})$ depends on $X_{1}$). A natural candidate for $S_{n}$ in such a case is $S_{n}^{'} ≜ E (Var (ε_{1} | X_{1})) = E (Var (Y_{1} | X_{1})) = \int_{{[a, b]}^{d}} Var (Y_{1} | x) τ (x) d x$, where $τ (\cdot)$ is the density function of $X_{1}$. When the $X_{i}$’s are uniformly distributed, $S_{n}^{'}$ can be estimated by computing the average of $Var (Y_{i} | X_{i})$ across $i \in \{1, \dots, n\}$.

6. Numerical Results

In this section, we observe the numerical behavior of the solutions to Problems (A), (B), and (C). To describe the detailed procedure, we start by noticing that, in practice, we can often observe $Y = f_{*} (X) + ε$ multiple times at a fixed design point $X \in {[a, b]}^{d}$. Thus, we assume that we can collect iid observations $((X_{i}, Y_{i j}) : 1 \leq i \leq n, 1 \leq j \leq r)$, where $Y_{i j} = f_{*} (X_{i}) + ε_{i j}$ for $1 \leq i \leq n$ and $1 \leq j \leq r$ and the $ε_{i j}$’s, conditional on $X_{i}$, are iid copies of $ε$ conditional on X. We then compute the average of the $Y_{i j}$’s across multiple replications (i.e., ${\bar{Y}}_{i} = \sum_{j = 1}^{r} Y_{i j} / r$ for each $i \in \{1, \dots, n\}$). Next, the ${\bar{Y}}_{i}$’s will be used in place of the $Y_{i}$’s in Problems (A), (B), and (C).

Next, we provide more detailed descriptions on how the solutions to Problems (A)–(C) are obtained.

6.1. Problem (B)

In order to estimate $S_{n}$, we observe that Remark 6 suggests using the average of the $Var (Y_{i} | X_{i})$’s as an estimate of $S_{n}$ when the $X_{i}$’s are uniformly distributed. Because we use the ${\bar{Y}}_{i}$’s in place of the $Y_{i}$’s, we use the average of $Var ({\bar{Y}}_{i} | X_{i}) = (1 / r) Var (Y_{i} | X_{i})$ across $i \in \{1, \dots, n\}$ as an estimate of $S_{n}$. Therefore, our estimate of $S_{n}$ is

\frac{1}{n r} \sum_{i = 1}^{n} {\hat{S}}_{i},

(14)

where ${\hat{S}}_{i} = \sum_{j = 1}^{r} {(Y_{i j} - {\bar{Y}}_{i})}^{2} / (r - 1)$ is the sample variance of the r replications observed at $X_{i}$ for $1 \leq i \leq n$.

We then solve the convex programming problem in Proposition 7 with the $Y_{i}$ ’s replaced by the ${\bar{Y}}_{i}$ ’s and $S_{n}$ replaced by (14) using CVX, a package for solving convex programs developed by Grant and Boyd [22]. The solution obtained this way is denoted by ${\hat{g}}_{n}$ , and its objective value $J ({\hat{g}}_{n})$ is denoted by ${\hat{U}}_{n}$ . ${\hat{U}}_{n}$ is used as an estimate of $U_{n}$ when solving Problem (A).
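The replication averages ${\bar{Y}}_{i}$ and the estimate (14) of $S_{n}$ can be computed as in the following minimal sketch. The function name `estimate_S_n` and the nested-list data layout are assumptions made for illustration; the convex program itself is not solved here.

```python
import statistics

def estimate_S_n(Y):
    """Y[i][j] is the j-th replication observed at design point X_i.

    Returns the list of replication averages Ybar_i and the estimate
    (1 / (n r)) * sum_i S_hat_i from (14).
    """
    n = len(Y)
    r = len(Y[0])
    Ybar = [statistics.fmean(row) for row in Y]
    # statistics.variance divides by r - 1, matching the definition of S_hat_i
    S_hat = [statistics.variance(row) for row in Y]
    return Ybar, sum(S_hat) / (n * r)

# Tiny hand-checkable example with n = 2 design points and r = 2 replications.
Ybar, S_n = estimate_S_n([[1.0, 3.0], [2.0, 2.0]])
```

In this toy example the sample variances are 2 and 0, so the estimate is $(2 + 0) / (2 \cdot 2) = 0.5$.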

6.2. Problem (A)

We next solve the convex programming problem in Proposition 6 by using ${\hat{U}}_{n} = J ({\hat{g}}_{n})$ as an estimate of $U_{n}$ and the ${\bar{Y}}_{i}$ ’s in place of the $Y_{i}$ ’s. CVX is used to solve the convex programming formulation.

6.3. Problem (C)

The key issue here is how to determine the smoothing parameter $λ$ . We use cross validation to determine the smoothing parameter. We also use a fixed sequence of real numbers $(λ_{n} : n \geq 1)$ decreasing to zero as $n \to \infty$ . The smoothing parameter $λ$ chosen by cross validation depends on the observed data set $((X_{i}, Y_{i j}) : 1 \leq i \leq n, 1 \leq j \leq r)$ , whereas $λ_{n}$ does not depend on the observed data set and depends on n only.

To define the cross validation function, let $f_{λ}^{[k]}$ be the minimizer of

\frac{1}{n} \sum_{i = 1, i \neq k}^{n} ({\bar{Y}}_{i} - f (X_{i}))^{2} + λ J (f)

over $f \in F_{m}$. Then, the cross validation function $V (λ)$ is defined by

V (λ) = \frac{1}{n} \sum_{k = 1}^{n} {({\bar{Y}}_{k} - f_{λ}^{[k]} (X_{k}))}^{2}

(15)

for $λ \geq 0$, and the estimator of the smoothing parameter based on cross validation is the minimizer of $V (λ)$ over $λ \geq 0$. We call this estimator ${\hat{λ}}_{C V}$. The solution to Problem (C) based on cross validation is computed by solving Problem (C) with $λ$ replaced by ${\hat{λ}}_{C V}$ and the $Y_{i}$’s replaced by the ${\bar{Y}}_{i}$’s; see Wahba [71, equations (2.4.23) and (2.4.24) on p. 33] for the details on how Problem (C) is solved using systems of linear equations.
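The leave-one-out structure of (15) can be sketched generically as follows. This is only an illustration of the mechanics: the names `cv_score`, `select_lambda`, `lambda_grid`, and `toy_fit` are hypothetical, the grid is one reading of the partially elided grid used in Sections 6.5 and 6.6, and `toy_fit` is a deliberately trivial stand-in (a penalized constant fit) rather than the smoothing spline solver of Wahba [71].

```python
def lambda_grid():
    """One reading of the grid {1e-11, 5e-11, 1e-10, 5e-10, ..., 1} (an assumption)."""
    grid = []
    for k in range(-11, 0):
        grid.append(float(f"1e{k}"))
        grid.append(float(f"5e{k}"))
    grid.append(1.0)
    return grid

def cv_score(X, Ybar, lam, fit):
    """V(lam) in (15): average squared leave-one-out prediction error."""
    n = len(X)
    total = 0.0
    for k in range(n):
        X_train = X[:k] + X[k + 1:]        # drop the k-th design point
        Y_train = Ybar[:k] + Ybar[k + 1:]
        f_k = fit(X_train, Y_train, lam, n)  # plays the role of f_lam^{[k]}
        total += (Ybar[k] - f_k(X[k])) ** 2
    return total / n

def select_lambda(X, Ybar, fit, grid=None):
    """Return the grid point minimizing V(lam)."""
    grid = lambda_grid() if grid is None else grid
    return min(grid, key=lambda lam: cv_score(X, Ybar, lam, fit))

def toy_fit(X_train, Y_train, lam, n):
    """Stand-in fitter: minimizes (1/n) sum (Y_i - c)^2 + lam * c^2 over constants c."""
    c = sum(Y_train) / (len(Y_train) + n * lam)
    return lambda x: c

X = [0.1, 0.3, 0.5, 0.7, 0.9]
best = select_lambda(X, [2.0] * 5, toy_fit)
```

Any fitter with the same call signature, for example one that solves the linear system for Problem (C), could be passed in place of `toy_fit`.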

We also solve Problem (C) with the $Y_{i}$ ’s replaced by the ${\bar{Y}}_{i}$ ’s and $λ$ replaced by $λ_{n}$ for $n \geq 1$ . The specific choice of $(λ_{n} : n \geq 1)$ is given in Sections 6.5 and 6.6.

6.4. How to Select m

One needs a priori knowledge of the smoothness of $f_{*}$ to determine the right value of m. It should be noted that the smoothness of a function $g : R \to R$ is measured by $\int_{- \infty}^{\infty} {g^{(2)} (x)}^{2} d x$. Thus, if $f_{*}$ itself is believed to be smooth, then we set $m = 2$. If $f_{*}$ is believed to have smooth partial derivatives of order 2, then we set $m = 4$.

6.5. M/M/1 Queue

We consider the case where $f_{*} (x)$ is the steady-state mean waiting time of a customer in a single-server queue under the first come, first served discipline with infinite capacity buffer, exponential service times, exponential interarrival times, unit arrival rate, and a service rate of $x \in [1.5, 2.0]$. $f_{*} (x)$ is known to be $1 / (x (x - 1))$; see Hillier and Lieberman [31, p. 781]. We chose an M/M/1 queue because there exists a closed-form formula for the steady-state waiting time of a customer, so we can compare our estimators with the true values. We set $X_{i} = 1.5 + i / (2 n) - 1 / (4 n)$ for $1 \leq i \leq n$. For each $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, r\}$ with $r = 100$, we generated $Y_{i j}$ by averaging the waiting times of the first 1,000 customers arriving at the queue when the queue is initialized empty and idle. We next computed the solution ${\hat{g}}_{n}$ to Problem (B) and the solution ${\hat{f}}_{n}$ to Problem (A) with the $Y_{i}$’s replaced by the ${\bar{Y}}_{i}$’s and $m = 4$. $S_{n}$ is estimated from (14), and $U_{n}$ is estimated from $J ({\hat{g}}_{n})$. Both Problems (A) and (B) are solved using CVX.
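One way to generate a single observation $Y_{i j}$ as described above is the Lindley recursion for waiting times in queue. The sketch below is an assumption about how such a simulation might be coded, not the code used for the reported experiments; the names `f_star`, `design_points`, and `mean_wait` are hypothetical, and "waiting time" is taken to exclude the service time, which is the reading consistent with $f_{*} (x) = 1 / (x (x - 1))$.

```python
import random

def f_star(x):
    """Steady-state mean waiting time in queue for unit arrival rate and service rate x."""
    return 1.0 / (x * (x - 1.0))

def design_points(n):
    """X_i = 1.5 + i/(2n) - 1/(4n) for i = 1, ..., n (midpoints of n equal cells)."""
    return [1.5 + i / (2 * n) - 1 / (4 * n) for i in range(1, n + 1)]

def mean_wait(service_rate, n_customers=1000, rng=random):
    """Average waiting time of the first n_customers, starting empty and idle."""
    wait = 0.0   # the first customer finds the system empty and does not wait
    total = 0.0
    for _ in range(n_customers):
        total += wait
        service = rng.expovariate(service_rate)
        interarrival = rng.expovariate(1.0)  # unit arrival rate
        # Lindley recursion: waiting time of the next customer
        wait = max(0.0, wait + service - interarrival)
    return total / n_customers

# One replication Y_ij at each design point, with a seeded generator for repeatability.
rng = random.Random(0)
Y_one_replication = [mean_wait(x, rng=rng) for x in design_points(15)]
```

Because the queue starts empty, each replication is a slightly biased observation of the steady-state value; averaging many replications at a design point reduces the noise but not this initialization bias.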

For Problem (C), we computed ${\hat{λ}}_{C V}$ by minimizing $V (λ)$ in (15) over $λ \in \{10^{- 11}, 5 \cdot 10^{- 11}, 10^{- 10}, 5 \cdot 10^{- 10}, \dots, 1\}$. In addition to ${\hat{λ}}_{C V}$, we used three sequences of real numbers, $λ (1) = (10^{- 6} / n : n \geq 1)$, $λ (2) = (10^{- 7} / n : n \geq 1)$, and $λ (3) = (10^{- 8} / n : n \geq 1)$, for $λ$. To select $λ (2)$, we tried several sequences of real numbers and selected the one generating the lowest empirical integrated mean square error (EIMSE) values for $f_{*}$ and $f_{*}^{(2)}$. The rate of decay for $(λ_{n} : n \geq 1)$ was chosen to be of order $1 / n$ based on the suggestion made by Cox [13].

To measure the accuracy of ${\hat{g}}_{n}$ , the solution to Problem (B), we computed the EIMSE of ${\hat{g}}_{n}$ as follows:

\frac{1}{n} \sum_{i = 1}^{n} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))^{2} .

We also computed the EIMSE of ${\hat{g}}_{n}^{(2)}$ as follows:

\frac{1}{n} \sum_{i = 1}^{n} ({\hat{g}}_{n}^{(2)} (X_{i}) - f_{*}^{(2)} (X_{i}))^{2},

where ${\hat{g}}_{n}^{(2)}$ and $f_{*}^{(2)}$ denote the second derivatives of ${\hat{g}}_{n}$ and $f_{*}$, respectively. The EIMSE values of the solutions to Problems (A) and (C) and those of their second derivatives are computed similarly.

Table 1 reports the 95% confidence intervals of the EIMSE values of the estimators of $f_{*}$ computed from Problems (A)–(C) based on 400 iid replications for each n value. The columns labeled as (A) and (B) report the results obtained from Problems (A) and (B), respectively. The columns labeled as CV, $λ (1)$ , $λ (2)$ , and $λ (3)$ report the results obtained from Problem (C) when $λ$ is set as ${\hat{λ}}_{C V}$ , $λ (1)$ , $λ (2)$ , and $λ (3)$ , respectively. The values in Table 1 are measured in $10^{- 4}$ . Table 2 reports the 95% confidence intervals of the EIMSE values of the estimators of $f_{*}^{(2)}$ computed from Problems (A)–(C) based on 400 iid replications for each n value.
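The EIMSE of an estimator and a confidence interval across replications can be computed as in the sketch below. The function names are hypothetical, and the interval uses the usual normal approximation (mean plus or minus 1.96 standard errors), which is an assumption; the text does not state how the reported intervals were constructed.

```python
import math
import statistics

def eimse(estimate, truth, X):
    """(1/n) * sum_i (estimate(X_i) - truth(X_i))^2 over the design points X."""
    return sum((estimate(x) - truth(x)) ** 2 for x in X) / len(X)

def confidence_interval_95(values):
    """Return (center, half_width) of a normal-approximation 95% interval."""
    center = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return center, half_width

# Hand-checkable example: an estimate that is off by exactly 1 everywhere.
example_eimse = eimse(lambda x: x + 1.0, lambda x: x, [0.0, 1.0, 2.0])
center, half_width = confidence_interval_95([1.0, 2.0, 3.0])
```

In the experiments, `values` would hold one EIMSE value per independent replication (400 of them for each n), and the tables report the result in the form center $\pm$ half width.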

Table 1. The 95% confidence intervals of the EIMSE values for the estimators of $f_{*}$ computed from Problems (A)–(C) when $f_{*}$ is the steady-state mean waiting time of a customer in an M/M/1 queue. All values are measured in $10^{- 4}$ .

n	(A)	(B)	(C): CV	(C): $λ (1)$	(C): $λ (2)$	(C): $λ (3)$
15	$1.21 \pm 0.09$	$1.14 \pm 0.09$	$1.24 \pm 0.10$	$1.13 \pm 0.09$	$1.13 \pm 0.09$	$1.16 \pm 0.09$
25	$0.96 \pm 0.07$	$0.87 \pm 0.06$	$0.93 \pm 0.06$	$0.86 \pm 0.06$	$0.86 \pm 0.06$	$0.86 \pm 0.06$
35	$0.75 \pm 0.05$	$0.67 \pm 0.05$	$0.70 \pm 0.05$	$0.65 \pm 0.04$	$0.65 \pm 0.05$	$0.67 \pm 0.05$

Table 2. The 95% confidence intervals of the EIMSE values for the estimators of $f_{*}^{(2)}$ computed from Problems (A)–(C) when $f_{*}$ is the steady-state mean waiting time of a customer in an M/M/1 queue.

n	(A)	(B)	(C): CV	(C): $λ (1)$	(C): $λ (2)$	(C): $λ (3)$
15	$7.52 \pm 0.42$	$6.06 \pm 0.43$	$19.63 \pm 3.70$	$5.34 \pm 0.42$	$5.32 \pm 0.44$	$7.07 \pm 0.74$
25	$6.79 \pm 0.36$	$5.07 \pm 0.33$	$14.73 \pm 2.85$	$4.35 \pm 0.29$	$4.21 \pm 0.33$	$5.84 \pm 0.67$
35	$5.87 \pm 0.29$	$4.30 \pm 0.24$	$9.38 \pm 1.77$	$3.53 \pm 0.20$	$3.28 \pm 0.24$	$4.45 \pm 0.50$

6.6. Stock Trader’s Problem

We consider the stock trader’s problem that motivated this paper. Consider a trader who trades stock options on a daily basis. Each day, his decision is either buy or sell a stock option, and he bases his decision on gamma, which is the second derivative of the value of the stock option as a function of the underlying stock price. The value of the stock option, $f_{*}$ , typically has no closed-form formula, so simulation is required to estimate the value of the stock option and its second derivative with respect to the underlying stock price. Because simulation takes a long time to conduct, the estimation of $f_{*}$ and $f_{*}^{(2)}$ is done a day before any decision is made. Because $f_{*}$ and $f_{*}^{(2)}$ depend on the underlying stock price and because we cannot precisely predict the underlying stock price for the next day, $f_{*}$ and $f_{*}^{(2)}$ are typically estimated over a range of possible values of the underlying stock price. Hence, the trader’s goal is to estimate gamma, $f_{*}^{(2)}$ , over a range of possible values of the underlying stock price. Furthermore, in his estimation, he does not want to use any parameter that is not estimated from the data set.

To place this problem in a more specific setting, we consider the case where $f_{*} (x)$ is the price of a European call option on a nondividend-paying stock when the underlying stock price is $x \in [0, 2]$. We chose a European call option because there exists a closed-form formula for its price, so we can compare our estimates with the true values. Specifically, we consider the case where the strike price is $K = 1.3$, the risk-free annual interest rate is $s = 0.03$, the stock price volatility is $σ = 0.3$, the underlying stock has a drift of 0.03, and the time to maturity of the option is $T = 1$ year. It is known that the price of this option is given by

f_{*} (x) = x N (\frac{\ln (x / K) + (s + σ^{2} / 2) T}{σ \sqrt{T}}) - K e^{- s T} N (\frac{\ln (x / K) + (s - σ^{2} / 2) T}{σ \sqrt{T}})

for $x \in [0, 2]$, where $N (\cdot)$ is the cumulative probability distribution function of a standard normal random variable; see, for example, Hull [32, p. 295].
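For reference, the closed-form price above and the corresponding gamma can be evaluated as in the following sketch, which is what the estimators are compared against. The function names are hypothetical. The gamma expression $N' (d_{1}) / (x σ \sqrt{T})$ is the standard closed form for this pricing formula rather than something stated in the text, and the code assumes $x > 0$ because the logarithm is undefined at $x = 0$.

```python
import math

STRIKE, RATE, VOL, MATURITY = 1.3, 0.03, 0.3, 1.0  # K, s, sigma, T from the text

def norm_cdf(u):
    """Standard normal cumulative distribution function N(u)."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def norm_pdf(u):
    """Standard normal density N'(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def d1(x):
    """First argument of N in the pricing formula (requires x > 0)."""
    return ((math.log(x / STRIKE) + (RATE + VOL ** 2 / 2.0) * MATURITY)
            / (VOL * math.sqrt(MATURITY)))

def call_price(x):
    """f_*(x): price of the European call when the stock price is x > 0."""
    a = d1(x)
    b = a - VOL * math.sqrt(MATURITY)  # second argument of N in the formula
    return x * norm_cdf(a) - STRIKE * math.exp(-RATE * MATURITY) * norm_cdf(b)

def call_gamma(x):
    """Second derivative of f_* with respect to x (standard closed form)."""
    return norm_pdf(d1(x)) / (x * VOL * math.sqrt(MATURITY))
```

A quick sanity check is to compare `call_gamma(x)` with a central second difference of `call_price` at the same point.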

To compute our estimators, we set $X_{i} = 2 i / n - 1 / n$ for $1 \leq i \leq n$. For each $X_{i}$, we assumed that the current stock price is $X_{i}$ and simulated a sample path $(S_{t} : 0 \leq t \leq T)$ of a geometric Brownian motion up to time T with a drift of 0.03 and a volatility of 0.3. We then computed $Y_{i j}$, a noisy observation of the price of the option, as $\exp (- s T) \max (0, S_{T} - K)$, where $S_{T}$ is the value of the geometric Brownian motion we generated at time T. This procedure was repeated $r = 5{,}000$ times, generating $((X_{i}, Y_{i j}) : 1 \leq i \leq n, 1 \leq j \leq r)$. We next computed the solutions to Problems (A) and (B) with the $Y_{i}$’s replaced by the ${\bar{Y}}_{i}$’s and $m = 4$. $S_{n}$ is estimated from (14), and $U_{n}$ is estimated from $J ({\hat{g}}_{n})$. Both Problems (A) and (B) are solved with CVX.
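The simulation step can be sketched as follows. This is an illustrative simplification rather than the code behind the reported results: because the payoff depends only on $S_{T}$, the sketch samples the terminal value of the geometric Brownian motion exactly instead of generating a whole path, and the function name and its keyword arguments are hypothetical.

```python
import math
import random

def simulate_discounted_payoffs(x0, n_reps, rng, strike=1.3, rate=0.03,
                                drift=0.03, vol=0.3, maturity=1.0):
    """Return n_reps draws of exp(-s T) * max(0, S_T - K) with S_0 = x0."""
    payoffs = []
    for _ in range(n_reps):
        z = rng.gauss(0.0, 1.0)
        # exact terminal value of a geometric Brownian motion started at x0
        s_T = x0 * math.exp((drift - 0.5 * vol ** 2) * maturity
                            + vol * math.sqrt(maturity) * z)
        payoffs.append(math.exp(-rate * maturity) * max(0.0, s_T - strike))
    return payoffs

n = 15
X = [2 * i / n - 1 / n for i in range(1, n + 1)]   # design points X_i
rng = random.Random(0)
Y = [simulate_discounted_payoffs(x, 5000, rng) for x in X]
Ybar = [sum(row) / len(row) for row in Y]          # replication averages
```

Since the drift equals the risk-free rate here, each ${\bar{Y}}_{i}$ is an unbiased Monte Carlo estimate of the closed-form price at $X_{i}$.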

For Problem (C), we computed ${\hat{λ}}_{C V}$ by minimizing $V (λ)$ in (15) over $λ \in \{10^{- 11}, 5 \cdot 10^{- 11}, 10^{- 10}, 5 \cdot 10^{- 10}, \dots, 1\}$. In addition to ${\hat{λ}}_{C V}$, we used three sequences of real numbers, $λ (4) = (10^{- 5} / n : n \geq 1)$, $λ (5) = (10^{- 6} / n : n \geq 1)$, and $λ (6) = (10^{- 7} / n : n \geq 1)$, for $λ$. To select $λ (5)$, we tried several sequences of real numbers and selected the one generating the lowest EIMSE values for $f_{*}$ and $f_{*}^{(2)}$. The rate of decay for $(λ_{n} : n \geq 1)$ was chosen to be of order $1 / n$ based on the suggestion made by Cox [13].

Table 3 reports the 95% confidence intervals of the EIMSE values of the estimators of $f_{*}$ computed from Problems (A)–(C) based on 400 iid replications for each n value. The columns labeled as (A) and (B) report the results obtained from Problems (A) and (B), respectively. The columns labeled as CV, $λ (4)$ , $λ (5)$ , and $λ (6)$ report the results obtained from Problem (C) when $λ$ is set as ${\hat{λ}}_{C V}$ , $λ (4)$ , $λ (5)$ , and $λ (6)$ , respectively. The values in Table 3 are measured in $10^{- 5}$ .

Table 3. The 95% confidence intervals of the EIMSE values for the estimators of $f_{*}$ computed from Problems (A)–(C) when $f_{*}$ is the price of a European call option. All values are measured in $10^{- 5}$ .

n	(A)	(B)	(C): CV	(C): $λ (4)$	(C): $λ (5)$	(C): $λ (6)$
15	$5.24 \pm 0.15$	$1.72 \pm 0.11$	$1.22 \pm 0.10$	$1.23 \pm 0.08$	$0.94 \pm 0.08$	$1.02 \pm 0.08$
25	$1.14 \pm 0.07$	$1.14 \pm 0.07$	$0.80 \pm 0.06$	$0.82 \pm 0.04$	$0.60 \pm 0.05$	$0.66 \pm 0.05$
35	$0.88 \pm 0.05$	$0.88 \pm 0.05$	$0.54 \pm 0.04$	$0.60 \pm 0.03$	$0.41 \pm 0.03$	$0.46 \pm 0.03$

Table 4 reports the 95% confidence intervals of the EIMSE values of the estimators of $f_{*}^{(2)}$ computed from Problems (A)–(C) based on 400 iid replications for each n value. The values in Table 4 are measured in $10^{- 1}$.

Table 4. The 95% confidence intervals of the EIMSE values for the estimators of $f_{*}^{(2)}$ computed from Problems (A)–(C) when $f_{*}$ is the price of a European call option. All values are measured in $10^{- 1}$ .

n	(A)	(B)	(C): CV	(C): $λ (4)$	(C): $λ (5)$	(C): $λ (6)$
15	$0.96 \pm 0.03$	$0.57 \pm 0.05$	$1.72 \pm 0.32$	$0.46 \pm 0.01$	$0.36 \pm 0.04$	$0.90 \pm 0.12$
25	$0.44 \pm 0.03$	$0.45 \pm 0.03$	$2.80 \pm 0.58$	$0.35 \pm 0.01$	$0.25 \pm 0.03$	$0.68 \pm 0.09$
35	$0.36 \pm 0.01$	$0.36 \pm 0.01$	$2.89 \pm 0.66$	$0.28 \pm 0.01$	$0.18 \pm 0.02$	$0.53 \pm 0.06$

6.7. Observations from Numerical Experiments

When we compare Problems (A) and (B) with Problem (C) with the smoothing parameter estimated from cross validation, Problems (A) and (B) show superior performance when estimating the second derivative of $f_{*}$ (compare the (A) and (B) columns with the CV column in Tables 2 and 4). Problem (C) generates extreme estimates a few percent of the time, thereby degrading its overall performance. This phenomenon has been observed by other researchers; see the discussion in Wahba [71, p. 65].

When we compare Problems (A) and (B) with Problem (C) with a fixed smoothing parameter $λ_{n}$ for $n \geq 1$, Problem (C) shows superior performance. However, it should be noted that to select $λ_{n}$, we tried several sequences of real numbers and selected the one generating the lowest EIMSE values for $f_{*}$ and $f_{*}^{(2)}$. This was possible because we already knew what $f_{*}$ and $f_{*}^{(2)}$ were in all of our numerical examples. In reality, one does not have a priori knowledge of the exact values of $f_{*} (X_{i})$ and $f_{*}^{(2)} (X_{i})$ for $1 \leq i \leq n$, so finding $(λ_{n} : n \geq 1)$ this way is not feasible. However, in our numerical experiments, the results from Problem (C) with a fixed smoothing parameter $λ_{n}$ give us a sense of how well Problems (A) and (B) perform compared with the best possible performance of Problem (C). Tables 1–4 report that the performance of Problems (A) and (B) is of the same order of magnitude as that of Problem (C) with the best possible choice of $(λ_{n} : n \geq 1)$, even though no parameter of Problems (A) and (B) is tuned using knowledge of $f_{*}$.

6.8. When Multiple Observations Cannot Be Made at a Fixed Design Point in ${[a, b]}^{d}$

Throughout Section 6, we assumed that multiple observations can be made at any fixed design point $x$ in ${[a, b]}^{d}$ . These multiple observations were used to empirically estimate $S_{n}$ . In many applications, multiple observations at a point $x \in {[a, b]}^{d}$ may not be available. In such cases, one needs to use other ways to empirically estimate $S_{n}$ . One possible approach is partitioning ${[a, b]}^{d}$ into a finite number of hyperrectangles, computing the sample variance of the $Y_{i}$ ’s whose $X_{i}$ ’s fall within each hyperrectangle, and computing the average of these sample variances (weighted by the density function of X evaluated at each hyperrectangle) as an estimate of $S_{n}$ . This approach is based on the observation that $S_{n}$ should ideally be $E (Var (Y | X))$ .

To illustrate this approach numerically, we consider the following example: $f_{*} : [0, 1] \to R$ is given by $f_{*} (x) = {(x - 1 / 4)}^{2}$ for $x \in [0, 1]$ , $m = 4$ , $X_{i} = i / n - 1 / (2 n)$ for $1 \leq i \leq n$ , and $Y_{i} = f_{*} (X_{i}) + ε_{i}$ for $1 \leq i \leq n$ , where the $ε_{i}$ ’s follow a uniform distribution over $[- 0.25, 0.25]$ . We assumed that only one observation $Y_{i} = f_{*} (X_{i}) + ε_{i}$ is made for $1 \leq i \leq n$ . To estimate $S_{n}$ , we partitioned [0, 1] into the five intervals $I^{1} = [0, 0.2], I^{2} = [0.2, 0.4], I^{3} = [0.4, 0.6], I^{4} = [0.6, 0.8]$ , and $I^{5} = [0.8, 1.0]$ and computed the sample variances, say $S^{1}, S^{2}, S^{3}, S^{4},$ and $S^{5}$ , in which $S^{k}$ is the sample variance of the $Y_{i}$ ’s whose $X_{i}$ ’s fall in $I^{k}$ for $1 \leq k \leq 5$ . We set $S_{n}$ equal to the average of $S^{1}, S^{2}, S^{3}, S^{4}$ , and $S^{5}$ . We then solved Problem (B) to obtain ${\hat{g}}_{n}$ and ${\hat{g}}_{n}^{(2)}$ . This procedure was repeated 1,600 times independently. Using these 1,600 replications, we computed the 95% confidence intervals of the EIMSE values of ${\hat{g}}_{n}$ and ${\hat{g}}_{n}^{(2)}$ , respectively. Table 5 reports the 95% confidence intervals for a variety of n values. The EIMSE values in Table 5 decrease as n increases, indicating that the choice of $S_{n}$ is appropriate.

Table 5. The 95% confidence intervals of the EIMSE values for ${\hat{g}}_{n}$ and ${\hat{g}}_{n}^{(2)}$ when $S_{n}$ was set as the average of sample variances across five intervals.

n	EIMSE for ${\hat{g}}_{n}$	EIMSE for ${\hat{g}}_{n}^{(2)}$
30	$0.0028 \pm 0.0001$	$6.68 \pm 0.42$
40	$0.0021 \pm 0.0001$	$5.36 \pm 0.37$
50	$0.0016 \pm 0.0001$	$4.03 \pm 0.28$
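The partition-based estimate of $S_{n}$ used for Table 5 can be sketched as follows. The function name `partition_variance_estimate` is hypothetical, the unweighted average is appropriate only for an evenly spread design such as the one in this example, and skipping cells with fewer than two observations is an assumption added to keep the sample variance defined.

```python
import random
import statistics

def partition_variance_estimate(X, Y, n_bins=5, lo=0.0, hi=1.0):
    """Average of the within-cell sample variances of Y over n_bins equal cells of [lo, hi]."""
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(X, Y):
        # index of the cell containing x; the right endpoint goes to the last cell
        k = min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)
        bins[k].append(y)
    variances = [statistics.variance(b) for b in bins if len(b) >= 2]
    return statistics.fmean(variances)

# The example of this subsection: f_*(x) = (x - 1/4)^2 with uniform noise on [-0.25, 0.25].
rng = random.Random(0)
n = 50
X = [i / n - 1 / (2 * n) for i in range(1, n + 1)]
Y = [(x - 0.25) ** 2 + rng.uniform(-0.25, 0.25) for x in X]
S_n = partition_variance_estimate(X, Y)
```

The noise variance in this example is $0.5^{2} / 12 = 1 / 48$. The estimate also absorbs the variation of $f_{*}$ within each interval, so it tends to be slightly larger than the noise variance; this gap shrinks as the intervals become narrower.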

7. Conclusions

We conclude this paper with some discussions on future research topics.

7.1. How to Choose $S_{n}$ When Multiple Observations Are Not Available

In Section 6.8, we discussed a way to empirically estimate $S_{n}$ when multiple observations are not available at a design point in ${[a, b]}^{d}$ . In this approach, we divide ${[a, b]}^{d}$ into smaller hyperrectangles, compute the sample variance of the $Y_{i}$ ’s whose $X_{i}$ ’s fall within each hyperrectangle, and compute the average of these sample variances (weighted by the density function of X evaluated at each hyperrectangle) as an estimate of $S_{n}$ . We conjecture that this approach will generate an asymptotically consistent estimator if the volume of the hyperrectangles decreases to zero and the number of $X_{i}$ ’s within each hyperrectangle increases to infinity as $n \to \infty$ . A more rigorous study on this conjecture is a good future research topic.

7.2. Combining the Smoothness Condition with Shape Constraints

Because our proposed formulations, Problems (A) and (B), impose only the smoothness condition on the underlying function $f_{*}$ , the proposed estimators do not necessarily produce a function that satisfies additional shape conditions on $f_{*}$ , such as convexity or monotonicity. For example, even if $f_{*}$ is convex, our estimators are not necessarily convex. Figure 1 shows the solution to Problem (B) when $f_{*} : [0, 1] \to R$ is given by $f_{*} (x) = {(x - 1 / 4)}^{2}$ for $x \in [0, 1]$ , $m = 4$ , $n = 10$ , $X_{i} = i / n - 1 / (2 n)$ for $1 \leq i \leq n$ , and $Y_{i} = f_{*} (X_{i}) + ε_{i}$ for $1 \leq i \leq n$ , where the $ε_{i}$ ’s follow a uniform distribution over $[- 0.5, 0.5]$ and $S_{n} = 1 / 12$ . Even if $f_{*}$ is convex, ${\hat{g}}_{n}$ is not necessarily convex.

**Figure 1. The solid line is $f_{*}$ , the dots are the $Y_{i}$ ’s, and the dashed line is the solution to Problem (B).**

One possible remedy is to incorporate additional conditions into our formulation. A good topic for future research is combining shape constraints, such as convexity and monotonicity, with the smoothness condition in our proposed formulations.

7.3. Establishing Consistency Under a Condition That Is Easier to Verify

Corollary 2 is established under Assumption 7, which is hard to verify. We conjecture that the conclusions of Corollary 2 hold under the more natural condition that $S_{n} \to σ^{2}$ as $n \to \infty$. This remains an open question. Another open question is computing the convergence rate of ${\hat{g}}_{n}$ under suitable conditions on $S_{n}$.

7.4. Extending Our Formulations to Shape-Constrained Estimation Problems

Problem (A) can be seen as an estimation problem of a smooth function by minimizing the sum of squared errors under the smoothness condition $J (f) \leq U_{n}$ . This type of formulation has been used for shape-constrained estimation problems, such as convex regression. For example, Mazumder et al. [49] proposed estimating an unknown convex function by minimizing the sum of squared errors over convex functions satisfying a Lipschitz condition. An interesting question for future research is how to extend Problem (B) to shape-constrained problems, such as convex and isotonic regression problems.

Appendix A. Proofs of Propositions 1–7 and Lemmas 1 and 2

To establish the existence of the solutions to Problems (A) and (B), we define the following functions $φ_{n} : R^{n} \to R$ and $ϕ_{n} : R^{n} \to R$ . For any $z = (z_{1}, \dots, z_{n}) \in R^{n}$ , consider the following problem:

\text{Problem (I)}: \quad \text{Minimize}_{f \in F_{m}} \; J (f) \quad \text{subject to} \quad f (X_{i}) = z_{i}, \quad 1 \leq i \leq n .

Under Assumption 1, the solution to Problem (I) exists uniquely; see Meinguet [50, equation 29 and theorem 3 on p. 299]. We will denote the solution to Problem (I) by $f_{z}$ . Define $φ_{n} (z) = J (f_{z})$ . On the other hand, $ϕ_{n} : R^{n} \to R$ is defined by $ϕ_{n} (z) = (1 / n) \sum_{i = 1}^{n} {(Y_{i} - z_{i})}^{2}$ for $z = (z_{1}, \dots, z_{n}) \in R^{n}$ . It can be easily seen that $φ_{n}$ is convex over $R^{n}$ and that $ϕ_{n}$ is strictly convex over $R^{n}$ .

For any $S \geq 0$ and $U \geq 0$ , we define $V_{U} \subset R^{n}$ and $T_{S} \subset R^{n}$ by

V_{U} = {z \in R^{n} : φ_{n} (z) \leq U} and T_{S} = {z \in R^{n} : ϕ_{n} (z) \leq S} .

Both $V_{U}$ and $T_{S}$ are nonempty, closed, and convex subsets of $R^{n}$.

We recall that $Ψ_{n} : [0, \infty) \to R$ is defined as follows. For any nonnegative real number u, $Ψ_{n} (u) = E_{n} (f^{u})$, where $f^{u}$ is a solution to Problem (A) with $U_{n} = u$.

Proof of Proposition 1.

Because $V_{U}$ is nonempty, closed, and convex and because $ϕ_{n}$ is strictly convex over $V_{U}$ , there exists a unique minimizer $z^{*} = (z_{1}^{*}, \dots, z_{n}^{*})$ of $ϕ_{n}$ over $V_{U}$ and $f_{z^{*}}$ is a solution to Problem (A).

To prove the second part, let ${\hat{f}}_{n}$ and ${\tilde{f}}_{n}$ be two solutions to Problem (A). Then, $({\hat{f}}_{n} (X_{1}), \dots, {\hat{f}}_{n} (X_{n}))$ and $({\tilde{f}}_{n} (X_{1}), \dots, {\tilde{f}}_{n} (X_{n}))$ are two minimizers of $ϕ_{n}$ over $V_{U}$. By the uniqueness of the minimizer of $ϕ_{n}$ over $V_{U}$, it follows that ${\hat{f}}_{n} (X_{i}) = {\tilde{f}}_{n} (X_{i})$ for $1 \leq i \leq n$. $□$

Proof of Proposition 2.

Because $T_{S}$ is nonempty, closed, and convex and because $φ_{n}$ is convex, there exists a minimizer $z^{*} = (z_{1}^{*}, \dots, z_{n}^{*})$ of $φ_{n}$ over $T_{S}$ and $f_{z^{*}}$ is a solution to Problem (B).

To prove the second part, we note that for any $f \in F_{m}$ , $J (f) = ‖ f ‖_{m}^{2}$ , and $‖ \cdot ‖_{m}$ is a seminorm in $F_{m}$ and a norm in $F_{m} / P_{m - 1}$ . We will denote an equivalence class in $F_{m} / P_{m - 1}$ by [f] for $f \in F_{m}$ .

Let $y^{*}$ and $z^{*}$ be minimizers of $φ_{n}$ over $T_{S}$. Because

‖ [f_{1 / 2 y^{*} + 1 / 2 z^{*}}] ‖_{m} \leq ‖ 1 / 2 [f_{y^{*}}] + 1 / 2 [f_{z^{*}}] ‖_{m} \leq (1 / 2) ‖ [f_{y^{*}}] ‖_{m} + (1 / 2) ‖ [f_{z^{*}}] ‖_{m} = ‖ [f_{y^{*}}] ‖_{m}

and

‖ [f_{y^{*}}] ‖_{m} \leq ‖ [f_{1 / 2 y^{*} + 1 / 2 z^{*}}] ‖_{m},

we can conclude

‖ 1 / 2 [f_{y^{*}}] + 1 / 2 [f_{z^{*}}] ‖_{m} = (1 / 2) ‖ [f_{y^{*}}] ‖_{m} + (1 / 2) ‖ [f_{z^{*}}] ‖_{m},

which implies either $[f_{y^{*}}] = 0$ or $[f_{y^{*}}] = c [f_{z^{*}}]$ for some constant c. In the latter case, c must be equal to one because we have $‖ f_{y^{*}} ‖_{m} = ‖ f_{z^{*}} ‖_{m}$. In either case, $f_{y^{*}} - f_{z^{*}} \in P_{m - 1}$. $□$

Proof of Lemma 1.

The existence of $f_{P}$ follows from the fact that

\frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - (c_{1} p_{1} (X_{i}) + \dots + c_{M} p_{M} (X_{i})))^{2}

(A.1)

is convex over $(c_{1}, \dots, c_{M}) \in R^{M}$.

The only nontrivial part is that $Ψ_{n}$ is strictly decreasing over $[0, J (f_{I})]$ . Let $0 \leq x < y \leq J (f_{I})$ be given. We need to show that $Ψ_{n} (x) > Ψ_{n} (y)$ . Suppose, on the contrary, that

Ψ_{n} (x) = Ψ_{n} (y) .

(A.2)

Let $f^{x}$ and $f^{y}$ be the solutions to Problem (A) with $U = x$ and $U = y$, respectively. Note that $(f^{x} (X_{1}), \dots, f^{x} (X_{n}))$ belongs to $V_{y}$ and minimizes $ϕ_{n}$ over $V_{y}$ because of (A.2).

We will construct an n-dimensional ball around $(f^{x} (X_{1}), \dots, f^{x} (X_{n}))$ that is contained in $V_{y}$ . If this is proven, it will follow that $f^{x} (X_{i}) = Y_{i}$ for $1 \leq i \leq n$ , and hence, $f^{x}$ interpolates the $Y_{i}$ ’s, so $J (f_{I}) \leq J (f^{x})$ . Together with the facts $x < J (f_{I})$ and $J (f^{x}) \leq x$ , we will reach a contradiction.

To construct such a ball, we notice that the $X_{i}$’s are distinct, so we can find $δ_{0} > 0$ such that $‖ X_{i} - X_{j} ‖ > δ_{0}$ for $i \neq j$. For any $ϵ \in (0, ϵ_{0}]$, we will show that $(f^{x} (X_{1}) + ϵ, \dots, f^{x} (X_{n}) + ϵ)$ is in $V_{y}$. ($ϵ_{0} > 0$ will be specified later.) Define $θ_{ϵ} : R^{d} \to R$ by

θ_{ϵ} (z) = f^{x} (z) + ϵ (χ_{1} (z) + \dots + χ_{n} (z))

for $z \in R^{d}$, where the $χ_{i}$’s are the “mollifiers” defined by

χ_{i} (z) = \begin{cases} e \cdot \exp (δ_{0} / (‖ z - X_{i} ‖^{2} - δ_{0})), & \text{if } ‖ z - X_{i} ‖ < δ_{0} \\ 0, & \text{otherwise} \end{cases}

for $z \in R^{d}$ and $1 \leq i \leq n$. The $χ_{i}$’s have the following properties:

$χ_{i} (X_{i}) = 1$ and $χ_{i} (X_{j}) = 0$ for all $j \neq i$;
the $χ_{i}$ ’s are infinitely differentiable; and
$J (χ_{i}) \leq C$ for all $1 \leq i \leq n$ and some constant C.
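The first property of the mollifiers can be checked numerically. The sketch below implements the formula for $χ_{i}$ as stated, on hypothetical points in $R^{2}$; it additionally assumes $δ_{0} \leq 1$, so that the denominator $‖ z - X_{i} ‖^{2} - δ_{0}$ stays negative on the support. The names `chi`, `X`, and `delta0` are illustrative.

```python
import math

def chi(z, center, delta0):
    """Bump function: e * exp(delta0 / (r^2 - delta0)) if r < delta0, else 0,
    where r is the Euclidean distance between z and center."""
    r = math.dist(z, center)
    if r < delta0:
        return math.e * math.exp(delta0 / (r * r - delta0))
    return 0.0

# Hypothetical design points; pairwise distances are 1, 2, and sqrt(5).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
# delta0 must be below the minimum pairwise distance; we also keep delta0 <= 1.
delta0 = 0.5

# values[i][j] = chi_i(X_j); this should be (numerically) the identity matrix.
values = [[chi(xj, xi, delta0) for xj in X] for xi in X]
```

Because each bump vanishes at every other design point, adding $ϵ (χ_{1} + \dots + χ_{n})$ shifts all fitted values by exactly $ϵ$.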

It follows that $θ_{ϵ} (X_{i}) = f^{x} (X_{i}) + ϵ$ for $1 \leq i \leq n$ , and

\begin{array}{l} J (θ_{ϵ}) = ‖ θ_{ϵ} ‖_{m}^{2} \leq {(‖ f^{x} ‖_{m} + ϵ ‖ χ_{1} + \dots + χ_{n} ‖_{m})}^{2} \\ \leq J (f^{x}) + 2 ϵ n C^{1 / 2} ‖ f^{x} ‖_{m} + ϵ^{2} n^{2} C, \end{array}

so taking $ϵ_{0}$ sufficiently small yields $J (θ_{ϵ}) \leq J (f^{y}) \leq y$. Thus, $(f^{x} (X_{1}) + ϵ, \dots, f^{x} (X_{n}) + ϵ)$ is in $V_{y}$. $□$
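One explicit admissible value of $ϵ_{0}$ can be read off from the bound on $J (θ_{ϵ})$. The worked step below is a sketch, not the author's stated choice; it assumes $C > 0$ and uses only $J (f^{x}) \leq x < y$ together with $‖ χ_{1} + \dots + χ_{n} ‖_{m} \leq n C^{1 / 2}$.

```latex
\begin{aligned}
J(\theta_{\epsilon})
  &\le \bigl(\|f^{x}\|_{m} + \epsilon\, n\, C^{1/2}\bigr)^{2}
   \le \bigl(\sqrt{x} + \epsilon\, n\, C^{1/2}\bigr)^{2}, \\
\bigl(\sqrt{x} + \epsilon\, n\, C^{1/2}\bigr)^{2} \le y
  &\iff \epsilon \le \frac{\sqrt{y} - \sqrt{x}}{n\, C^{1/2}} =: \epsilon_{0} > 0 .
\end{aligned}
```

With this value, every $ϵ \in (0, ϵ_{0}]$ gives $J (θ_{ϵ}) \leq y$, so the shifted vector lies in $V_{y}$.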

Proof of Proposition 4.

Let $0 < S_{n} \leq E_{n} (f_{P})$ be given. By Proposition 2, there exists a solution ${\hat{g}}_{n}$ to Problem (B). Let $U^{*} = Ψ_{n}^{- 1} (S_{n})$ and $\tilde{h}$ be a solution to Problem (A) with $U_{n} = U^{*}$ . We will first show that ${\hat{g}}_{n}$ is also a solution to Problem (A) with $U_{n} = U^{*}$ and satisfies $E_{n} ({\hat{g}}_{n}) = S_{n}$ . By definition, $\tilde{h}$ minimizes $E_{n} (f)$ over all functions f satisfying $J (f) \leq U^{*}$ and satisfies

E_{n} (\tilde{h}) = S_{n}

(A.3)

and

J (\tilde{h}) \leq U^{*} .

(A.4)

By (A.3), $\tilde{h}$ is a feasible solution to Problem (B), so we should have

J ({\hat{g}}_{n}) \leq J (\tilde{h}) .

(A.5)

By (A.4) and (A.5), we have

J ({\hat{g}}_{n}) \leq U^{*},

(A.6)

and hence, ${\hat{g}}_{n}$ is a feasible solution to Problem (A) with $U_{n} = U^{*}$. Thus,

E_{n} (\tilde{h}) \leq E_{n} ({\hat{g}}_{n}) .

(A.7)

Also, ${\hat{g}}_{n}$ is a feasible solution to Problem (B), so we should have

E_{n} ({\hat{g}}_{n}) \leq S_{n} .

(A.8)

By (A.3), (A.7), and (A.8), we can conclude

E_{n} ({\hat{g}}_{n}) = E_{n} (\tilde{h}) = S_{n},

and hence, together with (A.6), ${\hat{g}}_{n}$ becomes a solution to Problem (A) with $U_{n} = U^{*}$.

Next, we turn to the uniqueness of the solution to Problem (B). Let ${\hat{g}}_{n}$ and ${\tilde{g}}_{n}$ be two solutions to Problem (B). By Proposition 2, we have

{\hat{g}}_{n} - {\tilde{g}}_{n} = {\tilde{p}}_{n}

(A.9)

for some ${\tilde{p}}_{n} \in P_{m - 1}$. By the previous arguments, ${\hat{g}}_{n}$ and ${\tilde{g}}_{n}$ are also solutions to Problem (A) with $U_{n} = U^{*}$. By Proposition 1, we have

{\hat{g}}_{n} (X_{i}) = {\tilde{g}}_{n} (X_{i})

(A.10)

for $1 \leq i \leq n$. From (A.9) and (A.10), we have ${\tilde{p}}_{n} (X_{i}) = 0$ for $1 \leq i \leq n$, which implies ${\tilde{p}}_{n} (x) = 0$ for all $x \in R^{d}$ by the $P_{m - 1}$-unisolvency of the $X_{i}$’s. Hence, ${\hat{g}}_{n} (x) = {\tilde{g}}_{n} (x)$ for all $x \in R^{d}$. $□$

Proof of Proposition 3.

Let $0 \leq U_{n} < J (f_{I})$ be given. By Proposition 1, there exists a solution ${\hat{f}}_{n}$ to Problem (A). Let $S^{*} = Ψ_{n} (U_{n})$ and $\tilde{h}$ be a solution to Problem (B) with $S_{n} = S^{*}$ . We will first show that ${\hat{f}}_{n}$ is also a solution to Problem (B) with $S_{n} = S^{*}$ and satisfies $J ({\hat{f}}_{n}) = U_{n}$ . First, we will prove that

U_{n} \leq J (\tilde{h}) .

(A.11)

To prove (A.11), we note that by Proposition 4, $\tilde{h}$ is a solution to

{minimize}_{f \in F_{m}} E_{n} (f) subject to J (f) \leq U_{n}

with $E_{n} (\tilde{h}) = S^{*}$. If, on the contrary, $J (\tilde{h}) < U_{n}$, then $\tilde{h}$ is also a solution to

{minimize}_{f \in F_{m}} E_{n} (f) subject to J (f) \leq J (\tilde{h})

with $E_{n} (\tilde{h}) = S^{*}$, which contradicts the fact that $Ψ_{n}$ is strictly decreasing over $[0, J (f_{I})]$. Thus, (A.11) is proven.

By Proposition 4, $\tilde{h}$ is a solution to Problem (A), so we have

E_{n} ({\hat{f}}_{n}) = E_{n} (\tilde{h}) = S^{*} .

(A.12)

By (A.12), ${\hat{f}}_{n}$ is a feasible solution to Problem (B) with $S_{n} = S^{*}$ , so we should have

J (\tilde{h}) \leq J ({\hat{f}}_{n}) .

(A.13)

Because ${\hat{f}}_{n}$ is a feasible solution to Problem (A), we have

J ({\hat{f}}_{n}) \leq U_{n} .

(A.14)

By combining (A.11), (A.13), and (A.14), we have

J (\tilde{h}) = J ({\hat{f}}_{n}) = U_{n} .

(A.15)

By (A.12) and (A.15), ${\hat{f}}_{n}$ is also a solution to Problem (B) with $S_{n} = S^{*}$ and satisfies $J ({\hat{f}}_{n}) = U_{n}$.

We next turn to the uniqueness of the solution to Problem (A). Let ${\hat{f}}_{n}$ and ${\tilde{f}}_{n}$ be two solutions to Problem (A). By the previous arguments, ${\hat{f}}_{n}$ and ${\tilde{f}}_{n}$ are also two solutions to Problem (B) with $S_{n} = S^{*}$ . By Proposition 4, the solution to Problem (B) with $S_{n} = S^{*}$ is unique, so ${\hat{f}}_{n} = {\tilde{f}}_{n}$ . $□$

Proof of Proposition 5.

Combine Propositions 3 and 4. $□$

Proof of Proposition 6.

Let $H_{0} = span {p_{1}, \dots, p_{M}}$ and $H_{1}$ be the reproducing kernel Hilbert space with the $R (s, t)$ ’s as the reproducing kernels. Then, $F_{m}$ is the direct sum of $H_{0}$ and $H_{1}$ ; see Meinguet [50, equation 13 on p. 295 and equation 20 on p. 296]. Any element f of $F_{m}$ has the following representation:

f (x) = \sum_{i = 1}^{M} c_{i} p_{i} (x) + \sum_{i = 1}^{n} d_{i} R (x, X_{i}) + ρ (x)

for $x \in R^{d}$, where $ρ$ is some element in $F_{m}$ perpendicular to $p_{1} (\cdot), \dots, p_{M} (\cdot), R (\cdot, X_{1}), \dots, R (\cdot, X_{n})$ because of the property of Hilbert spaces. Let $h (\cdot) = \sum_{i = 1}^{n} d_{i} R (\cdot, X_{i}) + ρ (\cdot)$. By Meinguet [50, equation 14 on p. 295], for $j \in \{1, 2, \dots, n\}$,

\begin{array}{l} h (X_{j}) = {(R (X_{j}, \cdot), h (\cdot))}_{m} \\ = {(R (X_{j}, \cdot), \sum_{i = 1}^{n} d_{i} R (\cdot, X_{i}) + ρ (\cdot))}_{m} \\ = \sum_{i = 1}^{n} d_{i} R (X_{j}, X_{i}) \end{array}

because $ρ$ is perpendicular to $R (\cdot, X_{1}), \dots, R (\cdot, X_{n})$ and $R (X_{j}, \cdot)$ is the representer of evaluation at $X_{j}$. Hence, Problem (A) can be expressed as

{Minimize}_{f \in F_{m}} \frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - (\sum_{j = 1}^{M} c_{j} p_{j} (X_{i}) + \sum_{j = 1}^{n} d_{j} R (X_{j}, X_{i})))^{2}

subject to $J (\sum_{i = 1}^{n} d_{i} R (\cdot, X_{i})) + J (ρ) \leq U_{n}$. Hence, if

f (x) = \sum_{i = 1}^{M} c_{i} p_{i} (x) + \sum_{i = 1}^{n} d_{i} R (x, X_{i}) + ρ (x)

is a solution to Problem (A), then

g (x) = \sum_{i = 1}^{M} c_{i} p_{i} (x) + \sum_{i = 1}^{n} d_{i} R (x, X_{i})

is also a solution to Problem (A). Also, for such g, note that

\begin{array}{l} J (g) = J (\sum_{i = 1}^{n} d_{i} R (\cdot, X_{i})) \\ = {(\sum_{i = 1}^{n} d_{i} R (\cdot, X_{i}), \sum_{i = 1}^{n} d_{i} R (\cdot, X_{i}))}_{m} \\ = \sum_{i = 1}^{n} \sum_{j = 1}^{n} d_{i} d_{j} {(R (\cdot, X_{i}), R (\cdot, X_{j}))}_{m} \\ = \sum_{i = 1}^{n} \sum_{j = 1}^{n} d_{i} d_{j} R (X_{i}, X_{j}) . □ \end{array}
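The last line of this display says that the smoothness measure of $g$ reduces to a quadratic form in the coefficient vector $(d_{1}, \dots, d_{n})$, which is what makes Problem (A) a finite-dimensional quadratic program. The sketch below illustrates only this algebraic step. The Gaussian `kernel` is a stand-in chosen for illustration and is NOT the kernel $R$ of the paper; `X` and `d` are hypothetical values.

```python
import math

def kernel(s, t):
    # Stand-in symmetric kernel (Gaussian); an assumption, not the paper's R.
    return math.exp(-(s - t) ** 2)

X = [0.0, 0.5, 1.5]     # hypothetical design points (d = 1)
d = [1.0, -2.0, 0.5]    # hypothetical coefficients d_1, ..., d_n
n = len(X)

# Kernel matrix K[i][j] = kernel(X_i, X_j).
K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]

# Double sum sum_i sum_j d_i d_j kernel(X_i, X_j).
double_sum = sum(d[i] * d[j] * kernel(X[i], X[j])
                 for i in range(n) for j in range(n))

# The same quantity written as the quadratic form d^T K d.
Kd = [sum(K[i][j] * d[j] for j in range(n)) for i in range(n)]
quad_form = sum(d[i] * Kd[i] for i in range(n))
```

In this form, the constraint $J (g) \leq U_{n}$ becomes a single quadratic inequality in the coefficients, and the objective is a least squares criterion in $(c, d)$.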

Proof of Proposition 7.

As in the proof of Proposition 6, any element f of $F_{m}$ has the following representation:

f (x) = \sum_{i = 1}^{M} c_{i} p_{i} (x) + \sum_{i = 1}^{n} d_{i} R (x, X_{i}) + ρ (x)

for $x \in R^{d}$, where $ρ$ is some element in $F_{m}$ perpendicular to $p_{1} (\cdot), \dots, p_{M} (\cdot), R (\cdot, X_{1}), \dots, R (\cdot, X_{n})$ because of the property of Hilbert spaces. By arguments similar to those in the proof of Proposition 6, Problem (B) can be expressed as

{Minimize}_{f \in F_{m}} J (\sum_{i = 1}^{n} d_{i} R (\cdot, X_{i})) + J (ρ)

subject to

\frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - (\sum_{j = 1}^{M} c_{j} p_{j} (X_{i}) + \sum_{j = 1}^{n} d_{j} R (X_{i}, X_{j})))^{2} \leq S_{n} .

Hence, $J (ρ) = 0$ and $ρ = 0$. The rest of the proposition follows from arguments similar to those in the proof of Proposition 6. $□$

Proof of Lemma 2.

From the assumption that $τ$ is continuous on ${[a, b]}^{d}$ , the $X_{i}$ ’s are mutually distinct a.s. By Chui [10, theorem 9.4 on p. 135] or Chung and Yao [11], there exists a $P_{m - 1}$ -unisolvent set ${x_{1}^{*}, \dots, x_{r}^{*}}$ in ${[a, b]}^{d}$ with some fixed positive integer r.

Define a function $H : R^{r \times d} \to R$ by

H (x_{1}, \dots, x_{r}) = det (P_{x_{1}, \dots, x_{r}}^{T} P_{x_{1}, \dots, x_{r}})

for $(x_{1}, \dots, x_{r}) \in R^{r \times d}$, where

P_{x_{1}, \dots, x_{r}} = (\begin{matrix} p_{1} (x_{1}) & \dots & p_{M} (x_{1}) \\ \vdots & & \vdots \\ p_{1} (x_{r}) & \dots & p_{M} (x_{r}) \end{matrix}) .

Then, $H (x_{1}, \dots, x_{r}) \neq 0$ at $(x_{1}, \dots, x_{r}) = (x_{1}^{*}, \dots, x_{r}^{*})$ because $P_{x_{1}^{*}, \dots, x_{r}^{*}}^{T} P_{x_{1}^{*}, \dots, x_{r}^{*}}$ is positive definite. Because H is continuous, there exists $ϵ_{0} > 0$ such that

‖ (x_{1}, \dots, x_{r}) - (x_{1}^{*}, \dots, x_{r}^{*}) ‖ \leq ϵ_{0}

implies $H (x_{1}, \dots, x_{r}) \neq 0$. Note that $H (x_{1}, \dots, x_{r}) \neq 0$ implies that $\{x_{1}, \dots, x_{r}\}$ is a $P_{m - 1}$-unisolvent set. By Assumption 2, there exists a subset $\{X_{1}^{*}, \dots, X_{r}^{*}\} \subset \{X_{1}, \dots, X_{n}\}$ such that

‖ X_{i}^{*} - x_{i}^{*} ‖ < ϵ_{0} / r

(A.16)

for $1 \leq i \leq r$ for all n sufficiently large a.s. We note that (A.16) implies

‖ (X_{1}^{*}, \dots, X_{r}^{*}) - (x_{1}^{*}, \dots, x_{r}^{*}) ‖ \leq ϵ_{0},

and hence, $\{X_{1}^{*}, \dots, X_{r}^{*}\}$ is a $P_{m - 1}$-unisolvent set. $□$
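The role of the determinant $H$ can be illustrated in the simplest case $d = 1$ and $m = 2$, where $P_{m - 1}$ consists of polynomials of degree less than two and a natural basis is $p_{1} (x) = 1$, $p_{2} (x) = x$. The sketch below is a toy check under these assumptions (the function name `gram_det` and the point sets are hypothetical): $H$ is nonzero for a unisolvent set, stays nonzero under a small perturbation, and vanishes when all points coincide.

```python
def gram_det(points):
    """det(P^T P) for the basis p1(x) = 1, p2(x) = x (case d = 1, m = 2).

    Here P has rows (1, x_k), so P^T P = [[r, s1], [s1, s2]] with
    r = number of points, s1 = sum of x_k, s2 = sum of x_k^2.
    """
    r = len(points)
    s1 = sum(points)
    s2 = sum(x * x for x in points)
    return r * s2 - s1 * s1

h_star = gram_det([0.0, 1.0, 2.0])      # unisolvent reference set
h_near = gram_det([0.01, 0.98, 2.03])   # small perturbation of the reference set
h_bad = gram_det([1.0, 1.0, 1.0])       # all points coincide: not unisolvent
```

The continuity argument in the proof corresponds to `h_near` remaining close to `h_star`, and hence away from zero.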

Appendix B. Proofs of Theorems 1–4 and Corollaries 1–3

We start by defining a set of functions that will play an important role in the proofs of Theorems 1–4. We define $A_{m}$ by

\begin{array}{l} A_{m} = {f \in F_{m} : \int_{{[a, b]}^{d}} {D^{α} f (x)}^{2} d x \leq c_{A} + c_{B} + 1 for all α such that \\ | α | = m, | f (x) | \leq α_{0}, | f (x) - f (y) | \leq α_{1} ‖ x - y ‖^{δ} for x, y \in {[a, b]}^{d}} \end{array}

(B.1)

for some constants $α_{0}$ and $α_{1}$, where $δ = 1$ when $m \geq 2$ and $δ = 1 / 2$ when $m = 1$. We endow $A_{m}$ with a metric $d_{\infty}$ and a pseudometric $d_{n}$ defined as follows:

d_{\infty} (f_{1}, f_{2}) = \sup_{x \in {[a, b]}^{d}} | f_{1} (x) - f_{2} (x) |

and

d_{n} (f_{1}, f_{2}) = {\frac{1}{n} \sum_{i = 1}^{n} (f_{1} (X_{i}) - f_{2} (X_{i}))^{2}}^{1 / 2}

for $f_{1}, f_{2} \in A_{m}$.

In Lemmas B.1 and B.2, we will establish that Assumption 7 guarantees that ${\hat{g}}_{n}$ , restricted to ${[a, b]}^{d}$ , belongs to $A_{m}$ for n sufficiently large a.s. In the proof of Lemma B.3, we will use the fact that $A_{m}$ can be covered by a finite number of balls with radius $ϵ$ in metric $d_{\infty}$ for any $ϵ > 0$ . In the proof of Theorem 3, we will use the fact that the number of balls of radius $ϵ$ needed to cover $A_{m}$ in metric $d_{n}$ is of order $\exp (ϵ^{- d / m})$ .

To prove Theorems 1 and 2, we need Lemmas B.1, B.2, B.3, and B.4. Here is our first lemma.

Lemma B.1.

Let $Ω$ be an open bounded subset of $R^{d}$ containing ${[a, b]}^{d}$. Under Assumptions 2, 3, 7, and 9, there exists a constant $α_{0}$, depending on $Ω$, such that

\sup_{x \in Ω} | {\hat{g}}_{n} (x) | \leq α_{0}

for n sufficiently large a.s.

Proof of Lemma B.1.

The proof is divided into three steps.

Step 1. We prove $(1 / n) \sum_{i = 1}^{n} {\hat{g}}_{n} {(X_{i})}^{2}$ is bounded for n sufficiently large (uniformly on n) a.s. To see this, one notices that by Assumption 3, Assumption 9, and the strong law of large numbers, we have
$\begin{array}{l} \frac{1}{n} \sum_{i = 1}^{n} {\hat{g}}_{n} {(X_{i})}^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} {({\hat{g}}_{n} (X_{i}) - Y_{i})}^{2} + \frac{2}{n} \sum_{i = 1}^{n} Y_{i}^{2} \\ \leq \frac{2}{n} \sum_{i = 1}^{n} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))^{2} - \frac{4}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i})) + \frac{2}{n} \sum_{i = 1}^{n} ε_{i}^{2} + \frac{2}{n} \sum_{i = 1}^{n} Y_{i}^{2} \\ \leq 2 B_{n} + \frac{2}{n} \sum_{i = 1}^{n} ε_{i}^{2} + \frac{2}{n} \sum_{i = 1}^{n} Y_{i}^{2} \leq 2 σ^{2} + 2 E {(f_{*} (X) + ε)}^{2} + 1 ≜ M_{1} \end{array}$ (B.2)
for n sufficiently large a.s. We used the fact that $f_{*} \in F_{m}$ and $2 m > d$ , and hence, $f_{*}$ is continuous on $R^{d}$ and is bounded over ${[a, b]}^{d}$ .
Step 2. We use the Sobolev integral identity to express ${\hat{g}}_{n}$ as the sum of a polynomial of degree less than m and a bounded function over $Ω$ . To fill in the details, note that ${\hat{g}}_{n}$ is continuous over $R^{d}$ and hence, belongs to $L_{2} (Ω)$ when it is restricted to $Ω$ . By Assumption 7 and Adams and Fournier [1, theorem 2 on p. 717], ${\hat{g}}_{n}$ , restricted to $Ω$ , belongs to $W_{m} (Ω)$ for n sufficiently large a.s. The Sobolev integral identity (see Oden and Reddy [54, theorem 3.6 on p. 78]) allows us to express
${\hat{g}}_{n} (x) = {\hat{u}}_{n} (x) + {\hat{v}}_{n} (x)$
for all $x \in Ω$ , where ${\hat{u}}_{n}$ is a polynomial of degree less than m,
${\hat{v}}_{n} (x) = \int_{Ω} ‖ x - y ‖^{m - d} \sum_{| α | = m} Q_{α} (x, y) D^{α} {\hat{g}}_{n} (y) d y$
for $x \in Ω$ , and $Q_{α} (x, y)$ , $| α | = m$ , are bounded infinitely differentiable functions of $x$ and $y$ . Hölder’s inequality implies
$\begin{array}{l} | {\hat{v}}_{n} (x) | \leq \sum_{| α | = m} {\int_{Ω} ‖ x - y ‖^{2 m - 2 d} d y \int_{Ω} {Q_{α} (x, y) D^{α} {\hat{g}}_{n} (y)}^{2} d y}^{1 / 2} \\ \leq \max_{x, y \in Ω} | Q_{α} (x, y) | \sum_{| α | = m} {\int_{Ω} ‖ x - y ‖^{2 m - 2 d} d y \int_{Ω} {D^{α} {\hat{g}}_{n} (y)}^{2} d y}^{1 / 2} \end{array}$ (B.3)
for $x \in Ω$ .
Let $ρ$ be the radius of the smallest ball, centered at the origin, containing $Ω$ . Then, changing to the spherical coordinate system yields
$\int_{Ω} ‖ x - y ‖^{2 m - 2 d} d y \leq \int_{0}^{2 π} \int_{0}^{2 ρ} r^{2 (m - d)} r^{d - 1} d r d θ = 2 π {(2 ρ)}^{2 (m - d / 2)}$ (B.4)
by the fact that $2 m > d$ .
By Assumption 7, (B.3), and (B.4), there exists a constant $M_{2}$ such that
$\sup_{x \in Ω} | {\hat{v}}_{n} (x) | \leq M_{2}$ (B.5)
for n sufficiently large a.s.
Step 3. We use (B.5) and the $P_{m - 1}$ -unisolvency of ${X_{1}, \dots, X_{n}}$ to show that $\sup_{x \in Ω} | {\hat{u}}_{n} (x) |$ is bounded uniformly on n for n sufficiently large a.s.
First, combining (B.2) and (B.5) yields
$\frac{1}{n} \sum_{i = 1}^{n} {\hat{u}}_{n} {(X_{i})}^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} {\hat{g}}_{n} {(X_{i})}^{2} + \frac{2}{n} \sum_{i = 1}^{n} {\hat{v}}_{n} {(X_{i})}^{2} \leq 2 M_{1} + 2 M_{2}^{2}$ (B.6)
for n sufficiently large a.s.

Second, we consider the following matrix:

Z_{n} = (\begin{matrix} p_{1} (X_{1}) & p_{2} (X_{1}) & \dots & p_{M} (X_{1}) \\ \vdots & \vdots & & \vdots \\ p_{1} (X_{n}) & p_{2} (X_{n}) & \dots & p_{M} (X_{n}) \end{matrix}) .

By the $P_{m - 1}$ -unisolvency of the $X_{i}$ ’s, the matrix $W ≜ (1 / n) Z_{n}^{T} Z_{n}$ is positive definite because $w^{T} W w = 0$ with $w^{T} = (w_{1}, \dots, w_{M})$ implies $\sum_{i = 1}^{M} w_{i} p_{i} (X_{j})$ $= 0$ for all $1 \leq j \leq n$ , which then implies $w_{i} = 0$ for all $1 \leq i \leq M$ . Let $λ_{1}, \dots, λ_{M}$ be the positive eigenvalues of W, and let $ρ_{1}, \dots, ρ_{M}$ be the corresponding eigenvectors with $‖ ρ_{i} ‖ = 1$ for $1 \leq i \leq M$ . Let $λ_{min}$ be the minimum of $λ_{1}, \dots, λ_{M}$ .

We next prove that

\underset{n \to \infty}{lim inf} λ_{min} \geq λ_{*} > 0

(B.7)

for some constant $λ_{*}$. Because $ρ_{i}^{T} W ρ_{i} = λ_{i}$ for $1 \leq i \leq M$, we have $λ_{min} \geq \min_{‖ w ‖ = 1} w^{T} W w$, where $w^{T} = (w_{1}, \dots, w_{M})$. So,

λ_{min} \geq \min_{w_{1}^{2} + w_{2}^{2} + \dots + w_{M}^{2} = 1} \frac{1}{n} \sum_{i = 1}^{n} {(w_{1} p_{1} (X_{i}) + \dots + w_{M} p_{M} (X_{i}))}^{2} .

Note that $Λ = {(w_{1}, \dots, w_{M}) \in R^{M} : w_{1}^{2} + \dots + w_{M}^{2} = 1}$ is a nonempty closed subset of $R^{M}$ , and

\frac{1}{n} \sum_{i = 1}^{n} {(w_{1} p_{1} (X_{i}) + \dots + w_{M} p_{M} (X_{i}))}^{2} \to E {(w_{1} p_{1} (X) + \dots + w_{M} p_{M} (X))}^{2}

as $n \to \infty$ uniformly on $Λ$ because $Λ$ is bounded and the $X_{i}$’s are in ${[a, b]}^{d}$. By Shapiro et al. [63, proposition 5.2 on p. 157],

\min_{w_{1}^{2} + \dots + w_{M}^{2} = 1} \frac{1}{n} \sum_{i = 1}^{n} {(w_{1} p_{1} (X_{i}) + \dots + w_{M} p_{M} (X_{i}))}^{2} \to \min_{w_{1}^{2} + \dots + w_{M}^{2} = 1} E {(w_{1} p_{1} (X) + \dots + w_{M} p_{M} (X))}^{2}

a.s. as $n \to \infty$. Note that

\min_{w_{1}^{2} + \dots + w_{M}^{2} = 1} E {(w_{1} p_{1} (X) + \dots + w_{M} p_{M} (X))}^{2} \neq 0

because $E {(w_{1} p_{1} (X) + \dots + w_{M} p_{M} (X))}^{2} = 0$ implies $w_{1} = \dots = w_{M} = 0$. Hence,

\underset{n \to \infty}{lim inf} λ_{min} \geq \min_{w_{1}^{2} + \dots + w_{M}^{2} = 1} E {(w_{1} p_{1} (X) + \dots + w_{M} p_{M} (X))}^{2} ≜ λ_{*} > 0,

proving (B.7).

We finally prove that ${\hat{u}}_{n}$ is bounded on $Ω$ uniformly on n for n sufficiently large. Let ${\hat{u}}_{n}$ be given by ${\hat{u}}_{n} = {\hat{a}}_{n, 1} p_{1} + \dots + {\hat{a}}_{n, M} p_{M}$ for some ${\hat{a}}_{n, 1}, \dots, {\hat{a}}_{n, M}$ . Let ${\hat{a}}_{n}^{T} ≜ ({\hat{a}}_{n, 1}, \dots, {\hat{a}}_{n, M})$ and note that the Rayleigh–Ritz theorem implies

\frac{1}{n} \sum_{i = 1}^{n} {\hat{u}}_{n} {(X_{i})}^{2} = {\hat{a}}_{n}^{T} W {\hat{a}}_{n} \geq λ_{min} ({\hat{a}}_{n, 1}^{2} + \dots + {\hat{a}}_{n, M}^{2}) \geq λ_{min} {\hat{a}}_{n, i}^{2}

(B.8)

for $1 \leq i \leq M$. Combining (B.6), (B.7), and (B.8) yields

| {\hat{a}}_{n, i} | \leq \sqrt{(2 M_{1} + 2 M_{2}^{2}) / (λ_{*} / 2)}

(B.9)

for $1 \leq i \leq M$ and all n sufficiently large a.s., and hence,

| {\hat{u}}_{n} (x) | = | {\hat{a}}_{n, 1} p_{1} (x) + \dots + {\hat{a}}_{n, M} p_{M} (x) | \leq M \sqrt{(2 M_{1} + 2 M_{2}^{2}) / (λ_{*} / 2)} \max_{1 \leq i \leq M, x \in Ω} | p_{i} (x) | ≜ M_{3}

(B.10)

for all $x \in Ω$ and n sufficiently large a.s., proving Step 3.

Combining (B.5) and (B.10) completes the proof of Lemma B.1. $□$
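The Rayleigh–Ritz step (B.8) in Step 3 can be checked numerically in the case $d = 1$, $m = 2$ with the basis $p_{1} (x) = 1$, $p_{2} (x) = x$. The sketch below is illustrative only: the sample `X` and the coefficient vector `a_hat` are hypothetical, and the closed-form eigenvalue formula applies to $2 \times 2$ symmetric matrices.

```python
import math

def min_eig_2x2(a, b, c):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    return 0.5 * ((a + c) - math.sqrt((a - c) ** 2 + 4.0 * b * b))

X = [0.1, 0.4, 0.7, 0.9]   # hypothetical design points in [0, 1]
n = len(X)

# W = (1/n) Z^T Z, where row i of Z is (p1(X_i), p2(X_i)) = (1, X_i).
w11 = 1.0
w12 = sum(X) / n
w22 = sum(x * x for x in X) / n
lam_min = min_eig_2x2(w11, w12, w22)

# Hypothetical polynomial u(x) = a1 + a2 * x.
a_hat = (0.3, -1.2)
mean_sq = sum((a_hat[0] + a_hat[1] * x) ** 2 for x in X) / n   # a^T W a
bound = lam_min * (a_hat[0] ** 2 + a_hat[1] ** 2)              # lam_min * ||a||^2
```

Since `mean_sq` is at least `bound` and `lam_min` is positive for distinct design points, a bound on the empirical mean square of the polynomial translates into a bound on each coefficient, as in (B.9).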

Lemma B.2.

Under Assumptions 2, 3, 7, and 9, there exists a constant $α_{1}$ such that

| {\hat{g}}_{n} (x) - {\hat{g}}_{n} (y) | \leq α_{1} ‖ x - y ‖^{δ}

for any $x, y \in {[a, b]}^{d}$ for n sufficiently large a.s., where $δ = 1$ when $m \geq 2$ and $δ = 1 / 2$ when $m = 1$.

Proof of Lemma B.2.

We consider two cases separately: the case when $m = 1$ and the case when $m \geq 2$ .

Case 1. $m = 1$ . Because $2 m > d$ , it follows that $d = 1$ . The fundamental theorem of calculus and the continuity of ${\hat{g}}_{n}$ imply
${\hat{g}}_{n} (x) - {\hat{g}}_{n} (y) = \int_{y}^{x} {\hat{g}}_{n}^{(1)} (t) d t,$
where ${\hat{g}}_{n}^{(1)}$ is the weak derivative of ${\hat{g}}_{n}$ of the first order. By Assumption 7 and Hölder’s inequality,
$| {\hat{g}}_{n} (x) - {\hat{g}}_{n} (y) | \leq {\int_{y}^{x} {{\hat{g}}_{n}^{(1)} (t)}^{2} d t \int_{y}^{x} d t}^{1 / 2} \leq {(c_{B} + 1)}^{1 / 2} ‖ x - y ‖^{1 / 2}$
for $y \leq x$ and n sufficiently large a.s., proving Lemma B.2.
Case 2. $m \geq 2$ . Let $Ω$ be a bounded open subset of $R^{d}$ containing ${[a, b]}^{d}$ .

We first prove that $D^{α} {\hat{g}}_{n} (x)$ with $| α | = 1$ is bounded uniformly on n and $x$ for n sufficiently large a.s. By Oden and Reddy [54, equation 3.48 on p. 81] and Lemma B.1, ${lim sup}_{n \to \infty} ‖ {\hat{g}}_{n} ‖_{W_{m} (Ω)} < \infty$ a.s. Assumption 7 and Adams and Fournier [1, theorem 2 on p. 717] imply that there exists a constant $M_{4}$ such that

\max_{| β | = 1} \int_{Ω} {D^{β} {\hat{g}}_{n} (x)}^{2} d x \leq M_{4} and \max_{| β | = 2} \int_{Ω} {D^{β} {\hat{g}}_{n} (x)}^{2} d x \leq M_{4}

(B.11)

for n sufficiently large a.s.

When ${\hat{g}}_{n}$ is restricted to $Ω$ , $D^{α} {\hat{g}}_{n}$ with $| α | = 1$ belongs to $W_{1} (Ω)$ for n sufficiently large a.s. Applying the Sobolev integral identity (Oden and Reddy [54, theorem 3.6 on p. 68]) to $D^{α} {\hat{g}}_{n}$ with $| α | = 1$ yields

D^{α} {\hat{g}}_{n} (x) = \int_{Ω} ζ (y) D^{α} {\hat{g}}_{n} (y) d y + \int_{Ω} ‖ x - y ‖^{m - d} \sum_{| β | = 2} Q_{β} (x, y) D^{β} {\hat{g}}_{n} (y) d y

for $x \in Ω$, where $ζ (y)$ is a continuous bounded function of $y$ and $Q_{β} (x, y)$, $| β | = 2$, are bounded infinitely differentiable functions of $x$ and $y$. Hence, for $x \in Ω$ and $| α | = 1$, combining Hölder’s inequality, (B.4), and (B.11) yields

\begin{array}{l} | D^{α} {\hat{g}}_{n} (x) | \leq {\int_{Ω} ζ {(y)}^{2} d y}^{1 / 2} {\int_{Ω} D^{α} {\hat{g}}_{n} {(y)}^{2} d y}^{1 / 2} \\ + \sum_{| β | = 2} {\int_{Ω} ‖ x - y ‖^{2 m - 2 d} d y}^{1 / 2} {\int_{Ω} Q_{β} {(x, y)}^{2} {D^{β} {\hat{g}}_{n} (y)}^{2} d y}^{1 / 2} \leq M_{5} \end{array}

(B.12)

for some constant $M_{5}$ and n sufficiently large a.s.

We next note that by the extended mean-value theorem, for any $x = (x_{1}, \dots, x_{d})$ and $y = (y_{1}, \dots, y_{d})$ , satisfying $x, x + y \in {[a, b]}^{d}$ , we have

{\hat{g}}_{n} (x + y) - {\hat{g}}_{n} (x) = \frac{\partial {\hat{g}}_{n}}{\partial x_{1}} (z_{1}) y_{1} + \dots + \frac{\partial {\hat{g}}_{n}}{\partial x_{d}} (z_{d}) y_{d}

for some $z = x + c y$ with $c \in (0, 1)$, and hence,

| {\hat{g}}_{n} (x + y) - {\hat{g}}_{n} (x) | \leq {y_{1}^{2} + \dots + y_{d}^{2}}^{1 / 2} {\frac{\partial {\hat{g}}_{n}}{\partial x_{1}} {(z_{1})}^{2} + \dots + \frac{\partial {\hat{g}}_{n}}{\partial x_{d}} {(z_{d})}^{2}}^{1 / 2} \leq M_{6} ‖ y ‖

for some constant $M_{6}$ and n sufficiently large a.s.

Combining the two cases, we complete the proof of Lemma B.2. $□$
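The Hölder-type bound of Case 1 can be illustrated on a concrete smooth function. The sketch below takes $g = \sin$ as a hypothetical stand-in for ${\hat{g}}_{n}$ (with $d = 1$) and compares $| g (x) - g (y) |$ with $(\int_{y}^{x} g^{(1)} (t)^{2} d t)^{1 / 2} | x - y |^{1 / 2}$, using the closed form of $\int \cos^{2}$. The function name `holder_check` and the test pairs are assumptions made for illustration.

```python
import math

def holder_check(y, x):
    """Return (lhs, rhs) of |g(x) - g(y)| <= (int_y^x g'(t)^2 dt)^(1/2) * (x - y)^(1/2)
    for g = sin and y <= x."""
    lhs = abs(math.sin(x) - math.sin(y))
    # Closed form of int_y^x cos(t)^2 dt.
    energy = (x - y) / 2.0 + (math.sin(2 * x) - math.sin(2 * y)) / 4.0
    rhs = math.sqrt(energy) * math.sqrt(x - y)
    return lhs, rhs

pairs = [(0.0, 0.3), (0.2, 1.5), (1.0, 3.0)]   # (y, x) with y <= x
results = [holder_check(y, x) for y, x in pairs]
```

In the proof, the energy term is bounded by $c_{B} + 1$ via Assumption 7, which turns this pointwise inequality into a uniform Hölder constant.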

Lemma B.3.

Under Assumptions 2, 3, 7, and 9,

\frac{1}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i})) \to 0

as $n \to \infty$ a.s.

Proof of Lemma B.3.

Let $ϵ > 0$ be given. Let $A_{m}$ be defined as in (B.1). Assumption 7, Lemma B.1, and Lemma B.2 imply that under Assumptions 2 and 3, any solution to Problem (B), restricted to ${[a, b]}^{d}$ , belongs to $A_{m}$ for n sufficiently large a.s. By Kolmogorov and Tihomirov [39, theorem 13], there exist $l ≜ l (ϵ)$ and $h_{1}, \dots, h_{l}$ in $A_{m}$ such that for any $g \in A_{m}$ , there is $i ≜ i (g) \in {1, \dots, l}$ satisfying

\sup_{x \in {[a, b]}^{d}} | g (x) - h_{i} (x) | \leq ϵ .

(B.13)

For each $j \in {1, \dots, l}$ , we have

\begin{array}{l} \frac{1}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i})) = \frac{1}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{g}}_{n} (X_{i}) - h_{j} (X_{i})) + \frac{1}{n} \sum_{i = 1}^{n} ε_{i} (h_{j} (X_{i}) - f_{*} (X_{i})) \\ \leq \frac{1}{n} \sum_{i = 1}^{n} | ε_{i} | \sup_{x \in {[a, b]}^{d}} | {\hat{g}}_{n} (x) - h_{j} (x) | + \max_{1 \leq j \leq l} \frac{1}{n} \sum_{i = 1}^{n} ε_{i} (h_{j} (X_{i}) - f_{*} (X_{i})), \end{array}

and hence,

\begin{array}{l} \frac{1}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i})) \\ \leq \min_{1 \leq j \leq l} (\sup_{x \in {[a, b]}^{d}} | {\hat{g}}_{n} (x) - h_{j} (x) |) (\frac{1}{n} \sum_{i = 1}^{n} | ε_{i} |) + \max_{1 \leq j \leq l} \frac{1}{n} \sum_{i = 1}^{n} ε_{i} (h_{j} (X_{i}) - f_{*} (X_{i})) \\ \leq \frac{ϵ}{n} \sum_{i = 1}^{n} | ε_{i} | + \max_{1 \leq j \leq l} \frac{1}{n} \sum_{i = 1}^{n} ε_{i} (h_{j} (X_{i}) - f_{*} (X_{i})) \quad \text{by (B.13)} \\ \leq ϵ E | ε | + ϵ \end{array}

for n sufficiently large a.s. by the strong law of large numbers.

□

Lemma B.4.

Under Assumptions 2, 3, 7, and 9,

\frac{1}{n} \sum_{i = 1}^{n} {({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \to 0

as $n \to \infty$ a.s.

Proof of Lemma B.4.

Note that

\frac{1}{n} \sum_{i = 1}^{n} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i})) + S_{n} - \frac{1}{n} \sum_{i = 1}^{n} ε_{i}^{2} .

Lemma B.4 follows from Assumption 9 and Lemma B.3. $□$
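The inequality used in this proof follows from expanding the squared residual. The worked step below is a sketch that assumes the data model $Y_{i} = f_{*} (X_{i}) + ε_{i}$, the definition $E_{n} (f) = (1 / n) \sum_{i = 1}^{n} (Y_{i} - f (X_{i}))^{2}$, and the feasibility of ${\hat{g}}_{n}$ for Problem (B), that is, $E_{n} ({\hat{g}}_{n}) \leq S_{n}$.

```latex
\begin{aligned}
S_{n} \ge E_{n}(\hat{g}_{n})
 &= \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{g}_{n}(X_{i}) - f_{*}(X_{i}) - \varepsilon_{i}\bigr)^{2} \\
 &= \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{g}_{n}(X_{i}) - f_{*}(X_{i})\bigr)^{2}
  - \frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i}\bigl(\hat{g}_{n}(X_{i}) - f_{*}(X_{i})\bigr)
  + \frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2} .
\end{aligned}
```

Moving the last two terms to the left-hand side gives the displayed bound. The cross term vanishes by Lemma B.3, and $S_{n} - (1 / n) \sum_{i = 1}^{n} ε_{i}^{2}$ is controlled through Assumption 9 and the strong law of large numbers.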

Proof of Theorem 2.

We first use Lemma B.4 and the fact that $f_{*}$ is continuous over ${[a, b]}^{d}$ to conclude that ${\hat{g}}_{n}$ converges to $f_{*}$ uniformly over ${[a, b]}^{d}$ a.s.

Fix $ϵ > 0$ . We can find a finite number of sets $S_{1}, \dots, S_{l}$ covering ${[a, b]}^{d}$ , each having a diameter less than $ϵ$ (i.e., $\sup {‖ x - y ‖ : x, y \in S_{j}} \leq ϵ$ ) and having a nonempty interior. Because $f_{*} \in F_{m}$ and hence, is continuous over ${[a, b]}^{d}$ , Lemma B.2 implies that we can find a constant, say c, satisfying

| {\hat{g}}_{n} (x) - {\hat{g}}_{n} (y) | \leq c ‖ x - y ‖^{δ} and | f_{*} (x) - f_{*} (y) | \leq c ‖ x - y ‖^{δ}

for $x, y \in {[a, b]}^{d}$ and n sufficiently large a.s., where $δ = 1$ when $m \geq 2$ and $δ = 1 / 2$ when $m = 1$.

For each $x \in S_{j}$ and $X_{i} \in S_{j}$ ,

| {\hat{g}}_{n} (x) - f_{*} (x) | \leq | {\hat{g}}_{n} (x) - {\hat{g}}_{n} (X_{i}) | + | {\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}) | + | f_{*} (X_{i}) - f_{*} (x) | \leq c ϵ^{δ} + | {\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}) | + c ϵ^{δ},

so averaging over the $X_{i}$’s that fall in $S_{j}$ gives

\begin{array}{l} \sup_{x \in S_{j}} | {\hat{g}}_{n} (x) - f_{*} (x) | \leq 2 c ϵ^{δ} + \frac{\sum_{i = 1}^{n} | {\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}) | I (X_{i} \in S_{j})}{\sum_{i = 1}^{n} I (X_{i} \in S_{j})} \\ \leq 2 c ϵ^{δ} + \frac{1}{n} \sum_{i = 1}^{n} | {\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}) | \cdot \frac{n}{\sum_{i = 1}^{n} I (X_{i} \in S_{j})} \\ \leq 2 c ϵ^{δ} + \sqrt{\frac{1}{n} \sum_{i = 1}^{n} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))^{2}} \frac{n}{\sum_{i = 1}^{n} I (X_{i} \in S_{j})} . \end{array}

By Lemma B.4 and Assumption 2, we conclude that

\underset{n \to \infty}{lim sup} \sup_{x \in S_{j}} | {\hat{g}}_{n} (x) - f_{*} (x) | \leq 0

a.s. Because there are finitely many $S_{j}$’s, we conclude that

\sup_{x \in {[a, b]}^{d}} | {\hat{g}}_{n} (x) - f_{*} (x) | \to 0

a.s. as $n \to \infty$.

We next use Lemmas B.1, B.2, and B.4 to establish

E {({\hat{g}}_{n} (X) - f_{*} (X))}^{2} \to 0

(B.14)

as $n \to \infty$.

We need to use the truncated value defined as follows. For any $c \geq 0$ and $x \geq 0$ , the truncated value of x, denoted by $T_{c} (x)$ , is defined by

T_{c} (x) = \begin{cases} x, & \text{if } x \geq c \\ 0, & \text{otherwise.} \end{cases}

(B.15)

Note that $T_{c} (x^{2}) = {T_{\sqrt{c}} (x)}^{2}$ for $x \geq 0$, $T_{c} (x) \leq T_{c} (y)$ for $0 \leq x \leq y$, and $T_{c} (x + y) \leq T_{c} (x) + T_{c} (y)$ for $x, y \geq 0$. Together with the fact that $| x + y | \leq | x | + | y |$ for $x, y \in R$, we obtain

T_{c} {(x + y)}^{2} \leq T_{c} {(x)}^{2} + T_{c} (y^{2}) + 2 T_{\sqrt{c}} | x | \cdot T_{\sqrt{c}} | y |

(B.16)

and

- T_{c} {(x + y)}^{2} \leq - T_{c} {(x)}^{2} - T_{c} (y^{2}) + 2 T_{\sqrt{c}} | x | \cdot T_{\sqrt{c}} | y | .

(B.17)

We will prove that

E T_{c} {({\hat{g}}_{n} (X) - f_{*} (X))}^{2} \to 0

(B.18)

as $n \to \infty$ for each $c > 0$. Once (B.18) is proven, letting $c ↑ \infty$ for each n will prove (B.14). To show (B.18), let $ϵ > 0$ and $c > 0$ be given. Let $A_{m}$ be defined as in (B.1). Lemmas B.1 and B.2 imply that, under Assumption 2 and Assumption 3, any solution to Problem (B), restricted to ${[a, b]}^{d}$, belongs to $A_{m}$ for n sufficiently large a.s. By Kolmogorov and Tihomirov [39, theorem 13], there exist $l ≜ l (ϵ)$ and $h_{1}, \dots, h_{l}$ in $A_{m}$ such that for any $g \in A_{m}$, there is $i ≜ i (g) \in \{1, \dots, l\}$ satisfying

\sup_{x \in {[a, b]}^{d}} | g (x) - h_{i} (x) | \leq ϵ .

(B.19)

For each $j \in {1, \dots, l}$ , we have

\begin{array}{l} E T_{c} {({\hat{g}}_{n} (X) - f_{*} (X))}^{2} - \frac{1}{n} \sum_{i = 1}^{n} T_{c} {({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \\ = E T_{c} {({\hat{g}}_{n} (X) - h_{j} (X) + h_{j} (X) - f_{*} (X))}^{2} - \frac{1}{n} \sum_{i = 1}^{n} T_{c} {({\hat{g}}_{n} (X_{i}) - h_{j} (X_{i}) + h_{j} (X_{i}) - f_{*} (X_{i}))}^{2} \\ \leq E T_{c} {({\hat{g}}_{n} (X) - h_{j} (X))}^{2} + E {(h_{j} (X) - f_{*} (X))}^{2} + 2 E T_{\sqrt{c}} | {\hat{g}}_{n} (X) - h_{j} (X) | \cdot T_{\sqrt{c}} | h_{j} (X) - f_{*} (X) | \\ - \frac{1}{n} \sum_{i = 1}^{n} T_{c} {({\hat{g}}_{n} (X_{i}) - h_{j} (X_{i}))}^{2} - \frac{1}{n} \sum_{i = 1}^{n} T_{c} {(h_{j} (X_{i}) - f_{*} (X_{i}))}^{2} \\ + \frac{2}{n} \sum_{i = 1}^{n} T_{\sqrt{c}} | {\hat{g}}_{n} (X_{i}) - h_{j} (X_{i}) | \cdot T_{\sqrt{c}} | h_{j} (X_{i}) - f_{*} (X_{i}) | \quad \text{by (B.16) and (B.17)} \\ \leq \max_{1 \leq j \leq l} {E T_{c} {(h_{j} (X) - f_{*} (X))}^{2} - \frac{1}{n} \sum_{i = 1}^{n} T_{c} {(h_{j} (X_{i}) - f_{*} (X_{i}))}^{2}} + E T_{c} {({\hat{g}}_{n} (X) - h_{j} (X))}^{2} \\ + 2 \sqrt{E T_{c} {({\hat{g}}_{n} (X) - h_{j} (X))}^{2}} \sqrt{E T_{c} {(h_{j} (X) - f_{*} (X))}^{2}} + \frac{1}{n} \sum_{i = 1}^{n} T_{c} {({\hat{g}}_{n} (X_{i}) - h_{j} (X_{i}))}^{2} \\ + 2 \sqrt{\frac{1}{n} \sum_{i = 1}^{n} T_{c} {({\hat{g}}_{n} (X_{i}) - h_{j} (X_{i}))}^{2}} \sqrt{\frac{1}{n} \sum_{i = 1}^{n} T_{c} {(h_{j} (X_{i}) - f_{*} (X_{i}))}^{2}} \end{array}

by the Cauchy–Schwarz inequality, so

$$\begin{array}{l}
E T_{c} {({\hat{g}}_{n} (X) - f_{*} (X))}^{2} - \frac{1}{n} \sum_{i = 1}^{n} T_{c} {({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \\
\quad \leq \max_{1 \leq j \leq l} \left\{ E T_{c} {(h_{j} (X) - f_{*} (X))}^{2} - \frac{1}{n} \sum_{i = 1}^{n} T_{c} {(h_{j} (X_{i}) - f_{*} (X_{i}))}^{2} \right\} + \min_{1 \leq j \leq l} E T_{c} {({\hat{g}}_{n} (X) - h_{j} (X))}^{2} \\
\qquad + \min_{1 \leq j \leq l} 2 \sqrt{E T_{c} {({\hat{g}}_{n} (X) - h_{j} (X))}^{2}} \sqrt{E T_{c} {(h_{j} (X) - f_{*} (X))}^{2}} + \min_{1 \leq j \leq l} \frac{1}{n} \sum_{i = 1}^{n} T_{c} {({\hat{g}}_{n} (X_{i}) - h_{j} (X_{i}))}^{2} \\
\qquad + \min_{1 \leq j \leq l} 2 \sqrt{\frac{1}{n} \sum_{i = 1}^{n} T_{c} {({\hat{g}}_{n} (X_{i}) - h_{j} (X_{i}))}^{2}} \sqrt{\frac{1}{n} \sum_{i = 1}^{n} T_{c} {(h_{j} (X_{i}) - f_{*} (X_{i}))}^{2}} \\
\quad \leq ϵ + 2 ϵ^{2} + 4 \sqrt{c} ϵ
\end{array}$$

a.s. for n sufficiently large, proving

$$E T_{c} {({\hat{g}}_{n} (X) - f_{*} (X))}^{2} - \frac{1}{n} \sum_{i = 1}^{n} T_{c} {({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \to 0 \quad \text{as } n \to \infty .$$

By the dominated convergence theorem, we have

$$E T_{c} {({\hat{g}}_{n} (X) - f_{*} (X))}^{2} - E \left[ \frac{1}{n} \sum_{i = 1}^{n} T_{c} {({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \right] \to 0$$

and

$$E \left[ \frac{1}{n} \sum_{i = 1}^{n} T_{c} {({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \right] \to 0 \quad \text{as } n \to \infty ,$$

and hence, (B.18) follows.

To prove the last part of Theorem 2, let $j \in \{1, \dots, m - 1\}$ be fixed. By Assumption 2(ii), for any $α$ with $| α | = j$,

$$E {(D^{α} {\hat{g}}_{n} (X) - D^{α} f_{*} (X))}^{2} \leq τ_{2} \sum_{| β | \leq j} \int_{{[a, b]}^{d}} {(D^{β} {\hat{g}}_{n} (x) - D^{β} f_{*} (x))}^{2} d x . \tag{B.20}$$

By Adams and Fournier [1, third equality in theorem 2 on p. 717], there exists a constant C such that the right-hand side of (B.20) is less than or equal to

$$\begin{array}{l}
τ_{2} C \sum_{| β | \leq m} \left\{ \int_{{[a, b]}^{d}} {(D^{β} {\hat{g}}_{n} (x) - D^{β} f_{*} (x))}^{2} d x \right\}^{j / 2 m} \left\{ \int_{{[a, b]}^{d}} ({\hat{g}}_{n} (x) - f_{*} (x))^{2} d x \right\}^{(m - j) / 2 m} \\
\quad \leq (τ_{2} / τ_{1}) C \sum_{| β | \leq m} \left\{ \int_{{[a, b]}^{d}} {(D^{β} {\hat{g}}_{n} (x) - D^{β} f_{*} (x))}^{2} d x \right\}^{j / 2 m} \left\{ E {({\hat{g}}_{n} (X) - f_{*} (X))}^{2} \right\}^{(m - j) / 2 m} .
\end{array}$$

By Adams and Fournier [1, first equality of theorem 2 on p. 717] and the fact that $J ({\hat{g}}_{n}) \leq c_{B} + 1$ for n sufficiently large a.s. and $E {({\hat{g}}_{n} (X) - f_{*} (X))}^{2} \to 0$ as $n \to \infty$,

$$\sum_{| β | \leq m} \left\{ \int_{{[a, b]}^{d}} {(D^{β} {\hat{g}}_{n} (x) - D^{β} f_{*} (x))}^{2} d x \right\}^{j / 2 m}$$

is bounded for n sufficiently large a.s. Therefore, $E {(D^{α} {\hat{g}}_{n} (X) - D^{α} f_{*} (X))}^{2}$ is bounded by a random variable, say $Z_{n}$, that converges to zero as $n \to \infty$ a.s. By using $T_{c} E {(D^{α} {\hat{g}}_{n} (X) - D^{α} f_{*} (X))}^{2} \leq T_{c} (Z_{n})$ for any $c > 0$ and the dominated convergence theorem, we obtain

$$T_{c} E {(D^{α} {\hat{g}}_{n} (X) - D^{α} f_{*} (X))}^{2} \to 0 \quad \text{as } n \to \infty ,$$

and hence, it follows that

$$E {(D^{α} {\hat{g}}_{n} (X) - D^{α} f_{*} (X))}^{2} \to 0 \quad \text{as } n \to \infty .$$

□

Proof of Theorem 1.

We apply arguments similar to those in Theorem 2 and Lemmas B.1–B.4 to establish the following claims.

Step 1. By Assumption 6, $J ({\hat{f}}_{n}) \leq c_{A} + 1$ for n sufficiently large a.s.
Step 2. Use arguments similar to those in the proof of Lemma B.1 to establish that Assumptions 2, 3, 6, and 8 imply that there exists a constant $β_{0} ≜ β_{0} (Ω)$ such that $| {\hat{f}}_{n} (x) | \leq β_{0}$ for all $x \in Ω$ and n sufficiently large a.s. for any bounded open subset $Ω$ of $R^{d}$ containing ${[a, b]}^{d}$ .
Step 3. Use arguments similar to those in the proof of Lemma B.2 to establish that Assumptions 2, 3, 6, and 8 imply that there exists a constant $β_{1}$ such that $| {\hat{f}}_{n} (x) - {\hat{f}}_{n} (y) | \leq β_{1} ‖ x - y ‖^{δ}$ for $x, y \in {[a, b]}^{d}$ and n sufficiently large a.s., where $δ = 1$ when $m \geq 2$ and $δ = 1 / 2$ when $m = 1$ .
Step 4. Use arguments similar to those in the proof of Lemma B.3 to establish that Assumptions 2, 3, 6, and 8 imply $(1 / n) \sum_{i = 1}^{n} ε_{i} ({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i})) \to 0$ as $n \to \infty$ a.s.
Step 5. Use arguments similar to those in the proof of Lemma B.4 to establish that Assumptions 2, 3, 6, and 8 imply $(1 / n) \sum_{i = 1}^{n} ({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i}))^{2} \to 0$ as $n \to \infty$ a.s.
Step 6. Use arguments similar to those in the proof of Theorem 2 to establish $\sup_{x \in {[a, b]}^{d}} | {\hat{f}}_{n} (x) - f_{*} (x) | \to 0$ as $n \to \infty$ a.s., $E {({\hat{f}}_{n} (X) - f_{*} (X))}^{2} \to 0$ as $n \to \infty$ , and $E {(D^{α} {\hat{f}}_{n} (X) - D^{α} f_{*} (X))}^{2} \to 0$ as $n \to \infty$ for any $α$ satisfying $| α | \leq m - 1$ . $□$

Proof of Corollary 1.

It suffices to prove that Assumption 10 implies Assumption 6 and that Assumption 11 implies Assumption 8. Because $J ({\hat{f}}_{n}) \leq U_{n}$ for all n, under Assumption 10, we have $\limsup_{n \to \infty} J ({\hat{f}}_{n}) \leq \limsup_{n \to \infty} U_{n} < \infty$, so Assumption 6 holds. Note that Assumption 11 implies that $f_{*}$ is a feasible solution to Problem (A) for all n sufficiently large a.s., so $E_{n} ({\hat{f}}_{n}) \leq E_{n} (f_{*})$, or equivalently, $\frac{1}{n} \sum_{i = 1}^{n} ({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i}))^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i}))$. $□$

Proof of Corollary 2.

It suffices to prove that Assumption 12 implies Assumption 9. Because $E_{n} ({\hat{g}}_{n}) \leq S_{n}$,

$$\frac{1}{n} \sum_{i = 1}^{n} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i}))^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{g}}_{n} (X_{i}) - f_{*} (X_{i})) + S_{n} - \frac{1}{n} \sum_{i = 1}^{n} ε_{i}^{2} .$$

So, Assumption 12 implies Assumption 9. $□$
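The "or equivalently" steps in the proofs of Corollaries 1 and 2 rest on a single expansion of the empirical squared error. The worked derivation below is a sketch under the assumption, consistent with the expansions displayed in the proof of Theorem 4, that $E_{n} (f) = \frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - f (X_{i}))^{2}$ with observations $Y_{i} = f_{*} (X_{i}) + ε_{i}$; the symbol $Y_{i}$ is introduced here only for illustration.

```latex
\begin{aligned}
E_{n}(f)
  &= \frac{1}{n}\sum_{i=1}^{n}\bigl(f_{*}(X_{i}) + \varepsilon_{i} - f(X_{i})\bigr)^{2} \\
  &= \frac{1}{n}\sum_{i=1}^{n}\bigl(f(X_{i}) - f_{*}(X_{i})\bigr)^{2}
     - \frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i}\bigl(f(X_{i}) - f_{*}(X_{i})\bigr)
     + \frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2}. \\[4pt]
\text{With } f = f_{*}:\quad
E_{n}(f_{*}) &= \frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2}. \\[4pt]
E_{n}(\hat{f}_{n}) \le E_{n}(f_{*})
  &\iff \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{f}_{n}(X_{i}) - f_{*}(X_{i})\bigr)^{2}
   \le \frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i}\bigl(\hat{f}_{n}(X_{i}) - f_{*}(X_{i})\bigr), \\
E_{n}(\hat{g}_{n}) \le S_{n}
  &\iff \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{g}_{n}(X_{i}) - f_{*}(X_{i})\bigr)^{2}
   \le \frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i}\bigl(\hat{g}_{n}(X_{i}) - f_{*}(X_{i})\bigr)
     + S_{n} - \frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2}.
\end{aligned}
```

In both cases the term $\frac{1}{n} \sum_{i = 1}^{n} ε_{i}^{2}$ is moved across the inequality; no property of the estimators beyond feasibility and optimality is used.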

Proof of Theorem 3.

Let ${\tilde{f}}_{n}$ be a solution to

$$\operatorname*{Minimize}_{f \in A_{m}} \; E_{n} (f) \quad \text{subject to} \quad J (f) \leq U_{n} .$$

We note that

$$\left\{ \frac{1}{n} \sum_{i = 1}^{n} {({\tilde{f}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \right\}^{1 / 2} = O_{p} (n^{- m / (2 m + d)}) \tag{B.21}$$

implies Theorem 3 because of Assumption 6 and Steps 2 and 3 in the proof of Theorem 1. Hence, the rest of the proof is devoted to proving (B.21).

By Assumption 13, we have

$$\frac{1}{n} \sum_{i = 1}^{n} {({\tilde{f}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} ε_{i} ({\tilde{f}}_{n} (X_{i}) - f_{*} (X_{i})) . \tag{B.22}$$

Hence, if we can prove

$$\frac{1}{\sqrt{n}} \left| \sum_{i = 1}^{n} ε_{i} ({\tilde{f}}_{n} (X_{i}) - f_{*} (X_{i})) \right| \Big/ \left\{ \frac{1}{n} \sum_{i = 1}^{n} {({\tilde{f}}_{n} (X_{i}) - f_{*} (X_{i}))}^{2} \right\}^{(1 / 2) (1 - d / 2 m)} = O_{p} (1), \tag{B.23}$$

then the combination of (B.22) and (B.23) will prove (B.21).
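For readers who want the arithmetic behind this combination spelled out, the following sketch shows how the exponent $m / (2 m + d)$ arises. The shorthand $a_{n}$ is introduced here only for illustration and is not notation from the paper; the argument is informal in its handling of $O_{p}$ terms and assumes $a_{n} > 0$ (the case $a_{n} = 0$ is trivial).

```latex
\begin{aligned}
a_{n} &:= \Bigl\{\tfrac{1}{n}\sum_{i=1}^{n}\bigl(\tilde{f}_{n}(X_{i}) - f_{*}(X_{i})\bigr)^{2}\Bigr\}^{1/2}. \\[4pt]
\text{(B.22)}:\quad
a_{n}^{2} &\le \frac{2}{\sqrt{n}}\cdot\frac{1}{\sqrt{n}}
  \Bigl|\sum_{i=1}^{n}\varepsilon_{i}\bigl(\tilde{f}_{n}(X_{i}) - f_{*}(X_{i})\bigr)\Bigr|. \\[4pt]
\text{(B.23)}:\quad
\frac{1}{\sqrt{n}}\Bigl|\sum_{i=1}^{n}\varepsilon_{i}\bigl(\tilde{f}_{n}(X_{i}) - f_{*}(X_{i})\bigr)\Bigr|
  &= O_{p}(1)\, a_{n}^{\,1 - d/(2m)}. \\[4pt]
\text{Hence}\quad
a_{n}^{2} &\le O_{p}\bigl(n^{-1/2}\bigr)\, a_{n}^{\,1 - d/(2m)}
\;\Longrightarrow\;
a_{n}^{\,1 + d/(2m)} = O_{p}\bigl(n^{-1/2}\bigr) \\
&\Longrightarrow\;
a_{n} = O_{p}\Bigl(n^{-\frac{1}{2}\cdot\frac{2m}{2m+d}}\Bigr)
      = O_{p}\bigl(n^{-m/(2m+d)}\bigr),
\end{aligned}
```

which is (B.21).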

To establish (B.23), we introduce the notion of $ϵ$-covering sets and the covering number of a metric or pseudometric space $(T, d)$ as follows. For any $ϵ > 0$, an $ϵ$-covering set of $(T, d)$ is a class of functions in T such that for any $f \in T$, there exists h in this class satisfying $d (f, h) \leq ϵ$. The covering number $N (ϵ, T, d)$ of $(T, d)$ is the number of elements of a minimal $ϵ$-covering set. In other words,

$$N (ϵ, T, d) = \min \left\{ L : \text{there exist } f_{1}, \dots, f_{L} \in T \text{ such that } T \subset \cup_{i = 1}^{L} B (f_{i}, ϵ) \right\},$$

where $B (f, ϵ) = \{ g \in T : d (f, g) \leq ϵ \}$ for $f \in T$.

We note that there exists a positive constant $Γ$ such that for each $ϵ > 0$,

$$N_{\infty} (ϵ) ≜ \log (1 + N (ϵ, A_{m}, d_{\infty})) \leq Γ ϵ^{- d / m}; \tag{B.24}$$

see, for example, Birman and Solomyak [5]. Inequality (B.24) implies

$$N_{n} (ϵ) ≜ \log (1 + N (ϵ, A_{m}, d_{n})) \leq Γ ϵ^{- d / m} . \tag{B.25}$$

Inequality (B.25), together with Step 2 in the proof of Theorem 1 and Assumption 4, implies (B.23) by van de Geer [69, lemma 8.4], which completes the proof of Theorem 3. $□$
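As an illustrative aside that is not part of the proof, the covering-number definition used above can be made concrete for a finite family of functions evaluated on a grid. The sketch below is a minimal example under stated assumptions: the function name `greedy_cover`, the grid, and the example family $x \mapsto a x$ on $[0, 1]$ are all hypothetical choices. A greedy cover is a valid $ϵ$-covering set under the sup distance, so its size is an upper bound on the covering number rather than the exact minimum.

```python
import numpy as np


def greedy_cover(F, eps):
    """Greedily build an eps-cover of the rows of F under the sup distance.

    F is an array of shape (num_functions, num_grid_points); each row holds
    one function evaluated on a common grid. Returns the row indices chosen
    as centers. The result is a valid cover, not necessarily a minimal one.
    """
    uncovered = np.ones(len(F), dtype=bool)
    centers = []
    while uncovered.any():
        i = int(np.flatnonzero(uncovered)[0])   # first function not yet covered
        centers.append(i)
        dist = np.max(np.abs(F - F[i]), axis=1)  # sup distance to the new center
        uncovered &= dist > eps                  # keep only those outside B(f_i, eps)
    return centers


# Hypothetical family: x -> a * x on [0, 1] for slopes a in [-1, 1].
grid = np.linspace(0.0, 1.0, 101)
slopes = np.linspace(-1.0, 1.0, 201)
F = slopes[:, None] * grid[None, :]

eps = 0.25
centers = greedy_cover(F, eps)
print("size of greedy eps-cover:", len(centers))
```

Shrinking `eps` and recording `len(centers)` gives a rough empirical feel for how cover sizes grow as $ϵ$ decreases, which is the quantity that (B.24) and (B.25) control for $A_{m}$.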

Proof of Corollary 3.

In the proof of Corollary 1, we already proved that Assumption 10 implies Assumption 6. It suffices to prove that Assumption 11 implies Assumption 13. Under Assumption 11, $f_{*}$ is a feasible solution to Problem (A) for n sufficiently large a.s., so $E_{n} ({\hat{f}}_{n}) \leq E_{n} (f_{*})$, or equivalently, $\frac{1}{n} \sum_{i = 1}^{n} ({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i}))^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} ε_{i} ({\hat{f}}_{n} (X_{i}) - f_{*} (X_{i}))$ for all n sufficiently large a.s. $□$

Proof of Theorem 4.

Let $L^{2} = \{ f : R^{d} \to R : E (f {(X)}^{2}) < \infty \}$ be equipped with the seminorm $| f | ≜ \{ E (f {(X)}^{2}) \}^{1 / 2}$ for $f \in L^{2}$. It should be noted that $L^{2}$ is a semi-Hilbert space and that $P_{m - 1}$ is a closed subspace of $L^{2}$. By Assumption 5(i), $f_{*} \in L^{2}$. By the projection theorem, there exists $f_{\infty} \in P_{m - 1}$ minimizing $E {(f (X) - f_{*} (X))}^{2}$ over $P_{m - 1}$. Let $ν^{*} ≜ E {(f_{\infty} (X) - f_{*} (X))}^{2}$. To prove that $ν^{*} > 0$, suppose, on the contrary, that $E {(f_{\infty} (X) - f_{*} (X))}^{2} = 0$. By Assumption 5(iii) and Assumption 2, we reach $f_{*} (x) = f_{\infty} (x)$ for $x \in {[a, b]}^{d}$, which contradicts Assumption 5(ii).

To prove the second part of Theorem 4, let $f_{P}$ be defined as in Lemma 1. We will show that for n sufficiently large,

$$0 < S_{n} \leq E_{n} (f_{P}) \tag{B.26}$$

a.s., which implies that there exists a unique solution ${\hat{g}}_{n}$ to Problem (B) for n sufficiently large a.s.

To show (B.26), we will first show that

$$E_{n} (f_{P}) \to \min_{(c_{1}, \dots, c_{M}) \in R^{M}} E {(c_{1} p_{1} (X) + \dots + c_{M} p_{M} (X) - f_{*} (X))}^{2} + σ^{2} \quad \text{as } n \to \infty . \tag{B.27}$$

To prove (B.27), we first note that $f_{P}$ is a solution to

$$\begin{array}{l}
\min_{(c_{1}, \dots, c_{M}) \in R^{M}} \frac{1}{n} \sum_{i = 1}^{n} {(c_{1} p_{1} (X_{i}) + \dots + c_{M} p_{M} (X_{i}) - f_{*} (X_{i}))}^{2} \\
\qquad - \frac{2}{n} \sum_{i = 1}^{n} ε_{i} (c_{1} p_{1} (X_{i}) + \dots + c_{M} p_{M} (X_{i}) - f_{*} (X_{i})) + \frac{1}{n} \sum_{i = 1}^{n} ε_{i}^{2}
\end{array}$$

and that

$$\begin{array}{l}
\frac{1}{n} \sum_{i = 1}^{n} {(c_{1} p_{1} (X_{i}) + \dots + c_{M} p_{M} (X_{i}) - f_{*} (X_{i}))}^{2} \\
\qquad - \frac{2}{n} \sum_{i = 1}^{n} ε_{i} (c_{1} p_{1} (X_{i}) + \dots + c_{M} p_{M} (X_{i}) - f_{*} (X_{i})) + \frac{1}{n} \sum_{i = 1}^{n} ε_{i}^{2} \\
\quad \to E {(c_{1} p_{1} (X) + \dots + c_{M} p_{M} (X) - f_{*} (X))}^{2} + σ^{2}
\end{array}$$

as $n \to \infty$ a.s. by Assumption 3(ii), Assumption 3(iii), Assumption 5(i), and the strong law of large numbers. We then use Shapiro et al. [63, theorem 5.4 on p. 159] and the fact that

$$E {(c_{1} p_{1} (X) + \dots + c_{M} p_{M} (X) - f_{*} (X))}^{2} + σ^{2} \tag{B.28}$$

is convex in $(c_{1}, \dots, c_{M})$. The only nontrivial condition of Shapiro et al. [63, theorem 5.4] is that the set of solutions to (B.28) is nonempty and bounded. In the first part of this proof, we showed that there exists a solution to (B.28). To prove that the set of solutions to (B.28) is bounded, we notice that $E {(c_{1} p_{1} (X) + \dots + c_{M} p_{M} (X))}^{2}$ is positive definite because of Assumption 2 and the fact that ${[a, b]}^{d}$ contains a $P_{m - 1}$-unisolvent set, and hence, $E {(c_{1} p_{1} (X) + \dots + c_{M} p_{M} (X) - f_{*} (X))}^{2}$ is coercive over $(c_{1}, \dots, c_{M}) \in R^{M}$. We then conclude that the set of solutions to (B.28) is bounded. Thus, the application of Shapiro et al. [63, theorem 5.4 on p. 159] yields (B.27). Because the minimizing value of (B.28) is $σ^{2} + ν^{*}$ and $\limsup_{n \to \infty} S_{n} < σ^{2} + ν^{*}$ a.s., we reach (B.26).

□
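The convergence in (B.27) can be checked numerically in a toy setting. The sketch below is an illustration only and rests on several assumptions that are not taken from the paper: $d = 1$, $X$ uniform on $[0, 1]$, $f_{*} (x) = x^{2}$, Gaussian noise with $σ = 0.1$, and $P_{m - 1}$ taken to be polynomials of degree at most 1 (so $M = 2$ with $p_{1} (x) = 1$, $p_{2} (x) = x$). In this setting the best linear approximation of $x^{2}$ in mean square is $x - 1/6$, which gives $ν^{*} = 1/180$. The function name `empirical_error_of_ls_fit` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1                      # assumed noise standard deviation


def f_star(x):
    """Assumed true function for this toy example."""
    return x ** 2


def empirical_error_of_ls_fit(n):
    """Return E_n(f_P): the average squared error of the least squares
    fit over polynomials of degree <= 1, based on n noisy observations."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = f_star(x) + sigma * rng.normal(size=n)
    P = np.column_stack([np.ones(n), x])          # basis p_1(x) = 1, p_2(x) = x
    coef, *_ = np.linalg.lstsq(P, y, rcond=None)  # least squares coefficients
    return np.mean((y - P @ coef) ** 2)


# Limit predicted by (B.27): sigma^2 + nu*, with nu* = 1/180 in this setting.
limit = sigma ** 2 + 1.0 / 180.0

for n in (100, 10_000, 200_000):
    print(n, empirical_error_of_ls_fit(n), limit)
```

Under these assumptions, any sequence $S_{n}$ whose limit superior stays strictly below `limit` would eventually fall below $E_{n} (f_{P})$, which is the content of (B.26).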

References

[1] Adams RA, Fournier J (1977) Cone conditions and properties of Sobolev spaces. J. Math. Anal. Appl. 61(3):713–734.
[2] Adams RA, Fournier JJF (2003) Sobolev Spaces, 2nd ed. (Academic Press, Amsterdam).
[3] Ankenman BE, Nelson BL, Staum J (2010) Stochastic kriging for simulation metamodeling. Oper. Res. 58(2):371–382.
[4] Bertsimas D, Mundru N (2020) Sparse convex regression. INFORMS J. Comput. 33(1):262–279.
[5] Birman MS, Solomyak MZ (1967) Piecewise-polynomial approximations of functions of the classes $W_{p}^{α}$. Math. USSR-Sbornik 2(3):295–317.
[6] Boyd S, Vandenberghe L (2004) Convex Optimization (Cambridge University Press, Cambridge, UK).
[7] Brunk HD (1958) On the estimation of parameters restricted by inequalities. Ann. Math. Statist. 29(2):437–454.
[8] Brunk HD (1970) Estimation of isotonic regression. Puri ML, ed. Nonparametric Techniques in Statistical Inference (Cambridge University Press, London), 177–197.
[9] Chen X, Ankenman BE, Nelson BL (2013) Enhancing stochastic kriging metamodels with gradient estimators. Oper. Res. 61(2):512–528.
[10] Chui CK (1988) Multivariate Splines (SIAM, Philadelphia).
[11] Chung KC, Yao TH (1977) On lattices admitting unique Lagrange interpolations. SIAM J. Numerical Anal. 14(4):735–743.
[12] Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 74(368):829–836.
[13] Cox DD (1983) Asymptotics for M-type smoothing splines. Ann. Statist. 11(2):530–551.
[14] Cox DD (1984) Multivariate smoothing spline functions. SIAM J. Numerical Anal. 21(4):789–813.
[15] Craven P, Wahba G (1979) Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31:377–403.
[16] de Boor C (1978) A Practical Guide to Splines (Springer, New York).
[17] Dierckx P (1993) Curve and Surface Fitting with Splines (Oxford University Press, New York).
[18] Donoho DL, Johnstone IM (1994) Ideal spatial adaptation via wavelet shrinkage. Biometrika 81(3):425–455.
[19] Duchon J (1979) Splines minimizing rotation invariant semi-norms in Sobolev spaces. Schempp W, Zeller K, eds. Multivariate Approximation Theory (Birkhäuser-Verlag, Basel, Switzerland), 85–100.
[20] Eubank RL (1999) Nonparametric Regression and Spline Smoothing (Marcel Dekker, New York).
[21] Franke R (1982) Scattered data interpolation: Tests of some methods. Math. Comput. 38(157):181–200.
[22] Grant M, Boyd S (2014) CVX: MATLAB software for disciplined convex programming, version 2.1. Accessed April 17, 2024, http://cvxr.com/cvx.
[23] Green PJ, Silverman BW (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach (Chapman & Hall, London).
[24] Green A, Balakrishnan S, Tibshirani R (2021) Minimax optimal regression over Sobolev spaces via Laplacian regularization on neighborhood graphs. Banerjee A, Fukumizu K, eds. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 2602–2610.
[25] Greville TNE (1969) Theory and Applications of Spline Functions (Academic Press, New York).
[26] Groeneboom P, Jongbloed G, Wellner JA (2001) Estimation of a convex function: Characterization and asymptotic theory. Ann. Statist. 29(6):1653–1698.
[27] Györfi L, Kohler M, Krzyżak A, Walk H (2002) A Distribution-Free Theory of Nonparametric Regression (Springer, New York).
[28] Hanson DL, Pledger G (1976) Consistency in concave regression. Ann. Statist. 4(6):1038–1050.
[29] Härdle W (1990) Applied Nonparametric Regression (Cambridge University Press, Cambridge, UK).
[30] Hildreth C (1954) Point estimates of ordinates of concave functions. J. Amer. Statist. Assoc. 49(267):598–619.
[31] Hillier FS, Lieberman GJ (1967) Introduction to Operations Research, 9th ed. (Holden-Day, San Francisco).
[32] Hull JC (2006) Options, Futures, and Other Derivatives (Prentice Hall, Hoboken, NJ).
[33] Hutchinson MF, de Hoog FR (1985) Noisy data with spline functions. Numerische Mathematik 47:99–106.
[34] Johnson AL, Jiang DR (2018) Shape constraints in economics and operations research. Statist. Sci. 33(4):527–546.
[35] Kersey SN (2003) On the problems of smoothing and near-interpolation. Math. Comput. 72(244):1873–1885.
[36] Keshvari A (2018) Segmented concave least squares: A nonparametric piecewise linear regression. Eur. J. Oper. Res. 266(2):585–594.
[37] Kohler M, Krzyżak A, Walk H (2006) Rates of convergence for partitioning and nearest neighbor regression estimates with unbounded data. J. Multivariate Anal. 97(2):311–323.
[38] Kohler M, Krzyżak A, Walk H (2009) Optimal global rates of convergence for nonparametric regression with unbounded data. J. Statist. Planning Inference 139(4):1286–1296.
[39] Kolmogorov AN, Tihomirov VM (1961) $ϵ$-Entropy and $ϵ$-capacity of sets in functional spaces. Amer. Math. Soc. Translations 2(17):277–364.
[40] Kuosmanen T (2008) Representation theorem for convex nonparametric least squares. Econom. J. 11(2):308–325.
[41] Kuosmanen T, Johnson AL (2010) Data envelopment analysis as nonparametric least squares regression. Oper. Res. 58(1):149–160.
[42] Kuosmanen T, Johnson AL (2017) Modeling joint production of multiple outputs in StoNED: Directional distance function approach. Eur. J. Oper. Res. 262(2):792–801.
[43] Lee C, Johnson AL, Moreno-Centeno E, Kuosmanen T (2013) A more efficient algorithm for convex nonparametric least squares. Eur. J. Oper. Res. 227(2):391–400.
[44] Lim E (2020) The limiting behavior of isotonic and convex regression estimators when the model is misspecified. Electronic J. Statist. 14(1):2053–2097.
[45] Lim E (2021) Consistency of penalized convex regression. Internat. J. Statist. Probab. 10(1):69–78.
[46] Lim E, Glynn PW (2012) Consistency of multidimensional convex regression. Oper. Res. 60(1):196–208.
[47] Luo Z, Wahba G (1997) Hybrid adaptive splines. J. Amer. Statist. Assoc. 92(437):107–116.
[48] Mammen E (1991) Nonparametric regression under qualitative smoothness assumptions. Ann. Statist. 19(2):741–759.
[49] Mazumder R, Choudhury A, Iyengar G, Sen B (2019) A computational framework for multivariate convex regression and its variants. J. Amer. Statist. Assoc. 114(525):318–331.
[50] Meinguet J (1979) Multivariate interpolation at arbitrary points made simple. Z. Angew. Math. Phys. 30:292–304.
[51] Meinguet J (1984) Surface spline interpolation: Basic theory and computational aspects. Singh SP, Burry JWH, Watson B, eds. Approximation Theory and Spline Functions, NATO ASI Series, vol. 136 (Springer, Dordrecht, Netherlands), 127–142.
[52] Myers RH, Montgomery DC (2002) Response Surface Methodology: Process and Product Optimization Using Designed Experiments (Wiley, New York).
[53] Nadaraya EA (1964) On estimating regression. Theory Probab. Appl. 9(1):141–142.
[54] Oden JT, Reddy JN (1976) An Introduction to the Mathematical Theory of Finite Elements (Wiley, New York).
[55] Reinsch CH (1967) Smoothing by spline functions. Numerische Mathematik 10:177–183.
[56] Reinsch CH (1971) Smoothing by spline functions II. Numerische Mathematik 16:451–454.
[57] Rice J, Rosenblatt M (1983) Smoothing splines: Regression, derivatives and deconvolution. Ann. Statist. 11(1):141–156.
[58] Ruppert D, Wand MP (1994) Multivariate locally weighted least squares regression. Ann. Statist. 22(3):1346–1370.
[59] Salemi P, Nelson BL, Staum J (2016) Moving least squares regression for high-dimensional stochastic simulation metamodeling. ACM Trans. Model. Comput. Simulation 26:16:1–16:25.
[60] Schoenberg IJ (1964) Spline functions and the problem of graduation. Proc. Natl. Acad. Sci. USA 52(4):947–950.
[61] Seijo E, Sen B (2011) Nonparametric least squares estimation of a multivariate convex regression function. Ann. Statist. 39(3):1633–1657.
[62] Shapiro A (2000) On the asymptotics of constrained local M-estimators. Ann. Statist. 28(3):948–960.
[63] Shapiro A, Dentcheva D, Ruszczyński AP (2009) Lectures on Stochastic Programming: Modeling and Theory (SIAM, Philadelphia).
[64] Stone CJ (1977) Consistent nonparametric regression. Ann. Statist. 5(4):595–645.
[65] Stone CJ (1980) Optimal rates of convergence for nonparametric estimators. Ann. Statist. 8(6):1348–1360.
[66] Utreras FI (1981) Optimal smoothing of noisy data using spline functions. SIAM J. Sci. Statist. Comput. 2(3):349–362.
[67] Utreras FI (1988) Convergence rates for multivariate smoothing spline functions. J. Approx. Theory 52(1):1–27.
[68] van de Geer S (1990) Estimating a regression function. Ann. Statist. 18(2):907–924.
[69] van de Geer S (2000) Empirical Processes in M-Estimation (Cambridge University Press, Cambridge, UK).
[70] Varian HR (1985) Non-parametric analysis of optimizing behavior with measurement error. J. Econometrics 30(1):445–458.
[71] Wahba G (1990) Spline Models for Observational Data (SIAM, Philadelphia).
[72] Wand MP, Jones MC (1995) Kernel Smoothing (Chapman & Hall, London).
[73] Watson GS (1964) Smooth regression analysis. Sankhya Ser. A 26(4):359–372.
[74] Wegman EJ, Wright IW (1983) Splines in statistics. J. Amer. Statist. Assoc. 78(382):351–365.
[75] Yagi D, Chen Y, Johnson AL, Kuosmanen T (2020) Shape-constrained kernel-weighted least squares: Estimating production functions for Chilean manufacturing industries. J. Bus. Econom. Statist. 38(1):43–54.
[76] Zangwill WI (1969) Nonlinear Programming: A Unified Approach (Prentice-Hall, Hoboken, NJ).

Mathematics of Operations Research, Volume 50, Issue 2, May 2025, Pages iii, 783–1583

Received: May 28, 2020
Accepted: March 25, 2024
Published Online: May 02, 2024

Cite as

Eunji Lim (2024) Estimating a Function and Its Derivatives Under a Smoothness Condition. Mathematics of Operations Research 50(2):1112-1138.

https://doi.org/10.1287/moor.2020.0161
