Foundations of Computational Mathematics, volume 18, issue 4, pages 971-1013

Optimal Rates for Regularization of Statistical Inverse Learning Problems

Publication type: Journal Article
Publication date: 2017-06-20
scimago Q1
SJR: 2.546
CiteScore: 6.9
Impact factor: 2.5
ISSN: 1615-3375, 1615-3383
Subject areas: Computational Mathematics, Computational Theory and Mathematics, Applied Mathematics, Analysis
Abstract
We consider a statistical inverse learning (also called inverse regression) problem, where we observe the image of a function f through a linear operator A at i.i.d. random design points $X_i$, superposed with additive noise. The distribution of the design points is unknown and can be very general. We analyze simultaneously the direct (estimation of Af) and the inverse (estimation of f) learning problems. In this general framework, we obtain strong and weak minimax optimal rates of convergence (as the number of observations n grows large) for a large class of spectral regularization methods over regularity classes defined through appropriate source conditions. This improves on or completes previous results obtained in related settings. The optimality of the obtained rates is shown not only in the exponent in n but also in the explicit dependency of the constant factor on the variance of the noise and the radius of the source condition set.
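
As a rough illustration of the setting (not taken from the paper), the sketch below discretizes a smoothing operator A on a grid, samples design points from an arbitrary distribution, and applies a Tikhonov-type spectral estimator; the grid, kernel width, design distribution, noise level, and regularization parameter are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Unknown function f on a grid, and a smoothing integral operator A (Gaussian blur).
t = np.linspace(0.0, 1.0, 200)
f_true = np.sin(2 * np.pi * t) + 0.5 * np.cos(6 * np.pi * t)

def A_row(x, h=0.05):
    # quadrature weights approximating (Af)(x) = integral of a Gaussian kernel against f
    w = np.exp(-0.5 * ((x - t) / h) ** 2)
    return w / w.sum()

# i.i.d. design points from an unknown (here: Beta) distribution, plus additive noise.
n, sigma = 500, 0.05
X = rng.beta(2.0, 5.0, size=n)
A_n = np.vstack([A_row(x) for x in X])          # sampled operator, one row per design point
Y = A_n @ f_true + sigma * rng.standard_normal(n)

# Tikhonov (spectral) regularization of the empirical normal equations.
lam = 1e-3
f_hat = np.linalg.solve(A_n.T @ A_n / n + lam * np.eye(t.size), A_n.T @ Y / n)

print("inverse error ||f_hat - f||:", np.linalg.norm(f_hat - f_true) / np.sqrt(t.size))
print("direct error  ||A(f_hat - f)||_n:", np.linalg.norm(A_n @ (f_hat - f_true)) / np.sqrt(n))
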
Blanchard G., Krämer N.
Analysis and Applications scimago Q1 wos Q1
2016-09-09 citations by CoLab: 17 Abstract  
We prove statistical rates of convergence for kernel-based least squares regression from i.i.d. data using a conjugate gradient (CG) algorithm, where regularization against overfitting is obtained by early stopping. This method is related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. Following the setting introduced in earlier related literature, we study so-called “fast convergence rates” depending on the regularity of the target regression function (measured by a source condition in terms of the kernel integral operator) and on the effective dimensionality of the data mapped into the kernel space. We obtain upper bounds, essentially matching known minimax lower bounds, for the ℒ2 (prediction) norm as well as for the stronger Hilbert norm, if the true regression function belongs to the reproducing kernel Hilbert space. If the latter assumption is not fulfilled, we obtain similar convergence rates for appropriate norms, provided additional unlabeled data are available.
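
A minimal sketch of the idea behind this line of work (not the authors' code): conjugate gradient applied to the kernel system K alpha = y, where the number of iterations, rather than an explicit penalty, controls regularization. The Gaussian kernel, its bandwidth, the synthetic data, and the stopping index are illustrative assumptions.

import numpy as np

def gaussian_kernel(X, Z, gamma=20.0):
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

def kernel_cg(K, y, n_iter):
    # Conjugate gradient on K alpha = y; stopping after n_iter steps acts as regularization.
    alpha = np.zeros_like(y)
    r = y - K @ alpha
    p = r.copy()
    for _ in range(n_iter):
        Kp = K @ p
        step = (r @ r) / (p @ Kp)
        alpha = alpha + step * p
        r_new = r - step * Kp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return alpha

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, 300)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(300)

alpha = kernel_cg(gaussian_kernel(X, X), y, n_iter=8)   # early stopping index fixed by hand
x_grid = np.linspace(0.0, 1.0, 100)
f_hat = gaussian_kernel(x_grid, X) @ alpha              # prediction on a grid
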
Loustau S., Marteau C.
Bernoulli scimago Q1 wos Q2
2015-02-01 citations by CoLab: 8
Loustau S.
2013-01-01 citations by CoLab: 5 Abstract  
Let $(X,Y)\in\mathcal{X}\times\mathcal{Y}$ be a random couple with unknown distribution $P$. Let $\mathcal{G}$ be a class of measurable functions and $\ell$ a loss function. The problem of statistical learning deals with the estimation of the Bayes rule: \[g^{*}=\arg\min_{g\in\mathcal{G}}\mathbb{E}_{P}\ell(g,(X,Y)).\] In this paper, we study this problem when we deal with a contaminated sample $(Z_{1},Y_{1}),\dots,(Z_{n},Y_{n})$ of i.i.d. indirect observations. Each input $Z_{i}$, $i=1,\dots,n$, is distributed according to a density $Af$, where $A$ is a known compact linear operator and $f$ is the density of the direct input $X$. We derive fast rates of convergence for the excess risk of empirical risk minimizers based on regularization methods, such as deconvolution kernel density estimators or spectral cut-off. These results are comparable to the existing fast rates in Koltchinskii (2006) for the direct case and give some insight into the effect of indirect measurements in the presence of fast rates of convergence.
Wahba G.
2011-09-30 citations by CoLab: 3449
Wang C., Zhou D.
Journal of Complexity scimago Q1 wos Q1
2011-02-01 citations by CoLab: 45 Abstract  
A standard assumption in theoretical study of learning algorithms for regression is uniform boundedness of output sample values. This excludes the common case with Gaussian noise. In this paper we investigate the learning algorithm for regression generated by the least squares regularization scheme in reproducing kernel Hilbert spaces without the assumption of uniform boundedness for sampling. By imposing some incremental conditions on moments of the output variable, we derive learning rates in terms of regularity of the regression function and capacity of the hypothesis space. The novelty of our analysis is a new covering number argument for bounding the sample error.
Caponnetto A., Yao Y.
Analysis and Applications scimago Q1 wos Q1
2010-04-20 citations by CoLab: 47 Abstract  
We consider learning algorithms induced by regularization methods in the regression setting. We show that previously obtained error bounds for these algorithms, using a priori choices of the regularization parameter, can be attained using a suitable a posteriori choice based on cross-validation. In particular, these results prove adaptation of the rate of convergence of the estimators to the minimax rate induced by the "effective dimension" of the problem. We also show universal consistency for this broad class of methods which includes regularized least-squares, truncated SVD, Landweber iteration and ν-method.
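
As an illustration of an a posteriori parameter choice of this kind (a sketch, not the paper's procedure), the code below selects the ridge parameter of kernel ridge regression by minimizing the error on a hold-out split; the kernel, the grid of candidate parameters, and the split fraction are illustrative assumptions.

import numpy as np

def gaussian_kernel(X, Z, gamma=20.0):
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

def krr_coef(K, y, lam):
    # regularized least squares: (K + n*lam*I) alpha = y
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def holdout_lambda(X, y, lambdas, train_frac=0.7, seed=0):
    # a posteriori choice: fit on a training split, pick the lambda with smallest hold-out error
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_tr = int(train_frac * len(X))
    tr, va = idx[:n_tr], idx[n_tr:]
    K_tr, K_va = gaussian_kernel(X[tr], X[tr]), gaussian_kernel(X[va], X[tr])
    errors = [np.mean((K_va @ krr_coef(K_tr, y[tr], lam) - y[va]) ** 2) for lam in lambdas]
    return lambdas[int(np.argmin(errors))]

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, 400)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(400)
best_lam = holdout_lambda(X, y, lambdas=np.logspace(-6, 0, 13))
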
Mendelson S., Neeman J.
Annals of Statistics scimago Q1 wos Q1
2009-12-31 citations by CoLab: 47 Abstract  
Under mild assumptions on the kernel, we obtain the best known error rates in a regularized learning scenario taking place in the corresponding reproducing kernel Hilbert space (RKHS). The main novelty in the analysis is a proof that one can use a regularization term that grows significantly slower than the standard quadratic growth in the RKHS norm.
Tsybakov A.B.
2009-01-01 citations by CoLab: 890
Girosi F., Jones M., Poggio T.
Neural Computation scimago Q1 wos Q3
2008-04-04 citations by CoLab: 968 Abstract  
We had previously shown that regularization principles lead to approximation schemes that are equivalent to networks with one layer of hidden units, called regularization networks. In particular, standard smoothness functionals lead to a subclass of regularization networks, the well known radial basis functions approximation schemes. This paper shows that regularization networks encompass a much broader range of approximation schemes, including many of the popular general additive models and some of the neural networks. In particular, we introduce new classes of smoothness functionals that lead to different classes of basis functions. Additive splines as well as some tensor product splines can be obtained from appropriate classes of smoothness functionals. Furthermore, the same generalization that extends radial basis functions (RBF) to hyper basis functions (HBF) also leads from additive models to ridge approximation models, containing as special cases Breiman's hinge functions, some forms of projection pursuit regression, and several types of neural networks. We propose to use the term generalized regularization networks for this broad class of approximation schemes that follow from an extension of regularization. In the probabilistic interpretation of regularization, the different classes of basis functions correspond to different classes of prior probabilities on the approximating function spaces, and therefore to different types of smoothness assumptions. In summary, different multilayer networks with one hidden layer, which we collectively call generalized regularization networks, correspond to different classes of priors and associated smoothness functionals in a classical regularization principle. Three broad classes are (1) radial basis functions that can be generalized to hyper basis functions, (2) some tensor product splines, and (3) additive splines that can be generalized to schemes of the type of ridge approximation, hinge functions, and several perceptron-like neural networks with one hidden layer.
Gerfo L.L., Rosasco L., Odone F., Vito E.D., Verri A.
Neural Computation scimago Q1 wos Q3
2008-02-06 citations by CoLab: 58 Abstract  
We discuss how a large class of regularization methods, collectively known as spectral regularization and originally designed for solving ill-posed inverse problems, gives rise to regularized learning algorithms. All of these algorithms are consistent kernel methods that can be easily implemented. The intuition behind their derivation is that the same principle allowing for the numerical stabilization of a matrix inversion problem is crucial to avoid overfitting. The various methods have a common derivation but different computational and theoretical properties. We describe examples of such algorithms, analyze their classification performance on several data sets and discuss their applicability to real-world problems.
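
A minimal sketch of the common recipe behind spectral regularization (not the authors' implementation): diagonalize the kernel matrix and replace the unstable inverse 1/s by a filter function g(s). Tikhonov, spectral cut-off (truncated SVD), and Landweber filters are shown; kernel, data, and regularization constants are illustrative assumptions.

import numpy as np

def spectral_fit(K, y, filter_fn):
    # Diagonalize the (PSD) kernel matrix and apply a spectral filter g to its eigenvalues:
    # alpha = g(K) y, so the estimator is f_hat(x) = sum_i alpha_i k(x, X_i).
    s, U = np.linalg.eigh(K)
    s = np.maximum(s, 0.0)
    return U @ (filter_fn(s) * (U.T @ y))

def tikhonov(lam):
    return lambda s: 1.0 / (s + lam)

def spectral_cutoff(lam):
    return lambda s: np.where(s >= lam, 1.0 / np.maximum(s, lam), 0.0)

def landweber_filter(n_iter, tau):
    # g(s) = tau * sum_{j<t} (1 - tau*s)^j = (1 - (1 - tau*s)^t) / s, with value tau*t at s = 0
    return lambda s: np.where(s > 0, (1.0 - (1.0 - tau * s) ** n_iter) / np.maximum(s, 1e-12),
                              tau * n_iter)

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, 200)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(200)
K = np.exp(-20.0 * (X[:, None] - X[None, :]) ** 2)

tau = 1.0 / np.linalg.norm(K, 2)             # step size ensuring a stable Landweber filter
alphas = {name: spectral_fit(K, y, g) for name, g in
          [("tikhonov", tikhonov(1.0)), ("cutoff", spectral_cutoff(1.0)),
           ("landweber", landweber_filter(50, tau))]}
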
Bissantz N., Hohage T., Munk A., Ruymgaart F.
2007-12-07 citations by CoLab: 129 Abstract  
Previously, the convergence analysis for linear statistical inverse problems has mainly focused on spectral cut-off and Tikhonov-type estimators. Spectral cut-off estimators achieve minimax rates for a broad range of smoothness classes and operators, but their practical usefulness is limited by the fact that they require a complete spectral decomposition of the operator. Tikhonov estimators are simpler to compute but still involve the inversion of an operator and achieve minimax rates only in restricted smoothness classes. In this paper we introduce a unifying technique to study the mean square error of a large class of regularization methods (spectral methods) including the aforementioned estimators as well as many iterative methods, such as $\nu$-methods and the Landweber iteration. The latter estimators converge at the same rate as spectral cut-off but require only matrix-vector products. Our results are applied to various problems; in particular we obtain precise convergence rates for satellite gradiometry, $L^2$-boosting, and errors in variable problems.
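
As a minimal illustration of the iterative, matrix-vector-product viewpoint (not taken from the paper), the sketch below runs the Landweber iteration on a toy ill-conditioned design matrix; the matrix, the step-size rule, and the iteration count are assumptions made for the example.

import numpy as np

def landweber(A, y, n_iter, tau=None):
    # f_{k+1} = f_k + tau * A^T (y - A f_k); the iteration count is the regularization parameter.
    if tau is None:
        tau = 1.0 / np.linalg.norm(A, 2) ** 2     # ensures a stable, convergent iteration
    f = np.zeros(A.shape[1])
    for _ in range(n_iter):
        f = f + tau * (A.T @ (y - A @ f))         # only matrix-vector products are needed
    return f

A = np.vander(np.linspace(0.0, 1.0, 100), 30)     # toy ill-conditioned operator/design matrix
f_true = np.zeros(30); f_true[:3] = [1.0, -0.5, 0.25]
y = A @ f_true + 0.01 * np.random.default_rng(13).standard_normal(100)
f_hat = landweber(A, y, n_iter=200)
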
Yao Y., Rosasco L., Caponnetto A.
Constructive Approximation scimago Q1 wos Q1
2007-04-04 citations by CoLab: 574 Abstract  
In this paper we study a family of gradient descent algorithms to approximate the regression function from reproducing kernel Hilbert spaces (RKHSs), the family being characterized by a polynomial decreasing rate of step sizes (or learning rate). By solving a bias-variance trade-off we obtain an early stopping rule and some probabilistic upper bounds for the convergence of the algorithms. We also discuss the implication of these results in the context of classification where some fast convergence rates can be achieved for plug-in classifiers. Some connections are addressed with Boosting, Landweber iterations, and the online learning algorithms as stochastic approximations of the gradient descent method.
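
A rough sketch of the kind of algorithm analyzed here (not the authors' code): gradient descent on the empirical least-squares risk in an RKHS with polynomially decaying step sizes, where the stopping index plays the role of the regularization parameter. The Gaussian kernel, step-size constants, synthetic data, and the way the stopping index is inspected are illustrative assumptions.

import numpy as np

def kernel_gd(K, y, n_iter, eta0=1.0, theta=0.25):
    # alpha_{k+1} = alpha_k + eta_k * (y - K alpha_k) / n  with  eta_k = eta0 * (k+1)^(-theta)
    n = len(y)
    alpha = np.zeros(n)
    path = []
    for k in range(n_iter):
        eta_k = eta0 * (k + 1) ** (-theta)
        alpha = alpha + eta_k * (y - K @ alpha) / n
        path.append(alpha.copy())
    return path

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, 200)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(200)
K = np.exp(-20.0 * (X[:, None] - X[None, :]) ** 2)

path = kernel_gd(K, y, n_iter=500)
# Early stopping: in the paper the stopping index comes from a bias-variance trade-off;
# here we simply record the training residual along the path for inspection.
residuals = [np.mean((y - K @ a) ** 2) for a in path]
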
Smale S., Zhou D.
Constructive Approximation scimago Q1 wos Q1
2007-03-15 citations by CoLab: 335 Abstract  
The regression problem in learning theory is investigated with least squares Tikhonov regularization schemes in reproducing kernel Hilbert spaces (RKHS). We follow our previous work and apply the sampling operator to the error analysis in both the RKHS norm and the L2 norm. The tool for estimating the sample error is a Bennett inequality for random variables with values in Hilbert spaces. By taking the Hilbert space to be the one consisting of Hilbert-Schmidt operators in the RKHS, we improve the error bounds in the L2 metric, motivated by an idea of Caponnetto and De Vito. The error bounds we derive in the RKHS norm, together with a Tsybakov function we discuss here, yield interesting applications to the error analysis of the (binary) classification problem, since the RKHS metric controls the one for the uniform convergence.
Temlyakov V.N.
Constructive Approximation scimago Q1 wos Q1
2007-02-09 citations by CoLab: 17 Abstract  
This paper addresses some problems of supervised learning in the setting formulated by Cucker and Smale. Supervised learning, or learning-from-examples, refers to a process that builds, on the basis of available data of inputs $x_i$ and outputs $y_i$, $i = 1,\dots,m$, a function that best represents the relation between the inputs $x \in X$ and the corresponding outputs $y \in Y$. The goal is to find an estimator $f_z$, on the basis of the given data $z := ((x_1,y_1),\dots,(x_m,y_m))$, that approximates well the regression function $f_\rho$ (or its projection) of an unknown Borel probability measure $\rho$ defined on $Z = X \times Y$. We assume that $(x_i,y_i)$, $i = 1,\dots,m$, are independent and distributed according to $\rho$. We discuss the following two problems: I. the projection learning problem (improper function learning problem); II. universal (adaptive) estimators in the proper function learning problem. In the first problem we do not impose any restrictions on the Borel measure $\rho$ except our standard assumption that $|y| \le M$ a.e. with respect to $\rho$. In this case we use the data $z$ to estimate (approximate) the $L_2(\rho_X)$ projection $(f_\rho)_W$ of $f_\rho$ onto a function class $W$ of our choice, where $\rho_X$ is the marginal probability measure. In [KT1,2] this problem has been studied for $W$ satisfying the decay condition $\varepsilon_n(W,B) \le D n^{-r}$ of the entropy numbers $\varepsilon_n(W,B)$ of $W$ in a Banach space $B$, in the case $B = C(X)$ or $B = L_2(\rho_X)$. In this paper we obtain the upper estimates in the case $\varepsilon_n(W, L_1(\rho_X)) \le D n^{-r}$, with the extra assumption that $W$ is convex. In the second problem we assume that the unknown measure $\rho$ satisfies some conditions. Following the standard approach from nonparametric statistics, we formulate these conditions in the form $f_\rho \in \Theta$. Next, we assume that the only a priori information available is that $f_\rho$ belongs to a class $\Theta$ (unknown) from a known collection $\{\Theta\}$ of classes. We want to build an estimator that provides an approximation of $f_\rho$ close to the optimal one for the class $\Theta$. Along with standard penalized least squares estimators, we consider a new method of constructing universal estimators. This method is based on a combination of two powerful ideas for building universal estimators. The first is the use of penalized least squares estimators, which works well in the general setting with rather abstract methods of approximation. The second is the idea of thresholding, which works very well when wavelet expansions are used as an approximation tool. A new estimator, which we call the big jump estimator, uses the least squares estimators and chooses the right model by a thresholding criterion instead of penalization. In this paper we illustrate how ideas and methods of approximation theory can be used in learning theory, both in formulating a problem and in solving it.
Bauer F., Pereverzev S., Rosasco L.
Journal of Complexity scimago Q1 wos Q1
2007-02-01 citations by CoLab: 143 Abstract  
In this paper we discuss a relation between learning theory and regularization of linear ill-posed inverse problems. It is well known that Tikhonov regularization can be profitably used in the context of supervised learning, where it usually goes under the name of the regularized least-squares algorithm. Moreover, the gradient descent algorithm has recently been studied as an analog of the Landweber regularization scheme. In this paper we show that a notion of regularization defined according to what is usually done for ill-posed inverse problems allows one to derive learning algorithms which are consistent and provide fast convergence rates. It turns out that for priors expressed in terms of variable Hilbert scales in reproducing kernel Hilbert spaces, our results for Tikhonov regularization match those in Smale and Zhou [Learning theory estimates via integral operators and their approximations, submitted for publication, retrievable at http://www.tti-c.org/smale.html, 2005] and improve the results for Landweber iterations obtained in Yao et al. [On early stopping in gradient descent learning, Constructive Approximation (2005), submitted for publication]. The remarkable fact is that our analysis shows that the same properties are shared by a large class of learning algorithms, essentially all the linear regularization schemes. The concept of operator monotone functions turns out to be an important tool for the analysis.
Liu J., Shi L.
Inverse Problems scimago Q1 wos Q2
2025-03-20 citations by CoLab: 0 Abstract  
By selecting different filter functions, spectral algorithms can generate various regularization methods to solve statistical inverse problems within the learning-from-samples framework. This paper combines distributed spectral algorithms with Sobolev kernels to tackle the functional linear regression problem. The design and mathematical analysis of the algorithms require only that the functional covariates are observed at discrete sample points. Furthermore, the hypothesis function spaces of the algorithms are the Sobolev spaces generated by the Sobolev kernels, optimizing both approximation capability and flexibility. Through the establishment of regularity conditions for the target function and functional covariate, we derive matching upper and lower bounds for the convergence of the distributed spectral algorithms in the Sobolev norm. This demonstrates that the proposed regularity conditions are reasonable and that the convergence analysis under these conditions is tight, capturing the essential characteristics of functional linear regression. The analytical techniques and estimates developed in this paper also improve existing results in the previous literature.
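
As a sketch of the divide-and-conquer idea behind distributed spectral algorithms (not the paper's functional-data setting), the code below splits the sample across machines, runs a local kernel ridge estimator on each block, and averages the local predictors. The Gaussian kernel stands in for the Sobolev kernels of the paper; data, kernel, and parameters are illustrative assumptions.

import numpy as np

def gaussian_kernel(X, Z, gamma=20.0):
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

def distributed_krr(X, y, n_machines, lam, x_eval):
    # Divide-and-conquer: local kernel ridge estimators are simply averaged.
    splits = np.array_split(np.arange(len(X)), n_machines)
    preds = []
    for idx in splits:
        K_loc = gaussian_kernel(X[idx], X[idx])
        alpha = np.linalg.solve(K_loc + len(idx) * lam * np.eye(len(idx)), y[idx])
        preds.append(gaussian_kernel(x_eval, X[idx]) @ alpha)
    return np.mean(preds, axis=0)

rng = np.random.default_rng(5)
X = rng.uniform(0.0, 1.0, 2000)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(2000)
x_grid = np.linspace(0.0, 1.0, 50)
f_hat = distributed_krr(X, y, n_machines=10, lam=1e-4, x_eval=x_grid)
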
Li D., Milz J.
2025-01-23 citations by CoLab: 0
Nguyen M., Mücke N.
2024-12-01 citations by CoLab: 1 Abstract  
We analyze the generalization properties of two-layer neural networks in the neural tangent kernel (NTK) regime, trained with gradient descent (GD). For early stopped GD we derive fast rates of convergence that are known to be minimax optimal in the framework of non-parametric regression in reproducing kernel Hilbert spaces. On our way, we precisely keep track of the number of hidden neurons required for generalization and improve over existing results. We further show that the weights during training remain in a vicinity around initialization, the radius being dependent on structural assumptions such as degree of smoothness of the regression function and eigenvalue decay of the integral operator associated to the NTK.
Wang W., Wang Y., Zhang X.
Management Science scimago Q1 wos Q1 Open Access
2024-12-01 citations by CoLab: 2 Abstract  
Nested simulation concerns estimating functionals of a conditional expectation via simulation. In this paper, we propose a new method based on kernel ridge regression to exploit the smoothness of the conditional expectation as a function of the multidimensional conditioning variable. Asymptotic analysis shows that the proposed method can effectively alleviate the curse of dimensionality on the convergence rate as the simulation budget increases, provided that the conditional expectation is sufficiently smooth. The smoothness bridges the gap between the cubic root convergence rate (that is, the optimal rate for the standard nested simulation) and the square root convergence rate (that is, the canonical rate for the standard Monte Carlo simulation). We demonstrate the performance of the proposed method via numerical examples from portfolio risk management and input uncertainty quantification. This paper was accepted by Baris Ata, stochastic models and simulation. Funding: The authors acknowledge financial support from the National Natural Science Foundation of China [Grant NSFC 12101149] and the Hong Kong Research Grants Council [Grants GRF 17201520 and GRF 17206821]. Supplemental Material: The e-companion and data files are available at https://doi.org/10.1287/mnsc.2022.00204 .
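
A toy sketch of the nested-simulation idea (not the paper's experiments): simulate outer scenarios of the conditioning variable, run a few inner simulations per scenario, regress the inner averages on the scenarios with kernel ridge regression to estimate the conditional expectation, and then evaluate a functional of it. The model, budget split, kernel, and functional are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)

def inner_sample(x, m):
    # toy model: Y | X = x is N(x**2, 1), so E[Y | X = x] = x**2
    return x ** 2 + rng.standard_normal(m)

# Outer scenarios and crude inner-sample estimates of the conditional expectation.
n_outer, n_inner = 500, 5
X = rng.standard_normal(n_outer)
Y_bar = np.array([inner_sample(x, n_inner).mean() for x in X])

# Kernel ridge regression of the inner averages on the conditioning variable.
gamma, lam = 1.0, 1e-3
K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
alpha = np.linalg.solve(K + n_outer * lam * np.eye(n_outer), Y_bar)
cond_mean_hat = K @ alpha

# Functional of the conditional expectation, e.g. a tail probability used in risk management.
c = 1.0
print("estimate of P(E[Y|X] > c):", np.mean(cond_mean_hat > c))
print("plug-in truth on the same scenarios:", np.mean(X ** 2 > c))
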
Lin S.
2024-11-01 citations by CoLab: 2 Abstract  
This paper focuses on parameter selection issues of kernel ridge regression (KRR). Due to special spectral properties of KRR, we find that delicate subdivision of the parameter interval shrinks the difference between two successive KRR estimates. Based on this observation, we develop an early-stopping type parameter selection strategy for KRR according to the so-called Lepskii-type principle. Theoretical verifications are presented in the framework of learning theory to show that KRR equipped with the proposed parameter selection strategy succeeds in achieving optimal learning rates and adapts to different norms, providing a new record of parameter selection for kernel methods.
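
A sketch of a Lepskii-type balancing rule for kernel ridge regression, in the spirit of (but not reproducing) the paper's strategy: among estimators on a decreasing grid of ridge parameters, keep the most regularized one that agrees, up to a noise proxy, with every less regularized one. The variance proxy psi(lambda) proportional to 1/sqrt(n*lambda), its constant kappa, and the kernel are illustrative placeholders, not the paper's calibrated quantities.

import numpy as np

def gaussian_kernel(X, Z, gamma=20.0):
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

def lepskii_krr(X, y, lambdas, kappa=0.1):
    # lambdas sorted in decreasing order; psi grows as lambda decreases (crude variance proxy).
    n = len(X)
    K = gaussian_kernel(X, X)
    fits = [K @ np.linalg.solve(K + n * lam * np.eye(n), y) for lam in lambdas]
    psi = [kappa / np.sqrt(n * lam) for lam in lambdas]
    for k in range(len(lambdas)):
        # Lepskii rule: take the most regularized estimate that stays within the noise band
        # of every less regularized estimate.
        if all(np.sqrt(np.mean((fits[k] - fits[i]) ** 2)) <= 4 * psi[i]
               for i in range(k + 1, len(lambdas))):
            return lambdas[k]
    return lambdas[-1]

rng = np.random.default_rng(7)
X = rng.uniform(0.0, 1.0, 300)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(300)
lam_hat = lepskii_krr(X, y, lambdas=np.logspace(0, -6, 13))
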
Bisiacco M., Pillonetto G.
2024-10-01 citations by CoLab: 0
Rastogi A.
Journal of Complexity scimago Q1 wos Q1
2024-06-01 citations by CoLab: 2 Abstract  
In this paper, we study Tikhonov regularization scheme in Hilbert scales for a nonlinear statistical inverse problem with general noise. The regularizing norm in this scheme is stronger than the norm in the Hilbert space. We focus on developing a theoretical analysis for this scheme based on conditional stability estimates. We utilize the concept of the distance function to establish high probability estimates of the direct and reconstruction errors in the Reproducing Kernel Hilbert space setting. Furthermore, explicit rates of convergence in terms of sample size are established for the oversmoothing case and the regular case over the regularity class defined through an appropriate source condition. Our results improve upon and generalize previous results obtained in related settings.
Shi L., Zhang Z.
Analysis and Applications scimago Q1 wos Q1
2024-04-22 citations by CoLab: 0 Abstract  
Kernel methods are popular in nonlinear and nonparametric regression due to their solid mathematical foundations and optimal statistical properties. However, scalability remains the primary bottleneck in applying kernel methods to large-scale data regression analysis. This paper aims to improve the scalability of kernel methods. We combine Nyström subsampling and the preconditioned conjugate gradient method to solve regularized kernel regression. Our theoretical analysis indicates that achieving optimal convergence rates requires only [Formula: see text] memory and [Formula: see text] time (up to logarithmic factors). Numerical experiments show that our algorithm outperforms existing methods in time efficiency and prediction accuracy on large-scale datasets. Notably, compared to the FALKON algorithm [A. Rudi, L. Carratino and L. Rosasco, Falkon: An optimal large scale kernel method, in Advances in Neural Information Processing Systems (Curran Associates, 2017), pp. 3891–3901], which is known as the optimal large-scale kernel method, our method is more flexible (applicable to non-positive definite kernel functions) and has a lower algorithmic complexity. Additionally, our established theoretical analysis further relaxes the restrictive conditions on hyperparameters previously imposed in convergence analyses.
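
A minimal sketch of Nyström subsampling for kernel ridge regression (the preconditioned conjugate gradient solver analyzed in the paper is replaced by a direct solve for brevity): restrict the estimator to the span of m randomly chosen centers and solve the reduced m-dimensional system. Kernel, subsample size, and regularization are illustrative assumptions.

import numpy as np

def gaussian_kernel(X, Z, gamma=20.0):
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

def nystrom_krr(X, y, m, lam, seed=0):
    # Minimize (1/n)||K_nm c - y||^2 + lam * c^T K_mm c over c in R^m,
    # i.e. kernel ridge regression restricted to m random centers.
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = rng.choice(n, size=m, replace=False)
    K_nm = gaussian_kernel(X, X[centers])
    K_mm = gaussian_kernel(X[centers], X[centers])
    jitter = 1e-10 * np.eye(m)                    # small jitter for numerical stability
    c = np.linalg.solve(K_nm.T @ K_nm + n * lam * K_mm + jitter, K_nm.T @ y)
    return X[centers], c

rng = np.random.default_rng(8)
X = rng.uniform(0.0, 1.0, 3000)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(3000)
centers, c = nystrom_krr(X, y, m=100, lam=1e-4)
x_grid = np.linspace(0.0, 1.0, 50)
f_hat = gaussian_kernel(x_grid, centers) @ c
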
Lin S., Wang D., Zhou D.
2024-01-25 citations by CoLab: 3
Feng J., Kulick C., Ren Y., Tang S.
Mathematics of Computation scimago Q1 wos Q1
2023-11-15 citations by CoLab: 2 Abstract  
Interacting particle or agent systems that exhibit diverse swarming behaviors are prevalent in science and engineering. Developing effective differential equation models to understand the connection between individual interaction rules and swarming is a fundamental and challenging goal. In this paper, we study the data-driven discovery of a second-order particle swarming model that describes the evolution of $N$ particles in $\mathbb{R}^d$ under radial interactions. We propose a learning approach that models the latent radial interaction function as Gaussian processes, which can simultaneously fulfill two inference goals: one is the nonparametric inference of the interaction function with pointwise uncertainty quantification, and the other is the inference of unknown scalar parameters in the noncollective friction forces of the system. We formulate the learning problem as a statistical inverse learning problem and introduce an operator-theoretic framework that provides a detailed analysis of recoverability conditions, establishing that a coercivity condition is sufficient for recoverability. Given data collected from $M$ i.i.d. trajectories with independent Gaussian observational noise, we provide a finite-sample analysis, showing that our posterior mean estimator converges in a reproducing kernel Hilbert space norm, at an optimal rate in $M$ equal to the one in classical one-dimensional kernel ridge regression. As a byproduct, we show that we can obtain a parametric learning rate in $M$ for the posterior marginal variance using the $L^\infty$ norm, and that the rate could also involve $N$ and $L$ (the number of observation time instances for each trajectory), depending on the condition number of the inverse problem. We provide numerical results on systems exhibiting different swarming behaviors, highlighting the effectiveness of our approach in the scarce, noisy trajectory data regime.
Hucker L., Wahl M.
2023-10-03 citations by CoLab: 0 Abstract  
We analyze the prediction error of principal component regression (PCR) and prove high probability bounds for the corresponding squared risk conditional on the design. Our first main result shows that PCR performs comparably to the oracle method obtained by replacing empirical principal components by their population counterparts, provided that an effective rank condition holds. On the other hand, if the latter condition is violated, then empirical eigenvalues start to have a significant upward bias, resulting in a self-induced regularization of PCR. Our approach relies on the behavior of empirical eigenvalues, empirical eigenvectors and the excess risk of principal component analysis in high-dimensional regimes.
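
A minimal sketch of principal component regression as analyzed here (not the paper's code): project the centered design onto its top empirical principal components and run least squares on the projected scores; the number of components is the tuning parameter. The synthetic design with decaying spectrum and the component count are illustrative assumptions.

import numpy as np

def pcr_fit(X, y, n_components):
    # Principal component regression: least squares on the top principal components of X.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:n_components].T                 # top empirical eigenvectors of the covariance
    scores = Xc @ V_k
    coef_k, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    beta = V_k @ coef_k                       # coefficients in the original coordinates
    return beta, y.mean() - X.mean(axis=0) @ beta

rng = np.random.default_rng(9)
n, d = 500, 50
X = rng.standard_normal((n, d)) @ np.diag(1.0 / np.arange(1, d + 1))   # decaying spectrum
beta_true = np.zeros(d); beta_true[:5] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)
beta_hat, intercept = pcr_fit(X, y, n_components=10)
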
Li Y., Zhang H., Lin Q.
Biometrika scimago Q1 wos Q1
2023-08-07 citations by CoLab: 0 Abstract  
One of the most interesting problems in the recent renaissance of the studies in kernel regression might be whether kernel interpolation can generalize well, since it may help us understand the ‘benign overfitting phenomenon’ reported in the literature on deep networks. In this paper, under mild conditions, we show that for any $\varepsilon > 0$, the generalization error of kernel interpolation is lower bounded by $\Omega(n^{-\varepsilon})$. In other words, kernel interpolation generalizes poorly for a large class of kernels. As a direct corollary, we can show that overfitted wide neural networks defined on the sphere generalize poorly.
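
A toy experiment in the spirit of this result (not the paper's setting or proof): compare the minimum-norm kernel interpolant of noisy data (ridge parameter zero) with a ridge-regularized fit on held-out points. The kernel, bandwidth, noise level, and sample sizes are illustrative assumptions, and the printed numbers are only indicative.

import numpy as np

rng = np.random.default_rng(10)

def gaussian_kernel(X, Z, gamma=50.0):
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

n = 200
X = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(n)
X_test = rng.uniform(0.0, 1.0, 1000)
f_test = np.sin(2 * np.pi * X_test)

K = gaussian_kernel(X, X)
K_test = gaussian_kernel(X_test, X)

# Kernel interpolation: minimum-norm interpolant of the noisy data (no regularization).
alpha_interp = np.linalg.solve(K + 1e-10 * np.eye(n), y)   # tiny jitter only for conditioning
# Kernel ridge regression with a nonzero regularization parameter, for comparison.
alpha_ridge = np.linalg.solve(K + n * 1e-3 * np.eye(n), y)

print("interpolation test MSE:", np.mean((K_test @ alpha_interp - f_test) ** 2))
print("ridge         test MSE:", np.mean((K_test @ alpha_ridge - f_test) ** 2))
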
Guo Z., Christmann A., Shi L.
2023-07-26 citations by CoLab: 7 Abstract  
In this paper, we study an online learning algorithm with a robust loss function $\mathcal{L}_{\sigma}$ for regression over a reproducing kernel Hilbert space (RKHS). The loss function $\mathcal{L}_{\sigma}$, involving a scaling parameter $\sigma > 0$, can cover a wide range of commonly used robust losses. The proposed algorithm is then a robust alternative for online least squares regression aiming to estimate the conditional mean function. For properly chosen $\sigma$ and step size, we show that the last iterate of this online algorithm can achieve optimal capacity-independent convergence in the mean square distance. Moreover, if additional information on the underlying function space is known, we also establish optimal capacity-dependent rates for strong convergence in the RKHS. To the best of our knowledge, both results are new to the existing literature on online learning.
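
A sketch of the type of algorithm described (not the authors' exact scheme): a single pass of stochastic functional gradient descent in the RKHS with a robust, Huber-type loss whose scale sigma bounds the gradient. The kernel, step-size schedule, loss, and contaminated synthetic data are illustrative assumptions.

import numpy as np

def gaussian_kernel(X, Z, gamma=20.0):
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

def robust_score(residual, sigma):
    # derivative of a Huber-type robust loss L_sigma; bounded by sigma
    return np.clip(residual, -sigma, sigma)

def online_robust_regression(X, y, sigma=1.0, eta0=0.5, theta=0.5):
    # One pass over the data: f_t = f_{t-1} - eta_t * L'_sigma(f_{t-1}(x_t) - y_t) * k(x_t, .)
    n = len(X)
    coef = np.zeros(n)                            # coefficient of k(x_t, .) added at step t
    for t in range(n):
        eta_t = eta0 / (t + 1) ** theta
        pred = gaussian_kernel(X[t:t + 1], X[:t]) @ coef[:t] if t > 0 else np.zeros(1)
        coef[t] = -eta_t * robust_score(pred[0] - y[t], sigma)
    return coef

rng = np.random.default_rng(11)
X = rng.uniform(0.0, 1.0, 500)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(500)
y[::25] += 5.0                                    # heavy-tailed contamination
coef = online_robust_regression(X, y)
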
Bubba T.A., Burger M., Helin T., Ratti L.
Inverse Problems and Imaging scimago Q2 wos Q3
2023-04-12 citations by CoLab: 2 Abstract  
We consider a statistical inverse learning problem, where the task is to estimate a function f based on noisy point evaluations of Af, where A is a linear operator. The function Af is evaluated at i.i.d. random design points $u_n$, $n = 1, \dots, N$, generated by an unknown general probability distribution. We consider Tikhonov regularization with general convex and p-homogeneous penalty functionals and derive concentration rates of the regularized solution to the ground truth, measured in the symmetric Bregman distance induced by the penalty functional. We derive concrete rates for Besov norm penalties and numerically demonstrate the correspondence with the observed rates in the context of X-ray tomography.
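
As a rough illustration of Tikhonov regularization with a convex, 1-homogeneous penalty (a simple l1 stand-in for the Besov-type penalties in the paper, not the paper's tomography experiments), the sketch below solves the penalized problem by iterative soft thresholding (ISTA); the operator, penalty weight, step size, and synthetic sparse target are illustrative assumptions.

import numpy as np

def ista_l1(A, y, lam, n_iter=500):
    # Minimize (1/2n)||A f - y||^2 + lam * ||f||_1 by iterative soft thresholding (ISTA).
    n = A.shape[0]
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 / n)          # 1 / Lipschitz constant of the gradient
    f = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ f - y) / n
        z = f - step * grad
        f = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft thresholding
    return f

rng = np.random.default_rng(12)
t = np.linspace(0.0, 1.0, 128)
f_true = np.zeros_like(t); f_true[30:40] = 1.0; f_true[80:85] = -0.5   # sparse, piecewise target
X = rng.uniform(0.0, 1.0, 400)                                          # random design points
A = np.exp(-0.5 * ((X[:, None] - t[None, :]) / 0.03) ** 2)              # sampled smoothing operator
A /= A.sum(axis=1, keepdims=True)
y = A @ f_true + 0.01 * rng.standard_normal(len(X))
f_hat = ista_l1(A, y, lam=1e-4)
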

Top-30 Journals and Top-30 Publishers: interactive charts (data not recoverable from the page text).