Applied and Computational Harmonic Analysis, volume 70, article 101632

High-probability generalization bounds for pointwise uniformly stable algorithms

Publication type: Journal Article
Publication date: 2024-05-01
Scimago: Q1
SJR: 2.231
CiteScore: 5.4
Impact factor: 2.6
ISSN: 1063-5203, 1096-603X
Subject area: Applied Mathematics
Abstract
Algorithmic stability is a fundamental concept in statistical learning theory for understanding the generalization behavior of optimization algorithms. Existing high-probability bounds are developed for the generalization gap as measured by function values and require the algorithm to be uniformly stable. In this paper, we introduce a novel stability measure, called pointwise uniform stability, by considering the sensitivity of the algorithm with respect to the perturbation of each individual training example. We show that this weaker pointwise uniform stability still guarantees almost optimal bounds and gives the first high-probability bound for the generalization gap as measured by gradients. Sharper bounds are given for strongly convex and smooth problems. We further apply our general results to derive improved generalization bounds for stochastic gradient descent. As a byproduct, we develop concentration inequalities for sums of weakly dependent vector-valued random variables.
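The sensitivity notion in the abstract can be probed numerically. The sketch below is only illustrative and not from the paper: `sgd`, `pointwise_stability`, the squared loss, and all hyperparameters are our own choices. It trains SGD twice, once on a dataset and once on a copy with one example replaced, and reports the distance between the two outputs.

```python
import numpy as np

def sgd(data, w0, lr=0.01, epochs=5, seed=0):
    """Plain SGD on the squared loss (w.x - y)^2 / 2; the fixed seed makes
    both runs below visit examples in the same order."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            x, y = data[i]
            w = w - lr * (w @ x - y) * x  # gradient of the squared loss
    return w

def pointwise_stability(data, i, new_example, **kw):
    """Sensitivity of the SGD output to replacing the i-th training example."""
    w0 = np.zeros(len(data[0][0]))
    perturbed = list(data)
    perturbed[i] = new_example
    return np.linalg.norm(sgd(data, w0, **kw) - sgd(perturbed, w0, **kw))

rng = np.random.default_rng(1)
data = [(rng.normal(size=3), rng.normal()) for _ in range(50)]
gap = pointwise_stability(data, 0, (rng.normal(size=3), rng.normal()))
print(gap)  # replacing one of 50 examples moves the output only slightly
```

Repeating this for each index i gives a per-example sensitivity profile, which is the kind of quantity the pointwise notion controls.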
Guo X., Guo Z., Shi L.
2023-11-01 citations by CoLab: 10 Abstract  
This article provides convergence analysis of online stochastic gradient descent algorithms for functional linear models. By characterizing the regularity of the slope function, the capacity of the kernel space, and the capacity of the covariance operator of the sampling process, we achieve significant improvements in the convergence rates. Both prediction problems and estimation problems are studied, where we show that the capacity assumption can alleviate the saturation of the convergence rate as the regularity of the target function increases. We show that with a properly selected kernel, capacity assumptions can fully compensate for the regularity assumptions for prediction problems (but not for estimation problems). This demonstrates the significant difference between the prediction problems and the estimation problems in functional data analysis.
Villa S., Matet S., Vũ B.C., Rosasco L.
Analysis and Applications scimago Q1 wos Q1
2022-11-18 citations by CoLab: 4 Abstract  
Implicit regularization refers to the property of optimization algorithms to be biased towards a certain class of solutions. This property is relevant to understand the behavior of modern machine learning algorithms as well as to design efficient computational methods. While the case where the bias is given by a Euclidean norm is well understood, implicit regularization schemes for more general classes of biases are much less studied. In this work, we consider the case where the bias is given by a strongly convex functional, in the context of linear models, and data possibly corrupted by noise. In particular, we propose and analyze accelerated optimization methods and highlight a trade-off between convergence speed and stability. Theoretical findings are complemented by an empirical analysis on high-dimensional inverse problems in machine learning and signal processing, showing excellent results compared to the state of the art.
Chen X., Tang B., Fan J., Guo X.
Journal of Complexity scimago Q1 wos Q1
2022-06-01 citations by CoLab: 16 Abstract  
The functional linear model is a widely applied general framework for regression problems, including those with intrinsically infinite-dimensional data. Online gradient descent methods, despite their demonstrated power for processing online or large-scale data, are not well studied for learning with functional data. In this paper, we study reproducing kernel-based online learning algorithms for functional data, and derive convergence rates for the expected excess prediction risk under both online and finite-horizon settings of step sizes. It is well understood that nontrivial uniform convergence rates for the estimation task depend on the regularity of the slope function. Surprisingly, the convergence rates we derive for the prediction task require no regularity assumption on the slope. Our analysis reveals the intrinsic difference between the estimation task and the prediction task in functional data learning.
Wang P., Lei Y., Ying Y., Zhang H.
2022-01-01 citations by CoLab: 9 Abstract  
In this paper, we are concerned with differentially private stochastic gradient descent (SGD) algorithms in the setting of stochastic convex optimization (SCO). Most of the existing work requires the loss to be Lipschitz continuous and strongly smooth, and the model parameter to be uniformly bounded. However, these assumptions are restrictive, as many popular losses violate these conditions, including the hinge loss for SVM, the absolute loss in robust regression, and even the least-squares loss on an unbounded domain. We significantly relax these restrictive assumptions and establish privacy and generalization (utility) guarantees for private SGD algorithms using output and gradient perturbations associated with non-smooth convex losses. Specifically, the loss function is relaxed to have an α-Hölder continuous gradient (referred to as α-Hölder smoothness), which instantiates Lipschitz continuity (α = 0) and strong smoothness (α = 1). We prove that noisy SGD with α-Hölder smooth losses using gradient perturbation can guarantee (ϵ, δ)-differential privacy (DP) and attain the optimal excess population risk O(√(d log(1/δ))/(nϵ) + 1/√n), up to logarithmic terms, with gradient complexity O(n^((2−α)/(1+α)) + n). This shows an important trade-off between the α-Hölder smoothness of the loss and the computational complexity of private SGD with statistically optimal performance. In particular, our results indicate that α-Hölder smoothness with α ≥ 1/2 is sufficient to guarantee (ϵ, δ)-DP of noisy SGD algorithms while achieving the optimal excess risk with linear gradient complexity O(n).
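The gradient-perturbation mechanism described above can be sketched generically. The code below is our own minimal illustration, not the paper's calibrated algorithm: the noise scale `sigma` is a placeholder, and a real (ϵ, δ)-DP guarantee requires calibrating it to the clipped sensitivity and the number of iterations.

```python
import numpy as np

def noisy_sgd(per_example_grads, w0, lr=0.1, clip=1.0, sigma=1.0, seed=0):
    """Gradient-perturbation sketch: clip each per-example gradient and add
    Gaussian noise before the update. sigma is a placeholder, not a
    privacy-calibrated value."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for grad in per_example_grads:
        g = grad(w)
        g = g / max(1.0, np.linalg.norm(g) / clip)            # clip: ||g|| <= clip
        g = g + rng.normal(scale=sigma * clip, size=g.shape)  # Gaussian mechanism
        w = w - lr * g
    return w

rng = np.random.default_rng(2)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
# Per-example gradients of the squared loss, as a simple stand-in objective.
grads = [lambda w, x=x, t=t: (w @ x - t) * x for x, t in zip(X, y)]
w_priv = noisy_sgd(grads, np.zeros(3))
print(w_priv)
```

The clipping step bounds each example's influence, which is exactly what makes the Gaussian noise addition meaningful for privacy accounting.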
Vershynin R.
2018-09-27 citations by CoLab: 749
Ying Y., Zhou D.
2017-03-01 citations by CoLab: 33 Abstract  
In this paper, we consider unregularized online learning algorithms in a Reproducing Kernel Hilbert Space (RKHS). Firstly, we derive explicit convergence rates of the unregularized online learning algorithms for classification associated with a general α-activating loss (see Definition 1 below). Our results extend and refine the results in [30] for the least-squares loss and the recent result [3] for loss functions with a Lipschitz-continuous gradient. Moreover, we establish a very general condition on the step sizes which guarantees the convergence of the last iterate of such algorithms. Secondly, we establish, for the first time, the convergence of the unregularized pairwise learning algorithm with a general loss function and derive explicit rates under the assumption of polynomially decaying step sizes. Concrete examples are used to illustrate our main results. The main techniques are tools from convex analysis, refined inequalities of Gaussian averages [5], and an induction approach.
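Unregularized online learning in an RKHS with polynomially decaying step sizes can be written in a few lines. The sketch below is our own illustration for the squared loss with an RBF kernel; the kernel, `gamma`, and `eta0` are arbitrary choices, not values from the paper.

```python
import numpy as np

def rbf(x, z, gamma=50.0):
    """Gaussian (RBF) kernel; gamma is an arbitrary bandwidth choice."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def online_kernel_ls(stream, eta0=0.5):
    """Unregularized online gradient descent in an RKHS for the squared loss:
    f_{t+1} = f_t - eta_t (f_t(x_t) - y_t) K(x_t, .), with polynomially
    decaying step sizes eta_t = eta0 / sqrt(t)."""
    pts, coef = [], []
    for t, (x, y) in enumerate(stream, start=1):
        pred = sum(c * rbf(z, x) for c, z in zip(coef, pts))
        coef.append(-(eta0 / np.sqrt(t)) * (pred - y))  # one gradient step
        pts.append(x)
    return lambda x: sum(c * rbf(z, x) for c, z in zip(coef, pts))

xs = np.linspace(0.0, 1.0, 80)
f = online_kernel_ls((x, np.sin(2 * np.pi * x)) for x in xs)
print(f(0.25))
```

Each update adds one kernel section K(x_t, ·) to the expansion, so the iterate stays in the span of the observed data, which is why no explicit regularization term is needed for the representation itself.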
Ghadimi S., Lan G.
SIAM Journal on Optimization scimago Q1 wos Q1
2013-12-03 citations by CoLab: 497 Abstract  
In this paper, we introduce a new stochastic approximation (SA) type algorithm, namely the randomized stochastic gradient (RSG) method, for solving an important class of nonlinear (possibly nonconvex) stochastic programming (SP) problems. We establish the complexity of this method for computing an approximate stationary point of a nonlinear programming problem. We also show that this method possesses a nearly optimal rate of convergence if the problem is convex. We discuss a variant of the algorithm which applies a post-optimization phase to evaluate a short list of solutions generated by several independent runs of the RSG method, and show that this modification significantly improves the large-deviation properties of the algorithm. These methods are then specialized for solving a class of simulation-based optimization problems in which only stochastic zeroth-order information is available.
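The distinctive feature of the RSG method is its randomized stopping rule: run stochastic gradient steps and output an iterate chosen at random rather than the last one. The sketch below is our own minimal rendering of that idea (uniform selection over all iterates; the paper's selection distribution depends on the step sizes).

```python
import numpy as np

def rsg(sgrad, w0, n_iters=200, lr=0.05, seed=0):
    """RSG sketch: run SGD, then output an iterate chosen uniformly at
    random -- the randomized stopping rule behind the stationarity
    guarantee for nonconvex problems."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    iterates = [w.copy()]
    for _ in range(n_iters):
        w = w - lr * sgrad(w, rng)
        iterates.append(w.copy())
    return iterates[rng.integers(len(iterates))]

# Noisy gradient of f(w) = ||w||^2 / 2: the true gradient w plus Gaussian noise.
noisy_grad = lambda w, rng: w + 0.1 * rng.normal(size=w.shape)
w_out = rsg(noisy_grad, [5.0, 5.0])
print(np.linalg.norm(w_out))
```

Returning a random iterate converts an average-case bound on gradient norms along the trajectory into a guarantee for the single returned point.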
Boucheron S., Lugosi G., Massart P.
2013-02-07 citations by CoLab: 1226
Cucker F., Zhou D.X.
2007-03-29 citations by CoLab: 361 Abstract  
The goal of learning theory is to approximate a function from sample values. To attain this goal learning theory draws on a variety of diverse subjects, specifically statistics, approximation theory, and algorithmics. Ideas from all these areas blended to form a subject whose many successful applications have triggered a rapid growth during the last two decades. This is the first book to give a general overview of the theoretical foundations of the subject emphasizing the approximation theory, while still giving a balanced overview. It is based on courses taught by the authors, and is reasonably self-contained so will appeal to a broad spectrum of researchers in learning theory and adjacent fields. It will also serve as an introduction for graduate students and others entering the field, who wish to see how the problems raised in learning theory relate to other disciplines.
Evgeniou T., Pontil M., Elisseeff A.
Machine Learning scimago Q1 wos Q2
2004-04-01 citations by CoLab: 85 Abstract  
We study the leave-one-out and generalization errors of voting combinations of learning machines. A special case considered is a variant of bagging. We analyze in detail combinations of kernel machines, such as support vector machines, and present theoretical estimates of their leave-one-out error. We also derive novel bounds on the stability of combinations of any classifiers. These bounds can be used to formally show that, for example, bagging increases the stability of unstable learning machines. We report experiments supporting the theoretical findings.
Zhou D.
Journal of Complexity scimago Q1 wos Q1
2002-09-18 citations by CoLab: 199 Abstract  
The covering number of a ball of a reproducing kernel Hilbert space, as a subset of the space of continuous functions, plays an important role in Learning Theory. We give estimates for this covering number by means of the regularity of the Mercer kernel K. For convolution-type kernels K(x, t) = k(x − t) on [0, 1]^n, we provide estimates depending on the decay of k̂, the Fourier transform of k. In particular, when k̂ decays exponentially, our estimate for this covering number is better than all the previous results and covers many important Mercer kernels. A counterexample is presented to show that the eigenfunctions of the Hilbert–Schmidt operator L_K associated with a Mercer kernel K may not be uniformly bounded. Hence some previous methods used for estimating the covering number in Learning Theory are not valid. We also provide an example of a Mercer kernel to show that L_K^{1/2} may not be generated by a Mercer kernel.
Bartlett P.L., Mendelson S.
2001-01-01 citations by CoLab: 167 Abstract  
We investigate the use of certain data-dependent estimates of the complexity of a function class, called Rademacher and gaussian complexities. In a decision theoretic setting, we prove general risk bounds in terms of these complexities. We consider function classes that can be expressed as combinations of functions from basis classes and show how the Rademacher and gaussian complexities of such a function class can be bounded in terms of the complexity of the basis classes.We give examples of the application of these techniques in finding data-dependent risk bounds for decision trees, neural networks and support vector machines.
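For a finite function class, the data-dependent quantity in question can be estimated directly by Monte Carlo. The sketch below is our own illustration, not from the paper: it draws random Rademacher signs and averages the supremum of the signed empirical means.

```python
import numpy as np

def empirical_rademacher(preds, n_trials=2000, seed=0):
    """Monte-Carlo estimate of the empirical Rademacher complexity of a
    finite class: E_sigma [ sup_f (1/n) sum_i sigma_i f(x_i) ], where
    preds[f, i] = f(x_i) on the fixed sample."""
    rng = np.random.default_rng(seed)
    num_f, n = preds.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        total += np.max(preds @ sigma) / n       # sup over the finite class
    return total / n_trials

rng = np.random.default_rng(3)
preds = rng.choice([-1.0, 1.0], size=(5, 100))  # 5 binary functions, n = 100
est = empirical_rademacher(preds)
print(est)  # finite-class (Massart) bound: at most sqrt(2 ln 5 / 100) ~ 0.18
```

For richer classes the supremum must be computed or bounded structurally, which is where the combination rules for basis classes described in the abstract come in.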
Raab M., Steger A.
1998-01-01 citations by CoLab: 225 Abstract  
Suppose we sequentially throw m balls into n bins. It is a natural question to ask for the maximum number of balls in any bin. In this paper we shall derive sharp upper and lower bounds which are reached with high probability. We prove bounds for all values of m(n) ≧ n/polylog(n) by using the simple and well-known method of the first and second moment.
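The balls-into-bins experiment is easy to simulate; the snippet below is a plain illustration of the m = n case, where the maximum load concentrates around ln n / ln ln n.

```python
import numpy as np

def max_load(m, n, seed=0):
    """Throw m balls into n bins uniformly at random; return the max bin load."""
    rng = np.random.default_rng(seed)
    return int(np.bincount(rng.integers(n, size=m), minlength=n).max())

n = 1000
load = max_load(n, n)
print(load, np.log(n) / np.log(np.log(n)))  # simulated max load vs. ln n / ln ln n
```

Repeating the simulation over many seeds shows the concentration the paper proves: the maximum load rarely strays more than an additive constant from the ln n / ln ln n scale.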
Vapnik V.N.
1995-01-01 citations by CoLab: 16888
Devroye L., Wagner T.
1979-03-01 citations by CoLab: 62 Abstract  
In the discrimination problem the random variable θ, known to take values in {1, …, M}, is estimated from the random vector X taking values in R^d. All that is known about the joint distribution of (X, θ) is that which can be inferred from a sample (X_1, θ_1), …, (X_n, θ_n) of size n drawn from that distribution. A discrimination rule is any procedure which determines a decision θ̂ for θ from X and (X_1, θ_1), …, (X_n, θ_n). The rule is called k-local if the decision θ̂ depends only on X and the pairs (X_i, θ_i) for which X_i is one of the k closest to X from X_1, …, X_n. If L_n denotes the probability of error for a k-local rule given the sample, then estimates L̂_n of L_n are determined for which P{|L̂_n − L_n| ≥ ε} ≤ A exp(−Bn), where A and B are positive constants depending only on d, M, and ε.
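The canonical k-local rule is k-nearest neighbors, and a natural data-based estimate L̂_n is the leave-one-out error. The sketch below is our own illustration of that pairing (the cluster data and all parameters are arbitrary), not the paper's construction.

```python
import numpy as np

def knn_predict(X, y, x, k):
    """k-nearest-neighbor majority vote -- the canonical k-local rule."""
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    vals, counts = np.unique(y[idx], return_counts=True)
    return vals[np.argmax(counts)]

def loo_error(X, y, k):
    """Leave-one-out estimate of the error of the k-local rule."""
    n, errs, mask = len(X), 0, np.ones(len(X), dtype=bool)
    for i in range(n):
        mask[i] = False  # hold out example i
        errs += knn_predict(X[mask], y[mask], X[i], k) != y[i]
        mask[i] = True
    return errs / n

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(size=(30, 2)), rng.normal(loc=5.0, size=(30, 2))])
y = np.array([0] * 30 + [1] * 30)
err = loo_error(X, y, k=3)
print(err)  # clusters are well separated, so the estimate is near 0
```

The exponential concentration result quantifies how close such an estimate stays to the rule's true conditional error.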
Yang A., Fan J., Xiang D.
Journal of Complexity scimago Q1 wos Q1
2025-06-01 citations by CoLab: 0
Wang C., Fan J.
Journal of Complexity scimago Q1 wos Q1
2024-10-01 citations by CoLab: 1
