Foundations of Computational Mathematics

Optimality of Robust Online Learning

Publication type: Journal Article
Publication date: 2023-07-26
scimago Q1
SJR: 2.546
CiteScore: 6.9
Impact factor: 2.5
ISSN: 1615-3375, 1615-3383
Computational Mathematics
Computational Theory and Mathematics
Applied Mathematics
Analysis
Abstract
In this paper, we study an online learning algorithm with a robust loss function $\mathcal{L}_{\sigma}$ for regression over a reproducing kernel Hilbert space (RKHS). The loss function $\mathcal{L}_{\sigma}$, which involves a scaling parameter $\sigma>0$, covers a wide range of commonly used robust losses. The proposed algorithm is thus a robust alternative to online least squares regression for estimating the conditional mean function. For properly chosen $\sigma$ and step size, we show that the last iterate of this online algorithm achieves the optimal capacity-independent convergence rate in the mean square distance. Moreover, if additional information on the underlying function space is available, we also establish optimal capacity-dependent rates for strong convergence in the RKHS norm. To the best of our knowledge, both results are new to the existing literature on online learning.
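To make the setting concrete, the sketch below runs one pass of online gradient descent in an RKHS with a bounded robust loss. It is a minimal illustration under my own assumptions, not the paper's exact scheme: the Welsch (correntropy-type) loss stands in for the general $\mathcal{L}_{\sigma}$, the Gaussian kernel and the polynomially decaying step size are arbitrary choices, and all function and parameter names are illustrative.

```python
import numpy as np

def gaussian_kernel(x, z, gamma=1.0):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def robust_online_kernel_regression(X, y, sigma=1.0, gamma=1.0, eta0=0.5, theta=0.5):
    """One pass of online gradient descent in the RKHS with a Welsch-type
    robust loss  L_sigma(u) = (sigma^2 / 2) * (1 - exp(-u^2 / sigma^2)),
    whose derivative u * exp(-u^2 / sigma^2) is bounded (hence robust).
    The iterate f_t = sum_i alpha_i K(x_i, .) is stored through its coefficients."""
    centers, alphas = [], []
    for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
        # evaluate the current iterate at the incoming point
        f_xt = sum(a * gaussian_kernel(c, x_t, gamma) for c, a in zip(centers, alphas))
        residual = f_xt - y_t
        grad = residual * np.exp(-residual ** 2 / sigma ** 2)  # derivative of the robust loss
        eta_t = eta0 * t ** (-theta)                           # polynomially decaying step size
        centers.append(x_t)
        alphas.append(-eta_t * grad)
    def predict(x):
        return sum(a * gaussian_kernel(c, x, gamma) for c, a in zip(centers, alphas))
    return predict

# toy usage: heavy-tailed noise around a smooth target
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_t(df=2, size=500)
f_hat = robust_online_kernel_regression(X, y, sigma=1.0, gamma=5.0)
print(f_hat(np.array([0.25])))
```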
Guo X., Guo Z., Shi L.
2023-11-01 citations by CoLab: 10 Abstract  
This article provides a convergence analysis of online stochastic gradient descent algorithms for functional linear models. By characterizing the regularity of the slope function, the capacity of the kernel space, and the capacity of the covariance operator of the sampling process, we achieve significant improvements in the convergence rates. Both prediction and estimation problems are studied, and we show that the capacity assumption can alleviate the saturation of the convergence rate as the regularity of the target function increases. We show that, with a properly selected kernel, capacity assumptions can fully compensate for regularity assumptions in prediction problems (but not in estimation problems). This demonstrates a significant difference between prediction and estimation problems in functional data analysis.
Zhu X., Li Z., Sun J.
2023-01-01 citations by CoLab: 6 Abstract  
Expression recognition has long been an important research direction in psychology, with applications in traffic, medical, security, and criminal-investigation settings, since human feelings are conveyed through the muscles around the mouth, eyes, and face. Most existing work uses convolutional neural networks (CNNs) to recognize face images and classify expressions, which achieves good results, but CNNs have limited ability to extract global features. The Transformer has advantages for global feature extraction, but it is computationally intensive and requires a large amount of training data. In this paper, we therefore use a hierarchical Transformer, the Swin Transformer, for the expression recognition task, which greatly reduces the computational cost. We fuse it with a CNN model to propose a network architecture that combines the Transformer and CNN; to the best of our knowledge, we are the first to combine the Swin Transformer with a CNN for an expression recognition task. We evaluate the proposed method on several publicly available expression datasets and obtain competitive results.
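A minimal PyTorch sketch of the kind of two-branch fusion the abstract describes is given below. A plain `nn.TransformerEncoder` stands in for the Swin Transformer, the layer sizes are arbitrary, and the class name `CNNTransformerClassifier` is illustrative; this is not the authors' architecture.

```python
import torch
import torch.nn as nn

class CNNTransformerClassifier(nn.Module):
    """Toy fusion of a small CNN branch with a Transformer encoder branch.
    A plain nn.TransformerEncoder stands in for the Swin Transformer here;
    the real model would use a hierarchical (shifted-window) attention backbone."""
    def __init__(self, num_classes=7, dim=64):
        super().__init__()
        # CNN branch: local feature extractor
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),             # -> (B, dim, 1, 1)
        )
        # Transformer branch: global attention over non-overlapping patches
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=8, stride=8)  # 8x8 patches
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # fused classifier head
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, x):                         # x: (B, 3, H, W)
        local_feat = self.cnn(x).flatten(1)       # (B, dim)
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        global_feat = self.encoder(patches).mean(dim=1)           # (B, dim)
        return self.head(torch.cat([local_feat, global_feat], dim=1))

model = CNNTransformerClassifier()
logits = model(torch.randn(2, 3, 64, 64))         # 2 images, 7 expression classes
print(logits.shape)                               # torch.Size([2, 7])
```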
Feng H., Hou S., Wei L., Zhou D.
2022-07-07 citations by CoLab: 20 Abstract  
Readability of Chinese texts, as considered in this paper, is a multi-class classification problem with 12 grade classes corresponding to 6 grades in primary schools, 3 grades in middle schools, and 3 grades in high schools. A special property of this problem is the strong ambiguity in determining the grades. To handle this difficulty, readability assessment methods are evaluated in practice by adjacent accuracy in addition to exact accuracy. In this paper we give mathematical definitions of these concepts in a learning theory framework and compare the two quantities in terms of the ambiguity level of texts. A deep learning algorithm is proposed for readability assessment of Chinese texts, based on convolutional neural networks and a pre-trained BERT model for vector representations of Chinese characters. The proposed CNN model extracts sentence and text features by convolving sentence representations with filters and is efficient for readability assessment, as demonstrated by numerical experiments.
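Since the abstract distinguishes exact accuracy from adjacent accuracy, the short helper below computes both under the usual convention that a prediction within one grade of the truth counts as adjacent-correct; the function name is illustrative, not from the paper.

```python
import numpy as np

def exact_and_adjacent_accuracy(y_true, y_pred):
    """Exact accuracy counts predictions hitting the true grade;
    adjacent accuracy also accepts predictions off by at most one grade,
    the tolerance commonly used in readability assessment."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    exact = np.mean(y_pred == y_true)
    adjacent = np.mean(np.abs(y_pred - y_true) <= 1)
    return exact, adjacent

# grades 1..12; predicting 7 for a grade-6 text counts as adjacent-correct
print(exact_and_adjacent_accuracy([6, 6, 11, 3], [6, 7, 9, 3]))  # (0.5, 0.75)
```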
Chen X., Tang B., Fan J., Guo X.
Journal of Complexity scimago Q1 wos Q1
2022-06-01 citations by CoLab: 16 Abstract  
The functional linear model is a widely applied framework for regression problems, including those with intrinsically infinite-dimensional data. Online gradient descent methods, despite their demonstrated power for processing online or large-scale data, are not well studied for learning with functional data. In this paper, we study reproducing kernel-based online learning algorithms for functional data and derive convergence rates for the expected excess prediction risk under both the online and finite-horizon step-size settings. It is well understood that nontrivial uniform convergence rates for the estimation task depend on the regularity of the slope function. Surprisingly, the convergence rates we derive for the prediction task require no regularity of the slope function. Our analysis reveals the intrinsic difference between the estimation task and the prediction task in functional data learning.
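As an illustration of a kernel-based online algorithm for functional data, the sketch below runs stochastic gradient descent for the functional linear model $y = \int_0^1 \beta(s) X(s)\,ds + \varepsilon$ with curves observed on a common grid, the kernel smoothing of each curve playing the role of the operator $L_K$. The discretization, step-size schedule, and all names are my assumptions for a minimal sketch, not taken from the paper.

```python
import numpy as np

def functional_online_sgd(X_curves, y, kernel_matrix, h, eta0=1.0, theta=0.5):
    """Online SGD for the functional linear model y = <beta, X>_{L^2} + noise,
    with beta estimated in the RKHS of a kernel K on [0, 1].  Curves are
    sampled on a common grid with spacing h; kernel_matrix[i, j] = K(s_i, s_j).
    The RKHS stochastic gradient of the squared prediction error at X_t is
    (prediction - y_t) * (L_K X_t), where (L_K X_t)(s) = int K(s, u) X_t(u) du."""
    beta = np.zeros(X_curves.shape[1])
    for t, (x_t, y_t) in enumerate(zip(X_curves, y), start=1):
        pred = h * np.dot(beta, x_t)              # <beta_t, X_t>_{L^2} on the grid
        smoothed = h * kernel_matrix @ x_t        # L_K X_t on the grid
        beta -= eta0 * t ** (-theta) * (pred - y_t) * smoothed
    return beta

# toy usage on the grid s_j = j/m with a Gaussian kernel
m, n = 50, 400
s = np.linspace(0, 1, m)
K = np.exp(-((s[:, None] - s[None, :]) ** 2) / 0.1)
rng = np.random.default_rng(1)
X_curves = rng.standard_normal((n, m)).cumsum(axis=1) / np.sqrt(m)  # rough random curves
beta_star = np.sin(2 * np.pi * s)
y = (1 / m) * X_curves @ beta_star + 0.1 * rng.standard_normal(n)
beta_hat = functional_online_sgd(X_curves, y, K, h=1 / m)
print(np.mean((beta_hat - beta_star) ** 2))       # rough check of the recovery
```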
Feng Y., Wu Q.
Neural Computation scimago Q1 wos Q3
2021-04-15 citations by CoLab: 5 Abstract  
We develop in this letter a framework of empirical gain maximization (EGM) to address the robust regression problem where heavy-tailed noise or outliers may be present in the response variable. The idea of EGM is to approximate the density function of the noise distribution instead of approximating the true function directly, as is usual. Unlike classical maximum likelihood estimation, which assigns equal importance to all observations and can be problematic in the presence of abnormal observations, EGM schemes can be interpreted from a minimum distance estimation viewpoint and allow such observations to be ignored. Furthermore, we show that several well-known robust nonconvex regression paradigms, such as Tukey regression and truncated least squares regression, can be reformulated within this new framework. We then develop a learning theory for EGM, by means of which a unified analysis can be conducted for these well-established but not fully understood regression approaches. The new framework leads to a novel interpretation of existing bounded nonconvex loss functions. Within it, two seemingly unrelated objects, the well-known Tukey's biweight loss for robust regression and the triweight kernel for nonparametric smoothing, turn out to be closely related. More precisely, we show that Tukey's biweight loss can be derived from the triweight kernel. Other frequently employed bounded nonconvex loss functions in machine learning, such as the truncated square loss, the Geman-McClure loss, and the exponential squared loss, can also be reformulated from certain smoothing kernels in statistics. In addition, the new framework enables us to devise new bounded nonconvex loss functions for robust learning.
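To make the kernel-loss connection concrete, the display below spells out the relation between the triweight kernel and Tukey's biweight loss under their standard normalizations. This is a worked identity supplied for illustration, not a quotation from the paper, and other scalings of the constants are possible.

```latex
% Triweight kernel and Tukey's biweight loss (standard normalizations):
W(u) = \tfrac{35}{32}\,(1-u^{2})^{3}\,\mathbf{1}_{\{|u|\le 1\}},
\qquad
\rho_{c}(t) =
\begin{cases}
\frac{c^{2}}{6}\bigl[1-\bigl(1-(t/c)^{2}\bigr)^{3}\bigr], & |t|\le c,\\[0.5ex]
\frac{c^{2}}{6}, & |t|> c.
\end{cases}
% For |t| <= c the loss is a constant minus a rescaled kernel evaluation,
%   rho_c(t) = (c^2/6) * (1 - (32/35) * W(t/c)),
% so maximizing an empirical gain built from the kernel is equivalent to
% minimizing the biweight loss.
```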
Huang S., Feng Y., Wu Q.
Analysis and Applications scimago Q1 wos Q1
2021-02-10 citations by CoLab: 26 Abstract  
Minimum error entropy (MEE) is an information-theoretic learning approach that minimizes the information contained in the prediction error, as measured by entropy. It has been successfully used in various machine learning tasks for its robustness to heavy-tailed distributions and outliers. In this paper, we consider its use in nonparametric regression and analyze its generalization performance from a learning theory perspective by imposing a weak moment condition on the noise variable. To this end, we establish a comparison theorem to characterize the relation between the excess generalization error and the prediction error. A relaxed Bernstein condition and concentration inequalities are used to derive error bounds and learning rates. Note that this moment condition is rather weak, particularly when the moment order is below 2, since the noise variable then need not even admit a finite variance. Therefore, our analysis explains the robustness of MEE in the presence of heavy-tailed distributions.
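For readers unfamiliar with MEE, the empirical criterion used in information-theoretic learning is typically the Parzen-window estimate of Rényi's quadratic entropy of the errors. The display below records this standard formulation (with a Gaussian window of bandwidth $h$) as background; it is not quoted from this paper.

```latex
% Errors e_i = y_i - f(x_i); Gaussian window G_h(t) = exp(-t^2 / (2 h^2)).
\widehat{H}_h(f) \;=\; -\log\!\Bigl(\frac{1}{n^{2}}
\sum_{i=1}^{n}\sum_{j=1}^{n} G_h\bigl(e_i - e_j\bigr)\Bigr),
\qquad e_i = y_i - f(x_i).
% Minimizing the empirical error entropy H_h is equivalent to maximizing the
% "information potential"  V_h(f) = (1/n^2) * sum_{i,j} G_h(e_i - e_j).
```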
Lu S., Mathé P., Pereverzev S.V.
2020-01-01 citations by CoLab: 20 Abstract  
We discuss the problem of parameter choice in learning algorithms generated by a general regularization scheme. Such a scheme covers well-known algorithms such as regularized least squares and gradient descent learning. It is known that, in contrast to classical deterministic regularization methods, the performance of regularized learning algorithms is influenced not only by the smoothness of the target function but also by the capacity of the space where regularization is performed. In the infinite-dimensional case the latter is usually measured in terms of the effective dimension. In the context of supervised learning, both the smoothness and the effective dimension are intrinsically unknown a priori. We are therefore interested in a posteriori regularization parameter choice, and we propose a new form of the balancing principle (a generic sketch of this type of rule is given after the highlights below). An advantage of this strategy over known rules such as cross-validation-based adaptation is that it does not require any data splitting and allows the use of all available labeled data in the construction of regularized approximants. We provide an analysis of the proposed rule and demonstrate its advantage in simulations.
• For smooth kernels, a power decay of effective dimensions may not be appropriate.
• Learning rates under general smoothness assumptions on target functions.
• Optimal learning rates in all involved norms, based on a novel bound.
• An adaptive parameter choice which uses the whole dataset without data splitting.
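The sketch below shows a generic Lepskii-type balancing rule of the kind referred to above: given estimators computed on a grid of regularization parameters and a bound on the sample error, it picks the most regularized parameter whose estimator is still consistent with all less regularized ones. The constant 4, the fallback convention, and the function names are illustrative assumptions; the paper's precise rule differs in its details.

```python
import numpy as np

def balancing_principle(estimators, sample_error_bound, kappa=4.0):
    """Generic Lepskii-type balancing rule (a sketch, not the paper's exact form).
    `estimators` maps a regularization parameter lam to the fitted coefficient
    vector f_lam; `sample_error_bound(lam)` bounds the stochastic error and
    grows as lam decreases.  Return the largest lam whose estimator stays
    within kappa times the bound of every less regularized estimator."""
    lams = sorted(estimators, reverse=True)          # most regularized first
    for i, lam in enumerate(lams):
        if all(
            np.linalg.norm(estimators[lam] - estimators[mu])
            <= kappa * sample_error_bound(mu)
            for mu in lams[i + 1:]
        ):
            return lam                               # largest admissible parameter
    return lams[-1]                                  # fall back to the least regularized
```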
Lv F., Fan J.
Analysis and Applications scimago Q1 wos Q1
2019-11-11 citations by CoLab: 23 Abstract  
Correntropy-based learning has achieved great success in practice over the last decades. It originates from information-theoretic learning and provides an alternative to the classical least squares method in the presence of non-Gaussian noise. In this paper, we investigate the theoretical properties of learning algorithms generated by Tikhonov regularization schemes associated with Gaussian kernels and the correntropy loss. By choosing an appropriate scale parameter for the Gaussian kernel, we show a polynomial decay of the approximation error under a Sobolev smoothness condition. In addition, we employ a tight upper bound for the uniform covering number of the Gaussian RKHS in order to improve the estimate of the sample error. Based on these two results, we show that the proposed algorithm with a varying Gaussian kernel achieves the minimax rate of convergence (up to a logarithmic factor) without knowing the smoothness level of the regression function.
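For reference, a commonly used form of the correntropy-induced loss and the corresponding Tikhonov-regularized estimator over a Gaussian RKHS are written out below. The normalization of the scale parameter varies across the literature, so this should be read as a representative convention rather than the paper's exact definition.

```latex
% Correntropy-induced loss with scale parameter sigma (one common convention):
\ell_{\sigma}\bigl(y, f(x)\bigr) \;=\; \sigma^{2}
\Bigl(1 - \exp\bigl(-(y - f(x))^{2}/\sigma^{2}\bigr)\Bigr).
% Tikhonov-regularized estimator over the Gaussian RKHS H_{K_h} of width h:
f_{\mathbf{z},\lambda} \;=\; \operatorname*{arg\,min}_{f \in \mathcal{H}_{K_h}}
\;\frac{1}{n}\sum_{i=1}^{n}\ell_{\sigma}\bigl(y_i, f(x_i)\bigr)
\;+\; \lambda\,\|f\|_{K_h}^{2}.
```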
Schölkopf B., Smola A.J.
2018-12-14 citations by CoLab: 507
Guo Z., Hu T., Shi L.
Inverse Problems scimago Q1 wos Q2
2018-05-08 citations by CoLab: 12 Abstract  
In this paper, we study the gradient descent algorithm generated by a robust loss function over a reproducing kernel Hilbert space (RKHS). The loss function is defined by a windowing function G and a scale parameter σ, and can include a wide range of commonly used robust losses for regression. There is still a gap between the theoretical analysis and the optimization process of empirical risk minimization based on this loss: the estimator needs to be globally optimal in the theoretical analysis, while the optimization method cannot ensure the global optimality of its solutions. In this paper, we aim to fill this gap by developing a novel theoretical analysis of the performance of estimators generated by the gradient descent algorithm. We demonstrate that, with an appropriately chosen scale parameter σ, the gradient update with early stopping rules can approximate the regression function. Our error analysis leads to convergence in both the standard $L^2$ norm and the strong RKHS norm, with rates that are optimal in the minimax sense. We show that the scale parameter σ plays an important role in providing robustness as well as fast convergence. Numerical experiments on synthetic examples and a real data set also support our theoretical results.
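The sketch below illustrates the batch counterpart of the online method above: full-gradient descent in the RKHS with a bounded robust loss, stopped early. The Welsch loss again stands in for the general windowed loss, the hold-out stopping rule is a practical stand-in for a theoretical early-stopping rule, and all names are illustrative, not the paper's.

```python
import numpy as np

def kernel_gd_robust_early_stopping(K, y, sigma=1.0, eta=0.5, max_iter=200,
                                    K_val=None, y_val=None):
    """Batch gradient descent in the RKHS for a bounded robust loss, stopped early.
    K is the n x n kernel matrix on the training inputs; the iterate is
    f_t = sum_i alpha_i K(x_i, .).  As a concrete instance of the windowed loss
    we use the Welsch loss, whose derivative r * exp(-r^2 / sigma^2) is bounded.
    If a validation kernel block K_val (m x n) is supplied, the iterate with the
    smallest validation error is returned as a simple early-stopping rule."""
    n = K.shape[0]
    alpha = np.zeros(n)
    best_alpha, best_err = alpha.copy(), np.inf
    for _ in range(max_iter):
        residual = K @ alpha - y                       # f_t(x_i) - y_i
        grad_loss = residual * np.exp(-residual ** 2 / sigma ** 2)
        alpha -= (eta / n) * grad_loss                 # gradient step in the RKHS
        if K_val is not None:
            err = np.mean((K_val @ alpha - y_val) ** 2)
            if err < best_err:
                best_err, best_alpha = err, alpha.copy()
    return best_alpha if K_val is not None else alpha
```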
Bottou L., Curtis F.E., Nocedal J.
SIAM Review scimago Q1 wos Q1
2018-05-03 citations by CoLab: 1572 Abstract  
This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.
Guo Z., Lin S., Zhou D.
Inverse Problems scimago Q1 wos Q2
2017-06-21 citations by CoLab: 69 Abstract  
Spectral algorithms have been widely used and studied in learning theory and inverse problems. This paper is concerned with distributed spectral algorithms for handling big data, based on a divide-and-conquer approach. We present a learning theory for these distributed kernel-based learning algorithms in a regression framework, including error bounds and minimax optimal learning rates achieved by means of a novel integral operator approach and a second-order decomposition of inverse operators. Our quantitative estimates are given in terms of the regularity of the regression function, the effective dimension of the reproducing kernel Hilbert space, and the qualification of the filter function of the spectral algorithm. They do not need any eigenfunction or noise conditions and are better than existing results even for the classical family of spectral algorithms.
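As a concrete instance of the divide-and-conquer approach, the sketch below averages kernel ridge regression estimators (the simplest spectral algorithm) computed on disjoint data subsets. The Gaussian kernel, the regularization convention, and the function names are my illustrative choices, not the paper's.

```python
import numpy as np

def gaussian_gram(X1, X2, gamma=1.0):
    """Gaussian kernel Gram matrix between two sets of points (rows)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def distributed_krr(X, y, num_machines=4, lam=1e-2, gamma=1.0):
    """Divide-and-conquer kernel ridge regression: each machine solves KRR on
    its data subset and the local estimators are averaged.  Returns a
    prediction function for new inputs."""
    splits_X = np.array_split(X, num_machines)
    splits_y = np.array_split(y, num_machines)
    local = []
    for Xj, yj in zip(splits_X, splits_y):
        Kj = gaussian_gram(Xj, Xj, gamma)
        alpha_j = np.linalg.solve(Kj + lam * len(yj) * np.eye(len(yj)), yj)
        local.append((Xj, alpha_j))
    def predict(X_new):
        preds = [gaussian_gram(X_new, Xj, gamma) @ alpha_j for Xj, alpha_j in local]
        return np.mean(preds, axis=0)              # average of the local estimators
    return predict

# toy usage
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(800, 1))
y = np.cos(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(800)
f_bar = distributed_krr(X, y, num_machines=8, lam=1e-3, gamma=10.0)
print(f_bar(np.array([[0.0], [0.5]])))
```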
Blanchard G., Mücke N.
2017-06-20 citations by CoLab: 41 Abstract  
We consider a statistical inverse learning (also called inverse regression) problem, where we observe the image of a function f through a linear operator A at i.i.d. random design points $X_i$, superposed with an additive noise. The distribution of the design points is unknown and can be very general. We analyze simultaneously the direct (estimation of Af) and the inverse (estimation of f) learning problems. In this general framework, we obtain strong and weak minimax optimal rates of convergence (as the number of observations n grows large) for a large class of spectral regularization methods over regularity classes defined through appropriate source conditions. This improves on or completes previous results obtained in related settings. The optimality of the obtained rates is shown not only in the exponent in n but also in the explicit dependency of the constant factor on the variance of the noise and the radius of the source condition set.
Ying Y., Zhou D.
2017-03-01 citations by CoLab: 33 Abstract  
In this paper, we consider unregularized online learning algorithms in a reproducing kernel Hilbert space (RKHS). Firstly, we derive explicit convergence rates of the unregularized online learning algorithms for classification associated with a general α-activating loss (see Definition 1 below). Our results extend and refine the results in [30] for the least square loss and the recent result [3] for loss functions with a Lipschitz-continuous gradient. Moreover, we establish a very general condition on the step sizes which guarantees the convergence of the last iterate of such algorithms. Secondly, we establish, for the first time, the convergence of the unregularized pairwise learning algorithm with a general loss function and derive explicit rates under the assumption of polynomially decaying step sizes. Concrete examples are used to illustrate our main results. The main techniques are tools from convex analysis, refined inequalities of Gaussian averages [5], and an induction approach.
Dieuleveut A., Bach F.
Annals of Statistics scimago Q1 wos Q1
2016-07-07 citations by CoLab: 39 Abstract  
We consider the random-design least-squares regression problem within the reproducing kernel Hilbert space (RKHS) framework. Given a stream of independent and identically distributed input/output data, we aim to learn a regression function within an RKHS $\mathcal{H}$, even if the optimal predictor (i.e., the conditional expectation) is not in $\mathcal{H}$. In a stochastic approximation framework where the estimator is updated after each observation, we show that the averaged unregularized least-mean-square algorithm (a form of stochastic gradient descent), given a sufficiently large step size, attains optimal rates of convergence for a variety of regimes for the smoothness of the optimal prediction function and of the functions in $\mathcal{H}$.
Zhang Y., Fang Z., Fan J.
Neural Networks scimago Q1 wos Q1
2024-06-01 citations by CoLab: 2 Abstract  
Convolutional neural networks (CNNs) have gained immense popularity in recent years, finding utility in diverse fields such as image recognition, natural language processing, and bioinformatics. Despite the remarkable progress made in deep learning theory, most studies on CNNs, especially in regression tasks, rely heavily on the least squares loss function. However, there are situations where such learning algorithms may not suffice, particularly in the presence of heavy-tailed noise or outliers. This predicament emphasizes the necessity of exploring alternative loss functions that can handle such scenarios more effectively, thereby unleashing the true potential of CNNs. In this paper, we investigate the generalization error of deep CNNs with the rectified linear unit (ReLU) activation function for robust regression problems within an information-theoretic learning framework. Our study demonstrates that when the regression function exhibits an additive ridge structure and the noise possesses a finite pth moment, the empirical risk minimization scheme generated by the maximum correntropy criterion and deep CNNs achieves fast convergence rates. Notably, these rates align with the minimax optimal convergence rates attained by fully connected neural networks with the Huber loss function, up to a logarithmic factor. Additionally, we establish the convergence rates of deep CNNs under the maximum correntropy criterion when the regression function resides in a Sobolev space on the sphere.
Mao Y., Guo Z.
Journal of Complexity scimago Q1 wos Q1
2024-06-01 citations by CoLab: 2 Abstract  
In recent years, functional linear models have attracted growing attention in statistics and machine learning, with the aim of recovering the slope function or its functional predictor. This paper considers an online regularized learning algorithm for functional linear models in reproducing kernel Hilbert spaces. Convergence analysis of the excess prediction error and of the estimation error is provided, with polynomially decaying step size and constant step size, respectively. Fast convergence rates can be derived via a capacity-dependent analysis. By introducing an explicit regularization term, we lift the saturation boundary of unregularized online learning algorithms when the step size decays polynomially, and establish fast convergence rates for the estimation error without a capacity assumption. However, it remains an open problem to obtain capacity-independent convergence rates for the estimation error of the unregularized online learning algorithm with decaying step size. We also show that the convergence rates of both the prediction error and the estimation error with constant step size are competitive with those in the literature.
Fang Z., Mao T., Fan J.
Neural Computation scimago Q1 wos Q3
2024-03-21 citations by CoLab: 0 Abstract  
Abstract Combining information-theoretic learning with deep learning has gained significant attention in recent years, as it offers a promising approach to tackle the challenges posed by big data. However, the theoretical understanding of convolutional structures, which are vital to many structured deep learning models, remains incomplete. To partially bridge this gap, this letter aims to develop generalization analysis for deep convolutional neural network (CNN) algorithms using learning theory. Specifically, we focus on investigating robust regression using correntropy-induced loss functions derived from information-theoretic learning. Our analysis demonstrates an explicit convergence rate for deep CNN-based robust regression algorithms when the target function resides in the Korobov space. This study sheds light on the theoretical underpinnings of CNNs and provides a framework for understanding their performance and limitations.
Wang Y., Guo Z.
Applied Mathematics scimago Q4 wos Q2
2024-03-09 citations by CoLab: 0 Abstract  
In the realm of large-scale machine learning, it is crucial to explore methods for reducing computational complexity and memory demands while maintaining generalization performance. In addition, since the collected data may contain sensitive information, it is also of great significance to study privacy-preserving machine learning algorithms. This paper focuses on the performance of a differentially private stochastic gradient descent (SGD) algorithm based on random features. First, the algorithm maps the original data into a low-dimensional space, thereby avoiding the large-scale data storage requirement of traditional kernel methods. Next, the algorithm iteratively optimizes the parameters using stochastic gradient descent. Finally, an output perturbation mechanism is employed to introduce random noise, ensuring the privacy of the algorithm. We prove that the proposed algorithm satisfies differential privacy while achieving fast convergence rates under some mild conditions.
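The three steps just described (random-feature mapping, SGD, output perturbation) can be sketched in a few lines. This is illustrative code, not the paper's algorithm: the random Fourier features approximate a Gaussian kernel, the single SGD pass uses the squared loss, and the noise scale is a placeholder that a real (ε, δ) guarantee would have to calibrate to the sensitivity of the released weights.

```python
import numpy as np

def dp_sgd_random_features(X, y, num_features=200, bandwidth=1.0,
                           eta=0.1, noise_scale=0.1, seed=0):
    """Sketch of SGD on random Fourier features with output perturbation.
    1) Map inputs to a low-dimensional feature space approximating the
       Gaussian kernel, 2) run one pass of SGD on the squared loss, and
    3) add Gaussian noise to the released weights.  The noise scale is a
    placeholder: a real (epsilon, delta) guarantee requires calibrating it
    to the sensitivity of the SGD output."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=1.0 / bandwidth, size=(d, num_features))
    b = rng.uniform(0.0, 2 * np.pi, size=num_features)
    Z = np.sqrt(2.0 / num_features) * np.cos(X @ W + b)        # random Fourier features
    theta = np.zeros(num_features)
    for z_t, y_t in zip(Z, y):                                 # one SGD pass
        theta -= eta * (z_t @ theta - y_t) * z_t
    theta += rng.normal(scale=noise_scale, size=num_features)  # output perturbation
    def predict(X_new):
        return np.sqrt(2.0 / num_features) * np.cos(X_new @ W + b) @ theta
    return predict
```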
Mao Y., Shi L., Guo Z.
Journal of Approximation Theory scimago Q2 wos Q2
2024-01-01 citations by CoLab: 0 Abstract  
In this paper, we consider coefficient-based regularized distribution regression, which aims to regress from probability measures to real-valued responses over a reproducing kernel Hilbert space (RKHS), where the regularization is put on the coefficients and the kernels are allowed to be indefinite. The algorithm involves two stages of sampling: the first-stage sample consists of distributions, and the second-stage sample is drawn from these distributions. The asymptotic behavior of the algorithm is comprehensively studied across different regularity ranges of the regression function. Explicit learning rates are derived using kernel mean embedding and integral operator techniques. We obtain optimal rates under some mild conditions, which match the minimax optimal rates of one-stage sampling. Compared with the kernel methods for distribution regression in the existing literature, the algorithm under consideration does not require the kernel to be symmetric or positive semi-definite and hence provides a simple paradigm for designing indefinite kernel methods, which enriches the theme of distribution regression. To the best of our knowledge, this is the first result for distribution regression with indefinite kernels, and our algorithm can improve the learning performance against the saturation effect.
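A minimal sketch of the two-stage pipeline is given below: empirical kernel mean embeddings turn each bag of second-stage samples into a Gram matrix between distributions, and an l2 penalty on the coefficients yields a linear system that does not require the resulting kernel to be positive semi-definite. The Gaussian base kernel and all names are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def base_kernel(A, B, gamma=1.0):
    """Gaussian kernel between two sample sets (rows are points)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def embedding_gram(samples, gamma=1.0):
    """Gram matrix between distributions via empirical kernel mean embeddings:
    K(mu_i, mu_l) is approximated by the average pairwise base kernel between
    the second-stage samples drawn from mu_i and mu_l."""
    n = len(samples)
    G = np.empty((n, n))
    for i in range(n):
        for l in range(n):
            G[i, l] = base_kernel(samples[i], samples[l], gamma).mean()
    return G

def coefficient_regularized_dr(samples, y, lam=1e-2, gamma=1.0):
    """Coefficient-based regularized distribution regression: the estimator is
    f(mu) = sum_i alpha_i K(mu, mu_i) with an l2 penalty on the coefficients,
    which remains well defined even when the kernel on distributions is
    indefinite (no positive semi-definiteness is required)."""
    G = embedding_gram(samples, gamma)
    n = len(y)
    alpha = np.linalg.solve(G.T @ G + lam * n * np.eye(n), G.T @ np.asarray(y))
    return alpha, G

# toy usage: each "distribution" is observed through 30 samples in R^2
rng = np.random.default_rng(3)
samples = [rng.normal(loc=c, size=(30, 2)) for c in np.linspace(-1, 1, 20)]
y = np.linspace(-1, 1, 20) + 0.05 * rng.standard_normal(20)
alpha, G = coefficient_regularized_dr(samples, y, lam=1e-3, gamma=0.5)
```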
