Journal of Complexity, volume 88, article 101931

Online outcome weighted learning with general loss functions

Aoli Yang
Jun Fan
Dao-Hong Xiang
Publication type: Journal Article
Publication date: 2025-06-01
Scimago: Q1
WoS: Q1
SJR: 1.115
CiteScore: 3.1
Impact factor: 1.8
ISSN: 0885-064X, 1090-2708
Fan J., Lei Y.
2024-05-01 citations by CoLab: 2 Abstract  
Algorithmic stability is a fundamental concept in statistical learning theory for understanding the generalization behavior of optimization algorithms. Existing high-probability bounds are developed for the generalization gap as measured by function values and require the algorithm to be uniformly stable. In this paper, we introduce a novel stability measure called pointwise uniform stability, defined through the sensitivity of the algorithm to the perturbation of each individual training example. We show that this weaker pointwise uniform stability guarantees almost optimal bounds and yields the first high-probability bound for the generalization gap as measured by gradients. Sharper bounds are given for strongly convex and smooth problems. We further apply our general result to derive improved generalization bounds for stochastic gradient descent. As a byproduct, we develop concentration inequalities for sums of weakly dependent vector-valued random variables.
Guo X., Guo Z., Shi L.
2023-11-01 citations by CoLab: 10 Abstract  
This article provides a convergence analysis of online stochastic gradient descent algorithms for functional linear models. By exploiting characterizations of the regularity of the slope function, the capacity of the kernel space, and the capacity of the covariance operator of the sampling process, significant improvements in the convergence rates are achieved. Both prediction and estimation problems are studied, and we show that the capacity assumption can alleviate the saturation of the convergence rate as the regularity of the target function increases. We further show that, with a properly selected kernel, capacity assumptions can fully compensate for regularity assumptions in prediction problems (but not in estimation problems). This demonstrates a significant difference between prediction and estimation in functional data analysis.
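As a rough illustration of the kind of scheme analyzed above (not the authors' implementation), the sketch below runs online gradient descent for a discretized functional linear model with a polynomially decaying step size; the grid, step-size exponent, and data-generating process are illustrative assumptions.

    # Hypothetical sketch: online SGD for a discretized functional linear model
    # y = integral of beta(s) x(s) ds + noise, with curves observed on a fixed grid.
    import numpy as np

    rng = np.random.default_rng(0)
    grid = np.linspace(0.0, 1.0, 100)            # grid on which the curves are observed
    beta_true = np.sin(2 * np.pi * grid)         # illustrative slope function

    beta_hat = np.zeros_like(grid)
    for t in range(1, 5001):
        # One functional predictor per step: a scaled random walk on the grid.
        x_t = rng.standard_normal(grid.size).cumsum() / np.sqrt(grid.size)
        y_t = x_t @ beta_true / grid.size + 0.1 * rng.standard_normal()
        eta_t = 0.5 / np.sqrt(t)                 # polynomially decaying step size
        pred = x_t @ beta_hat / grid.size        # numerical approximation of <beta_hat, x_t>
        beta_hat += eta_t * (y_t - pred) * x_t   # stochastic gradient step

    print("slope estimation error:", np.mean((beta_hat - beta_true) ** 2))
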
Guo Z., Christmann A., Shi L.
2023-07-26 citations by CoLab: 7 Abstract  
In this paper, we study an online learning algorithm with a robust loss function $\mathcal{L}_{\sigma}$ for regression over a reproducing kernel Hilbert space (RKHS). The loss function $\mathcal{L}_{\sigma}$, which involves a scaling parameter $\sigma > 0$, can cover a wide range of commonly used robust losses. The proposed algorithm is then a robust alternative to online least squares regression aiming to estimate the conditional mean function. For properly chosen $\sigma$ and step size, we show that the last iterate of this online algorithm can achieve optimal capacity-independent convergence in the mean square distance. Moreover, if additional information on the underlying function space is known, we also establish optimal capacity-dependent rates for strong convergence in the RKHS. To the best of our knowledge, both results are new to the existing literature on online learning.
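A minimal sketch of this style of update, assuming a Gaussian kernel and a Welsch-type scaled loss $\mathcal{L}_{\sigma}(r) = \frac{\sigma^2}{2}(1 - e^{-r^2/\sigma^2})$ as one member of the kind of family such losses can cover; the kernel, $\sigma$, step-size exponent, and data are illustrative, and the code is not the paper's algorithm.

    # Hypothetical sketch: online kernel gradient descent with a robust scaled loss.
    # The iterate f_t is stored as a kernel expansion f_t = sum_i c_i K(x_i, .).
    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 1.0                                   # scaling parameter of the loss

    def kernel(u, v, bandwidth=0.5):
        return np.exp(-((u - v) ** 2) / (2 * bandwidth ** 2))

    centers, coefs = [], []

    def f(x):
        return sum(c * kernel(z, x) for z, c in zip(centers, coefs))

    for t in range(1, 801):
        x_t = rng.uniform(-1.0, 1.0)
        y_t = np.sin(np.pi * x_t) + 0.1 * rng.standard_normal()
        if rng.random() < 0.05:                   # occasional heavy-tailed outlier
            y_t += 10.0 * rng.standard_normal()
        r = f(x_t) - y_t
        grad = r * np.exp(-r ** 2 / sigma ** 2)   # derivative of the Welsch-type loss
        eta_t = 1.0 / t ** 0.6                    # polynomially decaying step size
        centers.append(x_t)
        coefs.append(-eta_t * grad)               # f_{t+1} = f_t - eta_t * L'_sigma(r) K(x_t, .)

    print("last iterate at 0.5:", f(0.5), "vs target", np.sin(0.5 * np.pi))
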
Shah K.S., Fu H., Kosorok M.R.
Biometrics scimago Q1 wos Q2
2023-01-23 citations by CoLab: 3 Abstract  
In recent years, the field of precision medicine has seen many advancements. Significant focus has been placed on creating algorithms to estimate individualized treatment rules (ITRs), which map patient covariates to the space of available treatments with the goal of maximizing patient outcome. Direct Learning (D-Learning) is a recent one-step method which estimates the ITR by directly modeling the treatment-covariate interaction. However, when the variance of the outcome is heterogeneous with respect to treatment and covariates, D-Learning does not leverage this structure. Stabilized Direct Learning (SD-Learning), proposed in this paper, utilizes potential heteroscedasticity in the error term through a residual reweighting that models the residual variance via flexible machine learning algorithms such as XGBoost and random forests. We also develop an internal cross-validation scheme which determines the best residual model among competing models. SD-Learning improves the efficiency of D-Learning estimates in binary and multi-arm treatment scenarios. The method is simple to implement and is an easy way to improve existing algorithms within the D-Learning family, including original D-Learning, Angle-based D-Learning (AD-Learning), and Robust D-Learning (RD-Learning). We provide theoretical properties and a justification of the optimality of SD-Learning. Head-to-head performance comparisons with D-Learning methods are provided through simulations, which demonstrate improvement in terms of average prediction error (APE), misclassification rate, and empirical value, along with a data analysis of an AIDS randomized clinical trial.
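The residual-reweighting idea can be illustrated in a simplified (non-ITR) heteroscedastic regression; the sketch below is a hypothetical reading of the abstract rather than the authors' SD-Learning code, and the variance model, data, and weighting are illustrative choices.

    # Hypothetical sketch of residual-variance reweighting in a heteroscedastic regression.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(2)
    n = 2000
    X = rng.uniform(-1.0, 1.0, size=(n, 3))
    noise_sd = 0.2 + np.abs(X[:, 0])                       # variance depends on the covariates
    y = 1.5 * X[:, 0] - X[:, 1] + noise_sd * rng.standard_normal(n)

    # Step 1: preliminary unweighted fit and its squared residuals.
    prelim = LinearRegression().fit(X, y)
    resid_sq = (y - prelim.predict(X)) ** 2

    # Step 2: flexible machine-learning model for the residual variance.
    var_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, resid_sq)
    var_hat = np.clip(var_model.predict(X), 1e-3, None)

    # Step 3: refit with inverse-variance weights to stabilize the estimate.
    stabilized = LinearRegression().fit(X, y, sample_weight=1.0 / var_hat)
    print("unweighted coefficients:", prelim.coef_)
    print("stabilized coefficients:", stabilized.coef_)
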
Guo Z., Hu T., Shi L.
Inverse Problems scimago Q1 wos Q2
2022-12-30 citations by CoLab: 2 Abstract  
This paper considers spectral pairwise ranking algorithms in a reproducing kernel Hilbert space. The algorithms concerned include a large family of regularized pairwise learning algorithms. Spectral algorithms, motivated by regularization methods, were originally proposed to solve ill-posed linear inverse problems and were subsequently developed in learning theory and inverse problems. Recently, pairwise learning tasks such as bipartite ranking, similarity metric learning, the Minimum Error Entropy principle, and AUC maximization have received increasing attention due to their wide applications. However, spectral algorithms act on the spectrum of the empirical integral operator or kernel matrix, involving the singular value decomposition or the inverse of the matrix, which is time-consuming when the sample size is large. Our contribution is twofold. First, under general source conditions and capacity assumptions, we establish the first minimax optimal convergence rates for spectral pairwise ranking algorithms. Second, we consider a distributed version of the algorithms based on a divide-and-conquer approach and show that, as long as the partition of the data set is not too large, the distributed learning algorithm enjoys both computational efficiency and statistical optimality.
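To convey the divide-and-conquer part in a simplified setting, the sketch below averages kernel ridge estimators fitted on disjoint partitions of the data; it swaps the pairwise spectral algorithm for plain kernel ridge regression for brevity, so it should be read only as an illustration of the distributed idea, with all parameters chosen for the toy example.

    # Hypothetical sketch: divide-and-conquer averaging of kernel ridge estimators.
    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    rng = np.random.default_rng(3)
    n, m = 3000, 10                                   # sample size and number of partitions
    X = rng.uniform(-1.0, 1.0, size=(n, 1))
    y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)

    # Fit one local estimator per partition of the data set.
    parts = np.array_split(rng.permutation(n), m)
    models = [KernelRidge(kernel="rbf", alpha=1e-3, gamma=5.0).fit(X[idx], y[idx])
              for idx in parts]

    # The distributed estimator is the average of the local predictions.
    X_test = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
    pred = np.mean([model.predict(X_test) for model in models], axis=0)
    print("test error:", np.mean((pred - np.sin(3.0 * X_test[:, 0])) ** 2))
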
Chen X., Tang B., Fan J., Guo X.
Journal of Complexity scimago Q1 wos Q1
2022-06-01 citations by CoLab: 16 Abstract  
The functional linear model is a widely applied general framework for regression problems, including those with intrinsically infinite-dimensional data. Online gradient descent methods, despite their demonstrated power in processing online or large-scale data, are not well studied for learning with functional data. In this paper, we study reproducing kernel-based online learning algorithms for functional data and derive convergence rates for the expected excess prediction risk under both online and finite-horizon step-size settings. It is well understood that nontrivial uniform convergence rates for the estimation task depend on the regularity of the slope function. Surprisingly, the convergence rates we derive for the prediction task require no regularity assumptions on the slope function. Our analysis reveals the intrinsic difference between the estimation task and the prediction task in functional data learning.
Lu S., Mathé P.
Mathematics of Computation scimago Q1 wos Q1
2021-11-10 citations by CoLab: 9 Abstract  
We investigate stochastic gradient descent (SGD) for solving fully infinite-dimensional ill-posed problems in Hilbert spaces. We allow for batch-size versions of SGD where the randomly chosen batches incur noise fluctuations. Based on the corresponding bias-variance decomposition, we provide bounds for the root mean squared error. These bounds take into account the discretization levels, the decay of the step size (which is more flexible than in existing results), and the underlying smoothness in terms of general source conditions. This allows SGD to be applied to severely ill-posed problems. The obtained error bounds exhibit three stages in the performance of SGD; in particular, the pre-asymptotic behavior can be clearly seen. Numerical studies verify the theoretical predictions.
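A toy discretized version of this setting, assuming a synthetic mildly ill-posed operator and illustrative batch size and step-size decay; this sketches the batch-size SGD idea rather than the paper's method or parameters.

    # Hypothetical sketch: mini-batch SGD for a discretized ill-posed problem A x = y + noise.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 200
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = U @ np.diag(1.0 / np.arange(1, n + 1) ** 2) @ V.T   # rapidly decaying singular values
    x_true = V[:, 0] + 0.5 * V[:, 1]                        # truth concentrated on leading modes
    y = A @ x_true + 1e-6 * rng.standard_normal(n)

    x, batch = np.zeros(n), 20
    for t in range(1, 3001):
        idx = rng.choice(n, size=batch, replace=False)      # randomly chosen batch of rows
        grad = A[idx].T @ (A[idx] @ x - y[idx]) / batch     # noisy stochastic gradient
        x -= (10.0 / np.sqrt(t)) * grad                     # decaying step size
    print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
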
Hoi S.C., Sahoo D., Lu J., Zhao P.
Neurocomputing scimago Q1 wos Q1
2021-10-01 citations by CoLab: 316 Abstract  
Online learning represents a family of machine learning methods in which a learner attempts to tackle some predictive (or other decision-making) task by learning from a sequence of data instances one at a time. The goal of online learning is to maximize the accuracy/correctness of the sequence of predictions/decisions made by the online learner, given knowledge of the correct answers to previous prediction/learning tasks and possibly additional information. This is in contrast to traditional batch or offline machine learning methods, which are often designed to learn a model from the entire training data set at once. Online learning has become a promising technique for learning from continuous streams of data in many real-world applications. This article aims to provide a comprehensive survey of the online machine learning literature through a systematic review of basic ideas and key principles and a proper categorization of different algorithms and techniques. Generally speaking, according to the types of learning tasks and the forms of feedback information, existing online learning works can be classified into three major categories: (i) online supervised learning, where full feedback information is always available; (ii) online learning with limited feedback; and (iii) online unsupervised learning, where no feedback is available. Due to space limitations, the survey is mainly focused on the first category, but it also briefly covers some basics of the other two categories. Finally, we discuss some open issues and attempt to shed light on potential future research directions in this field.
Fan J., Xiang D.
2020-06-01 citations by CoLab: 4 Abstract  
High-dimensional binary classification has been intensively studied in the machine learning community over the last few decades. The support vector machine (SVM), one of the most popular classifiers, depends only on a portion of the training samples, called support vectors, which leads to suboptimal performance in the high-dimension, low-sample-size (HDLSS) setting. Large-margin unified machines (LUMs) are a family of margin-based classifiers proposed to solve the so-called 'data piling' problem inherent in SVM under HDLSS settings. In this paper we study binary classification algorithms associated with LUM loss functions in the framework of reproducing kernel Hilbert spaces. A quantitative convergence analysis is carried out for these algorithms by means of a novel application of projection operators to overcome the technical difficulty. The rates are explicitly derived under a priori conditions on the approximation and capacity of the reproducing kernel Hilbert space.
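For reference, a commonly used parameterization of the LUM loss family, with index parameters a > 0 and c >= 0 (recovering the hinge loss as c tends to infinity and the distance-weighted discrimination loss at a = c = 1), is sketched below; this is background on the loss family rather than a statement of this paper's exact setup.

    # One common parameterization of the LUM loss V(u), where u = y f(x) is the margin.
    import numpy as np

    def lum_loss(u, a=1.0, c=1.0):
        """LUM loss: hinge-like (linear) for small margins, smoothly decaying otherwise."""
        u = np.asarray(u, dtype=float)
        threshold = c / (1.0 + c)
        denom = np.maximum((1.0 + c) * u - c + a, 1e-12)   # clip to avoid NaNs off-branch
        smooth = (1.0 / (1.0 + c)) * (a / denom) ** a
        return np.where(u < threshold, 1.0 - u, smooth)

    margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
    print(lum_loss(margins, a=1.0, c=1.0))   # a = c = 1 gives 1 - u below 1/2 and 1/(4u) above
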
Lu S., Mathé P., Pereverzev S.V.
2020-01-01 citations by CoLab: 20 Abstract  
We discuss the problem of parameter choice in learning algorithms generated by a general regularization scheme. Such a scheme covers well-known algorithms such as regularized least squares and gradient descent learning. It is known that, in contrast to classical deterministic regularization methods, the performance of regularized learning algorithms is influenced not only by the smoothness of a target function but also by the capacity of the space where regularization is performed. In the infinite-dimensional case the latter is usually measured in terms of the effective dimension. In the context of supervised learning, both the smoothness and the effective dimension are intrinsically unknown a priori. Therefore we are interested in a posteriori regularization parameter choice, and we propose a new form of the balancing principle. An advantage of this strategy over known rules such as cross-validation-based adaptation is that it does not require any data splitting and allows the use of all available labeled data in the construction of regularized approximants. We provide an analysis of the proposed rule and demonstrate its advantage in simulations.
Highlights:
• For smooth kernels, a power decay of effective dimensions may not be appropriate.
• Learning rates under general smoothness assumptions on target functions.
• Optimal learning rates in all involved norms based on a novel bound.
• An adaptive parameter choice which uses the whole dataset without data splitting.
Fan J., Lv F., Shi L.
2019-07-08 citations by CoLab: 9 Abstract  
In recent years there has been massive interest in precision medicine, which aims to tailor treatment plans to the individual characteristics of each patient. This paper studies the estimation of individualized treatment rules (ITRs) based on functional predictors such as images or spectra. We consider a reproducing kernel Hilbert space (RKHS) approach to learn the optimal ITR, which maximizes the expected clinical outcome. The algorithm can be conveniently implemented even though it involves infinite-dimensional functional data. We provide a convergence rate for prediction under mild conditions, jointly determined by both the covariance kernel and the reproducing kernel.
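Outcome weighted learning, the theme of the main article above, reduces ITR estimation to a weighted classification problem. The sketch below is a schematic batch version using a weighted kernel SVM on discretized functional covariates; the propensity, outcome model, kernel, and all variable names are illustrative assumptions, not the estimator analyzed in this cited paper.

    # Hypothetical sketch: outcome weighted learning as weighted classification,
    # here with a kernel SVM on discretized functional covariates (a simplification).
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(5)
    n, grid = 1000, np.linspace(0.0, 1.0, 50)
    X = rng.standard_normal((n, grid.size)).cumsum(axis=1) / np.sqrt(grid.size)
    A = rng.choice([-1, 1], size=n)                       # randomized treatment, propensity 1/2
    effect = np.tanh(X @ np.sin(2 * np.pi * grid) / grid.size)
    R = 2.0 + A * effect + 0.1 * rng.standard_normal(n)   # clinical outcome, larger is better

    # Weight each (X_i, A_i) by R_i / propensity and fit a weighted classifier.
    weights = np.maximum(R, 0.0) / 0.5
    rule = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, A, sample_weight=weights)

    # The estimated ITR recommends the treatment predicted by the weighted classifier.
    agree = np.mean(rule.predict(X) == np.sign(effect))
    print("agreement with the sign of the true effect:", agree)
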
Guo Z., Shi L.
2019-06-06 citations by CoLab: 10 Abstract  
In this paper, we study an online learning algorithm without explicit regularization terms. This algorithm is essentially a stochastic gradient descent scheme in a reproducing kernel Hilbert space (RKHS). The polynomially decaying step size in each iteration can play the role of regularization to ensure the generalization ability of the online learning algorithm. We develop a novel capacity-dependent analysis of the performance of the last iterate of the online learning algorithm, which answers an open problem in learning theory. The contribution of this paper is twofold. First, our novel capacity-dependent analysis leads to a sharp convergence rate in the standard mean square distance, improving results in the literature. Second, we establish, for the first time, the strong convergence of the last iterate with polynomially decaying step sizes in the RKHS norm. We demonstrate that the theoretical analysis established in this paper fully exploits the fine structure of the underlying RKHS and thus leads to sharp error estimates for the online learning algorithm.
Kosorok M.R., Laber E.B.
2019-03-07 citations by CoLab: 233 Abstract  
Precision medicine seeks to maximize the quality of health care by individualizing the health-care process to the uniquely evolving health status of each patient. This endeavor spans a broad range of scientific areas including drug discovery, genetics/genomics, health communication, and causal inference, all in support of evidence-based, i.e., data-driven, decision making. Precision medicine is formalized as a treatment regime that comprises a sequence of decision rules, one per decision point, which map up-to-date patient information to a recommended action. The potential actions could be the selection of which drug to use, the selection of dose, the timing of administration, the recommendation of a specific diet or exercise, or other aspects of treatment or care. Statistics research in precision medicine is broadly focused on methodological development for estimation of and inference for treatment regimes that maximize some cumulative clinical outcome. In this review, we provide an overview of this vibrant area of research and present important and emerging challenges.
Lin J., Zhou D.
2018-06-01 citations by CoLab: 21 Abstract  
Online learning algorithms in a reproducing kernel Hilbert space associated with convex loss functions are studied. We show that, in terms of the expected excess generalization error, they can converge comparably fast to corresponding kernel-based batch learning algorithms. Under mild conditions on the loss functions and approximation errors, fast learning rates and finite-sample upper bounds are established using polynomially decreasing step-size sequences. For some commonly used classification loss functions, such as the logistic and the p-norm hinge loss functions with p ∈ [1, 2], the learning rates are the same as those for Tikhonov regularization and can be of order O(T^{-1/2} log T), which is nearly optimal up to a logarithmic factor. Our novelty lies in a sharp estimate for the expected norms of the learning sequence (or an inductive argument to uniformly bound the expected risks of the learning sequence) and a refined error decomposition for online learning algorithms.
