A new class of information criteria for improved prediction in the presence of training/validation data heterogeneity
Publication type: Journal Article
Publication date: 2024-10-02
Scimago: Q2
WoS: Q2
SJR: 0.750
CiteScore: 3.0
Impact factor: 1.4
ISSN: 0943-4062, 1613-9658
Abstract
Information criteria provide a cogent approach for identifying models that provide an optimal balance between the competing objectives of goodness-of-fit and parsimony. Models that better conform to a dataset are often more complex, yet such models are plagued by greater variability in estimation and prediction. Conversely, overly simplistic models reduce variability at the cost of increases in bias. Asymptotically efficient criteria are those that, for large samples, select the fitted candidate model whose predictors minimize the mean squared prediction error, optimizing between prediction bias and variability. In the context of prediction, asymptotically efficient criteria are thus a preferred tool for model selection, with the Akaike information criterion (AIC) being among the most widely used. However, asymptotic efficiency relies upon the assumption of a panel of validation data generated independently from, but identically to, the set of training data. We argue that assuming identically distributed training and validation data is misaligned with the premise of prediction and often violated in practice. This is most apparent in a regression context, where assuming training/validation data homogeneity requires identical panels of regressors. We therefore develop a new class of predictive information criteria (PIC) that do not assume training/validation data homogeneity and are shown to generalize AIC to the more practically relevant setting of training/validation data heterogeneity. The analytic properties and predictive performance of these new criteria are explored within the traditional regression framework. We consider both simulated and real-data settings. Software for implementing these methods is provided in the R package, picR, available through CRAN.
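The abstract notes that the methods are implemented in the R package picR, available on CRAN. Below is a minimal sketch, not the authors' exact workflow, of how such an analysis might look: two candidate regression models are fit to training data and compared both by classical AIC and by a predictive criterion evaluated against a validation design that differs from the training design. The PIC() call and its newdata argument are assumed here from the package's stated purpose; consult the picR documentation for the actual interface.

# A minimal sketch, assuming picR provides PIC() for fitted "lm" objects
# with a newdata argument for the validation regressors (check ?picR::PIC).
library(picR)

set.seed(1)

# Training data: regressors drawn from one design
n_train <- 100
train <- data.frame(x1 = rnorm(n_train), x2 = rnorm(n_train))
train$y <- 1 + 2 * train$x1 + rnorm(n_train)

# Validation regressors from a different design (training/validation
# heterogeneity); AIC implicitly assumes these match the training design
n_valid <- 100
valid <- data.frame(x1 = rnorm(n_valid, mean = 2), x2 = runif(n_valid, 0, 3))

# Two nested candidate regression models fit to the training data
m1 <- lm(y ~ x1, data = train)
m2 <- lm(y ~ x1 + x2, data = train)

# Classical AIC ignores the validation design ...
AIC(m1, m2)

# ... whereas a predictive criterion scores each candidate against the
# intended validation regressors (assumed signature; see the package docs)
PIC(m1, newdata = valid)
PIC(m2, newdata = valid)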
Metrics
Total citations: 0
Cite this
GOST
Flores J. E. et al. A new class of information criteria for improved prediction in the presence of training/validation data heterogeneity // Computational Statistics. 2024.
GOST (all authors)
Flores J. E., Cavanaugh J. E., Neath A. A. A new class of information criteria for improved prediction in the presence of training/validation data heterogeneity // Computational Statistics. 2024.
RIS
TY - JOUR
DO - 10.1007/s00180-024-01559-1
UR - https://link.springer.com/10.1007/s00180-024-01559-1
TI - A new class of information criteria for improved prediction in the presence of training/validation data heterogeneity
T2 - Computational Statistics
AU - Flores, Javier E.
AU - Cavanaugh, Joseph E.
AU - Neath, Andrew A.
PY - 2024
DA - 2024/10/02
PB - Springer Nature
SN - 0943-4062
SN - 1613-9658
ER -
BibTex
@article{2024_Flores,
author = {Javier E. Flores and Joseph E. Cavanaugh and Andrew A. Neath},
title = {A new class of information criteria for improved prediction in the presence of training/validation data heterogeneity},
journal = {Computational Statistics},
year = {2024},
publisher = {Springer Nature},
month = {oct},
url = {https://link.springer.com/10.1007/s00180-024-01559-1},
doi = {10.1007/s00180-024-01559-1}
}