volume 1297 pages 342375

A general procedure for finding potentially erroneous entries in the database of retention indices

Publication type: Journal Article
Publication date: 2024-04-01
Scimago: Q1
WoS: Q1
SJR: 1.004
CiteScore: 10.4
Impact factor: 6.0
ISSN: 0003-2670, 1873-4324
Biochemistry
Spectroscopy
Analytical Chemistry
Environmental Chemistry
Abstract
The NIST retention index database is one of the most widely used sources of retention indices. In both untargeted analysis and machine learning studies, filtering for potential errors is often lacking or nonexistent. According to our estimates, about 80% of the compounds in both the NIST 17 and NIST 20 retention index databases have only one RI value per stationary phase, which makes searching for erroneous values with statistical methods impossible. Manual inspection is also impractical because the database contains more than 300 000 entries. We suggest a two-step procedure, based on machine learning, to find potentially erroneous retention indices. The first step is to use five predictive models to obtain predicted retention index values for the whole database. The second is to compare these predicted values against the experimental ones. We consider a retention index erroneous if its accuracy (the difference between the predicted and experimental values) is in the bottom 5% for each of the five models simultaneously. Using this method, we detected 2093 outlier entries for standard and semi-standard non-polar stationary phases in the NIST 17 retention index database, of which 566 were corrected or removed by the developers in NIST 20. This is a novel approach to finding potentially erroneous entries in a large-scale database with mostly unique entries, and it can be applied not only to retention indices. The procedure can help filter and report mishandled data to improve the quality of the dataset for machine learning applications and experimental use.
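The flagging rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that "accuracy in the bottom 5%" means the absolute prediction error is among the largest 5% for a given model, and that thresholds are computed over all entries passed in. The function name `flag_potential_errors`, the parameter `worst_fraction`, and the synthetic data are hypothetical; the five predictive models themselves are not specified in the abstract and are replaced here by noisy placeholder predictions.

```python
import numpy as np


def flag_potential_errors(experimental, predictions, worst_fraction=0.05):
    """Flag entries whose error is in the worst fraction for every model.

    experimental: shape (n_entries,), experimental retention indices.
    predictions:  shape (n_models, n_entries), one row per predictive model.
    Returns a boolean array of shape (n_entries,).
    """
    experimental = np.asarray(experimental, dtype=float)
    predictions = np.asarray(predictions, dtype=float)

    # Absolute difference between predicted and experimental values, per model.
    abs_err = np.abs(predictions - experimental)

    # Per-model cutoff: errors above this quantile count as "worst" (assumption).
    cutoffs = np.quantile(abs_err, 1.0 - worst_fraction, axis=1, keepdims=True)
    in_worst = abs_err > cutoffs

    # An entry is flagged only if all models agree simultaneously.
    return in_worst.all(axis=0)


# Synthetic demonstration (made-up numbers, not NIST data).
rng = np.random.default_rng(0)
true_ri = rng.uniform(800.0, 2500.0, size=200)
experimental = true_ri.copy()
experimental[17] += 500.0  # simulate one mishandled database entry

# Five placeholder "models": the true value plus independent noise.
predictions = true_ri + rng.normal(0.0, 20.0, size=(5, 200))

flags = flag_potential_errors(experimental, predictions)
print(np.flatnonzero(flags))
```

Requiring agreement across all models keeps the flagged set small: an entry that one model simply predicts poorly is unlikely to land in the worst 5% for every other model as well, whereas a genuinely mishandled experimental value tends to disagree with all of them.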

Top-30

Journals

Chemosphere: 1 publication, 12.5%
Analytical and Bioanalytical Chemistry: 1 publication, 12.5%
Journal of Separation Science: 1 publication, 12.5%
Analytica—A Journal of Analytical Chemistry and Chemical Analysis: 1 publication, 12.5%
Langmuir: 1 publication, 12.5%
Journal of Chromatography A: 1 publication, 12.5%
Separations: 1 publication, 12.5%
Russian Journal of Physical Chemistry A: 1 publication, 12.5%

Publishers

Elsevier: 2 publications, 25%
MDPI: 2 publications, 25%
Springer Nature: 1 publication, 12.5%
Wiley: 1 publication, 12.5%
American Chemical Society (ACS): 1 publication, 12.5%
Pleiades Publishing: 1 publication, 12.5%
  • We do not take into account publications without a DOI.
  • Statistics recalculated weekly.

GOST
Khrisanfov M. et al. A general procedure for finding potentially erroneous entries in the database of retention indices // Analytica Chimica Acta. 2024. Vol. 1297. p. 342375.
GOST (all authors, up to 50)
Khrisanfov M., Matyushin D. D., Samokhin A. S. A general procedure for finding potentially erroneous entries in the database of retention indices // Analytica Chimica Acta. 2024. Vol. 1297. p. 342375.
RIS
TY - JOUR
DO - 10.1016/j.aca.2024.342375
UR - https://linkinghub.elsevier.com/retrieve/pii/S0003267024001764
TI - A general procedure for finding potentially erroneous entries in the database of retention indices
T2 - Analytica Chimica Acta
AU - Khrisanfov, Mikhail
AU - Matyushin, Dmitriy D
AU - Samokhin, Andrey S
PY - 2024
DA - 2024/04/01
PB - Elsevier
SP - 342375
VL - 1297
PMID - 38438243
SN - 0003-2670
SN - 1873-4324
ER -
BibTeX
@article{2024_Khrisanfov,
author = {Mikhail Khrisanfov and Dmitriy D Matyushin and Andrey S Samokhin},
title = {A general procedure for finding potentially erroneous entries in the database of retention indices},
journal = {Analytica Chimica Acta},
year = {2024},
volume = {1297},
publisher = {Elsevier},
month = {apr},
url = {https://linkinghub.elsevier.com/retrieve/pii/S0003267024001764},
pages = {342375},
doi = {10.1016/j.aca.2024.342375}
}