Tapping on the Black Box: How Is the Scoring Power of a Machine-Learning Scoring Function Dependent on the Training Set?
Publication type: Journal Article
Publication date: 2020-02-21
scimago Q1
wos Q1
SJR: 1.467
CiteScore: 9.8
Impact factor: 5.3
ISSN: 15499596, 1549960X
PubMed ID: 32085675
General Chemistry
Computer Science Applications
General Chemical Engineering
Library and Information Sciences
Abstract
In recent years, protein-ligand interaction scoring functions derived through machine learning have repeatedly been reported to outperform conventional scoring functions. However, several published studies have raised the concern that the superior performance of machine-learning scoring functions depends on the overlap between the training set and the test set. In order to examine the true power of machine-learning algorithms in scoring function formulation, we have conducted a systematic study of six off-the-shelf machine-learning algorithms: Bayesian Ridge Regression (BRR), Decision Tree (DT), K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), Linear Support Vector Regression (L-SVR), and Random Forest (RF). Model scoring functions were derived with these machine-learning algorithms on various training sets selected from over 3700 protein-ligand complexes in the PDBbind refined set (version 2016). All resulting scoring functions were then applied to the CASF-2016 test set to validate their scoring power. In our first series of trials, the size of the training set was fixed, while the overall similarity between the training set and the test set was varied systematically. In our second series of trials, the overall similarity between the training set and the test set was fixed, while the size of the training set was varied. Our results indicate that the performance of those machine-learning models depends, to varying degrees, on the contents or the size of the training set, with the RF model demonstrating the best learning capability. In contrast, the performance of three conventional scoring functions (i.e., ChemScore, ASP, and X-Score) is basically insensitive to the use of different training sets. Therefore, one has to consider not only hard overlap but also soft overlap between the training set and the test set in order to evaluate machine-learning scoring functions.
In this spirit, we have compiled data sets based on the PDBbind refined set by removing redundant samples under several similarity thresholds. Scoring function developers are encouraged to employ them as standard training sets if they want to evaluate their new models on the CASF-2016 benchmark.
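The idea of removing training samples that overlap with the test set under a similarity threshold can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the abstract does not state how similarity was measured, so a Tanimoto coefficient over toy feature sets stands in for it, and the sample layout (dictionaries with `id` and `features` keys) and the function names are hypothetical. A Pearson correlation helper is included because scoring power is commonly reported as the correlation between predicted and experimental binding data.

```python
import math


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two feature sets (illustrative stand-in)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_training_set(train, test, cutoff, similarity=tanimoto):
    """Return training samples that survive both overlap checks.

    Hard overlap: a training sample whose id also appears in the test set.
    Soft overlap: a training sample whose highest similarity to any test
    sample exceeds `cutoff`.
    """
    test_ids = {t["id"] for t in test}
    kept = []
    for sample in train:
        if sample["id"] in test_ids:
            continue  # hard overlap: identical entry in the test set
        max_sim = max(
            (similarity(sample["features"], t["features"]) for t in test),
            default=0.0,
        )
        if max_sim <= cutoff:
            kept.append(sample)  # dissimilar enough to every test sample
    return kept


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Toy data (made up for this sketch, not taken from PDBbind or CASF-2016).
toy_test = [{"id": "T1", "features": {1, 2, 3, 4}}]
toy_train = [
    {"id": "A", "features": {1, 2, 3, 4}},  # identical features to T1
    {"id": "B", "features": {1, 2, 3, 5}},  # shares 3 of 5 distinct features
    {"id": "C", "features": {7, 8, 9}},     # shares nothing with T1
    {"id": "T1", "features": {1, 2, 3, 4}}, # hard overlap with the test set
]

if __name__ == "__main__":
    for cutoff in (0.5, 0.8, 1.0):
        kept = filter_training_set(toy_train, toy_test, cutoff)
        print(cutoff, [s["id"] for s in kept])
```

Lowering the cutoff removes more of the training samples that resemble the test set, which is the knob a similarity-controlled series of trials would turn; retraining a model on each filtered set and comparing the resulting correlations on the test set would then show how strongly its scoring power depends on soft overlap.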
Top-30
Journals
- Journal of Chemical Information and Modeling: 15 publications, 21.13%
- Briefings in Bioinformatics: 5 publications, 7.04%
- Scientific Reports: 4 publications, 5.63%
- Journal of Cheminformatics: 3 publications, 4.23%
- ACS Omega: 3 publications, 4.23%
- Drug Discovery Today: 2 publications, 2.82%
- Physical Chemistry Chemical Physics: 2 publications, 2.82%
- International Journal of Molecular Sciences: 1 publication, 1.41%
- Molecular Informatics: 1 publication, 1.41%
- Molecules: 1 publication, 1.41%
- Frontiers in Molecular Biosciences: 1 publication, 1.41%
- Frontiers in Bioinformatics: 1 publication, 1.41%
- Computers: 1 publication, 1.41%
- BMC Bioinformatics: 1 publication, 1.41%
- Chemical Physics Letters: 1 publication, 1.41%
- Analytica Chimica Acta: 1 publication, 1.41%
- Journal of Molecular Graphics and Modelling: 1 publication, 1.41%
- Journal of Medicinal Chemistry: 1 publication, 1.41%
- Expert Opinion on Drug Discovery: 1 publication, 1.41%
- Saudi Dental Journal: 1 publication, 1.41%
- Chemical Science: 1 publication, 1.41%
- Analytical Chemistry: 1 publication, 1.41%
- Proteins: Structure, Function and Genetics: 1 publication, 1.41%
- Machine Learning: Science and Technology: 1 publication, 1.41%
- Journal of Physical Chemistry B: 1 publication, 1.41%
- Mendeleev Communications: 1 publication, 1.41%
- Digital Discovery: 1 publication, 1.41%
- Wiley Interdisciplinary Reviews: Computational Molecular Science: 1 publication, 1.41%
- Nature Machine Intelligence: 1 publication, 1.41%
Publishers
- American Chemical Society (ACS): 21 publications, 29.58%
- Springer Nature: 13 publications, 18.31%
- Elsevier: 6 publications, 8.45%
- Oxford University Press: 6 publications, 8.45%
- Royal Society of Chemistry (RSC): 5 publications, 7.04%
- Wiley: 4 publications, 5.63%
- MDPI: 3 publications, 4.23%
- Cold Spring Harbor Laboratory: 3 publications, 4.23%
- Frontiers Media S.A.: 2 publications, 2.82%
- Taylor & Francis: 1 publication, 1.41%
- King Saud University: 1 publication, 1.41%
- IOP Publishing: 1 publication, 1.41%
- OOO Zhurnal "Mendeleevskie Soobshcheniya": 1 publication, 1.41%
- IntechOpen: 1 publication, 1.41%
- Institute of Electrical and Electronics Engineers (IEEE): 1 publication, 1.41%
- International Press of Boston: 1 publication, 1.41%
- We do not take into account publications without a DOI.
- Statistics recalculated weekly.
Metrics
Total citations: 71
Citations from 2025: 7 (9.86%)
Cite this
GOST
Su M. et al. Tapping on the Black Box: How Is the Scoring Power of a Machine-Learning Scoring Function Dependent on the Training Set? // Journal of Chemical Information and Modeling. 2020. Vol. 60. No. 3. pp. 1122-1136.
GOST all authors (up to 50)
Su M., Feng G., Liu Z., Li Y., Wang R. Tapping on the Black Box: How Is the Scoring Power of a Machine-Learning Scoring Function Dependent on the Training Set? // Journal of Chemical Information and Modeling. 2020. Vol. 60. No. 3. pp. 1122-1136.
RIS
TY - JOUR
DO - 10.1021/acs.jcim.9b00714
UR - https://doi.org/10.1021/acs.jcim.9b00714
TI - Tapping on the Black Box: How Is the Scoring Power of a Machine-Learning Scoring Function Dependent on the Training Set?
T2 - Journal of Chemical Information and Modeling
AU - Su, Minyi
AU - Feng, Guoqin
AU - Liu, Zhihai
AU - Li, Yan
AU - Wang, Renxiao
PY - 2020
DA - 2020/02/21
PB - American Chemical Society (ACS)
SP - 1122
EP - 1136
IS - 3
VL - 60
PMID - 32085675
SN - 1549-9596
SN - 1549-960X
ER -
BibTex (up to 50 authors)
@article{2020_Su,
author = {Minyi Su and Guoqin Feng and Zhihai Liu and Yan Li and Renxiao Wang},
title = {Tapping on the Black Box: How Is the Scoring Power of a Machine-Learning Scoring Function Dependent on the Training Set?},
journal = {Journal of Chemical Information and Modeling},
year = {2020},
volume = {60},
publisher = {American Chemical Society (ACS)},
month = {feb},
url = {https://doi.org/10.1021/acs.jcim.9b00714},
number = {3},
pages = {1122--1136},
doi = {10.1021/acs.jcim.9b00714}
}
MLA
Su, Minyi, et al. “Tapping on the Black Box: How Is the Scoring Power of a Machine-Learning Scoring Function Dependent on the Training Set?” Journal of Chemical Information and Modeling, vol. 60, no. 3, Feb. 2020, pp. 1122-1136. https://doi.org/10.1021/acs.jcim.9b00714.