Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports
Hong-Jun Yoon 1, Hilda B. Klasky 1, J. Gounley 1, Mohammed Alawad 1, Shang Gao 1, Eric B. Durbin 2, Xiao-Cheng Wu 3, Antoinette Stroup 4, Jennifer Doherty 5, Linda Coyle 6, Lynne Penberthy 7, J. Blair Christian 1, Georgia D. Tourassi 8
6 Information Management Services Inc., Calverton, MD 20705, United States of America
Publication type: Journal Article
Publication date: 2020-10-01
Scimago quartile: Q1
WoS quartile: Q2
SJR: 1.257
CiteScore: 10.2
Impact factor: 4.5
ISSN: 1532-0464, 1532-0480
PubMed ID: 32919043
Computer Science Applications
Health Informatics
Abstract
In machine learning, it is well established that classification task performance improves when bootstrap aggregation (bagging) is applied. However, bagging deep neural networks requires tremendous computational resources and training time. The research question we aimed to answer is whether we could achieve higher task performance scores and accelerate training by dividing a problem into sub-problems.
The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN) classifiers. We split one large problem into 20 sub-problems, resampled the training cases 2,000 times, and trained a deep learning model for each bootstrap sample and each sub-problem, thus generating up to 40,000 models. We trained the many models concurrently in a high-performance computing environment at Oak Ridge National Laboratory (ORNL).
We demonstrated that aggregating the models improves task performance compared with the single-model approach, which is consistent with other studies, and that the two proposed partitioned bagging methods achieved higher classification accuracy scores on four tasks. Notably, the improvements were significant for the extraction of cancer histology data, a task with more than 500 class labels; these results show that data partitioning may alleviate task complexity. In contrast, the methods did not achieve superior scores for the site and subsite classification tasks. Because data partitioning was based on the primary cancer site, accuracy depended on how the partitions were determined, which needs further investigation and improvement.
The results of this research demonstrate that (1) the data partitioning and bagging strategy achieved higher performance scores, and (2) training was accelerated by leveraging the high-performance Summit supercomputer at ORNL.
Highlights
• We demonstrated that bagging is an effective way of boosting information extraction performance.
• We designed, developed, and evaluated two data partitioning approaches.
• The proposed approaches alleviate the complexity of classification tasks.
• Our results demonstrated a significant performance boost in macro-F1 scores.
• We trained deep learning models in parallel on the Summit supercomputer.
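For illustration, the following Python sketch outlines how the partitioned bagging scheme described in the abstract could be organized: partition the corpus into sub-problems (e.g., by primary cancer site), draw bootstrap resamples within each partition, train one model per resample, and aggregate predictions by majority vote. This is a minimal sketch under stated assumptions, not the authors' code; train_text_classifier and the fitted models' predict() method are hypothetical stand-ins for MT-CNN/MT-HCAN training and inference routines, and the nested loops shown here serially would, in the paper's setting, be distributed across Summit nodes.

# Minimal sketch of partitioned bootstrap aggregation (bagging).
# `train_text_classifier` and `.predict()` are hypothetical placeholders.
from collections import Counter, defaultdict
import numpy as np

def partitioned_bagging_train(reports, labels, partition_ids,
                              train_text_classifier, n_bootstrap=2000, seed=0):
    """Return {partition_id: [fitted models]} trained on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    partition_ids = np.asarray(partition_ids)
    models = defaultdict(list)
    for pid in np.unique(partition_ids):
        idx = np.flatnonzero(partition_ids == pid)   # cases in this sub-problem
        for _ in range(n_bootstrap):                 # resample with replacement
            boot = rng.choice(idx, size=idx.size, replace=True)
            models[pid].append(
                train_text_classifier([reports[i] for i in boot],
                                      [labels[i] for i in boot]))
    return models

def bagged_predict(report, pid, models):
    """Majority vote over the ensemble trained for the report's partition."""
    votes = [m.predict(report) for m in models[pid]]
    return Counter(votes).most_common(1)[0][0]

With 20 partitions and 2,000 resamples each, this procedure yields the up-to-40,000 models mentioned in the abstract; since every (partition, resample) training job is independent, they can be launched concurrently on a high-performance computing system.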
Top-30 Journals
- Journal of Biomedical Informatics: 2 publications, 15.38%
- SPE Journal: 1 publication, 7.69%
- Wireless Communications and Mobile Computing: 1 publication, 7.69%
- Aslib Journal of Information Management: 1 publication, 7.69%
- Frontiers in Oncology: 1 publication, 7.69%
- JCO Clinical Cancer Informatics: 1 publication, 7.69%
Publishers
- Elsevier: 2 publications, 15.38%
- 1 publication, 7.69%
- Hindawi Limited: 1 publication, 7.69%
- Emerald: 1 publication, 7.69%
- Frontiers Media S.A.: 1 publication, 7.69%
- American Society of Clinical Oncology (ASCO): 1 publication, 7.69%
- Institute of Electrical and Electronics Engineers (IEEE): 1 publication, 7.69%
- We do not take into account publications without a DOI.
- Statistics recalculated weekly.
Metrics
Total citations: 13
Citations from 2024: 0
Cite this
GOST
Yoon H. et al. Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports // Journal of Biomedical Informatics. 2020. Vol. 110. p. 103564.
GOST all authors (up to 50)
Yoon H., Klasky H. B., Gounley J., Alawad M., Gao S., Durbin E. B., Wu X., Stroup A., Doherty J., Coyle L., Penberthy L., Christian J. B., Tourassi G. D. Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports // Journal of Biomedical Informatics. 2020. Vol. 110. p. 103564.
RIS
TY - JOUR
DO - 10.1016/j.jbi.2020.103564
UR - https://doi.org/10.1016/j.jbi.2020.103564
TI - Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports
T2 - Journal of Biomedical Informatics
AU - Yoon, Hong-Jun
AU - Klasky, Hilda B
AU - Gounley, J.
AU - Alawad, Mohammed
AU - Gao, Shang
AU - Durbin, Eric B.
AU - Wu, Xiao-Cheng
AU - Stroup, Antoinette
AU - Doherty, Jennifer
AU - Coyle, Linda
AU - Penberthy, Lynne
AU - Christian, J Blair
AU - Tourassi, Georgia D.
PY - 2020
DA - 2020/10/01
PB - Elsevier
SP - 103564
VL - 110
PMID - 32919043
SN - 1532-0464
SN - 1532-0480
ER -
BibTeX (up to 50 authors)
@article{2020_Yoon,
author = {Hong-Jun Yoon and Hilda B. Klasky and J. Gounley and Mohammed Alawad and Shang Gao and Eric B. Durbin and Xiao-Cheng Wu and Antoinette Stroup and Jennifer Doherty and Linda Coyle and Lynne Penberthy and J. Blair Christian and Georgia D. Tourassi},
title = {Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports},
journal = {Journal of Biomedical Informatics},
year = {2020},
volume = {110},
publisher = {Elsevier},
month = {oct},
url = {https://doi.org/10.1016/j.jbi.2020.103564},
pages = {103564},
doi = {10.1016/j.jbi.2020.103564}
}