ACM Transactions on Asian and Low-Resource Language Information Processing, volume 24, issue 4, pages 1-26

New Bagging Based Ensemble Learning Algorithm Distinguishing Short and Long Texts for Document Classification

Youwei Wang ¹

Lizhou Feng ²


Central University of Finance and Economics, Beijing, China

Tianjin University of Finance and Economics, Tianjin, China

Publication type: Journal Article

Publication date: 2025-03-23

Publisher: Association for Computing Machinery (ACM)

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing

Scimago quartile: Q2

SJR: 0.535

CiteScore: 3.6

Impact factor: 1.8

ISSN: 23754699 (Print), 23754702 (Electronic)

DOI: 10.1145/3718740


Abstract

To improve the classification accuracy of ensemble learning, a new bootstrap aggregating (Bagging) ensemble learning algorithm that distinguishes short and long texts is proposed for document classification. First, the performance of typical deep learning methods on long and short texts is compared, and the best-performing base classifier is selected for each of the two text types. Second, the random sampling method of traditional Bagging classification algorithms is improved: a threshold-group-based random sampling method is proposed that balances the numbers of long-text and short-text subsets. Third, to improve both inference speed and classification accuracy, knowledge distillation is incorporated into the training on the long-text and short-text subsets. Finally, the classification probabilities of each sample over the categories are taken into account, and category similarity information is combined with the traditional weighted-voting ensemble method, so that the accuracy loss that the sampling process may cause is avoided. Experimental results on multiple datasets show that the algorithm improves document classification accuracy and has clear advantages over typical deep learning algorithms and ensemble learning algorithms.
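The abstract gives no thresholds, subset counts, or voting weights, so the following is only a minimal sketch of the general idea: split documents by length, draw an equal number of bootstrap subsets from each group, and combine base-classifier outputs by weighted soft voting. All function names, the token-count threshold, and the weights are hypothetical. The paper's threshold-group sampling, knowledge distillation step, and category-similarity term are not reproduced here.

```python
import random
from collections import Counter


def split_by_length(docs, threshold):
    """Partition (text, label) pairs into short and long groups.

    Length is measured as a whitespace token count; `threshold` is an
    assumed cut-off, not a value taken from the paper.
    """
    short = [d for d in docs if len(d[0].split()) <= threshold]
    long_ = [d for d in docs if len(d[0].split()) > threshold]
    return short, long_


def balanced_bootstrap(short, long_, n_subsets, subset_size, rng):
    """Draw the same number of bootstrap subsets from each length group.

    Each subset is sampled with replacement, as in plain Bagging.
    Both pools must be non-empty.
    """
    def draw(pool):
        return [rng.choices(pool, k=subset_size) for _ in range(n_subsets)]

    return draw(short), draw(long_)


def weighted_soft_vote(prob_rows, weights):
    """Combine per-classifier class-probability dicts into one label.

    `prob_rows[i]` maps label -> probability for base classifier i;
    `weights[i]` is that classifier's (assumed) voting weight.
    """
    totals = Counter()
    for probs, w in zip(prob_rows, weights):
        for label, p in probs.items():
            totals[label] += w * p
    return max(totals, key=totals.get)
```

In this sketch, one base classifier would be trained per subset (for example, a different model family for the short-text and long-text subsets), and `weighted_soft_vote` would merge their predicted probabilities for a test document.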
