New Bagging-Based Ensemble Learning Algorithm Distinguishing Short and Long Texts for Document Classification
To improve the classification accuracy of ensemble learning, a new bootstrap aggregating (Bagging) ensemble learning algorithm that distinguishes short and long texts for document classification is proposed. First, the performance of several typical deep learning methods on long and short texts is compared, and the best-performing base classifier is selected for each text type. Second, the random sampling step of traditional Bagging classification algorithms is improved: a threshold-group-based random sampling method is proposed that balances the numbers of long-text and short-text subsets. Moreover, to improve model inference speed and classification accuracy, the long-text and short-text subsets are trained using knowledge distillation. Finally, the classification probabilities of each sample over the different categories are taken into account, and category similarity information is combined with the traditional weighted voting ensemble method to mitigate the loss of accuracy that the sampling process may cause. Experimental results on multiple datasets show that the algorithm effectively improves the accuracy of document classification and has clear advantages over typical deep learning algorithms and ensemble learning algorithms.
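The sampling and voting steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the threshold group is defined, how text length is measured, or how category similarity enters the vote. The sketch therefore assumes that length is a whitespace token count, that each threshold in a hypothetical list `thresholds` splits the corpus into short and long groups, and that equal-sized bootstrap subsets are drawn from each group. The vote shown is ordinary weighted soft voting over class probabilities; the category similarity term and the knowledge distillation step are omitted. All function and parameter names are illustrative.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_by_length(docs: Sequence[str], threshold: int) -> Tuple[List[int], List[int]]:
    """Return (short_idx, long_idx). Length = whitespace token count (assumption)."""
    short_idx: List[int] = []
    long_idx: List[int] = []
    for i, doc in enumerate(docs):
        if len(doc.split()) > threshold:
            long_idx.append(i)
        else:
            short_idx.append(i)
    return short_idx, long_idx


def balanced_bootstrap(
    docs: Sequence[str],
    thresholds: Sequence[int],  # hypothetical "threshold group"
    subset_size: int,
    seed: int = 0,
) -> List[Dict[str, object]]:
    """For each threshold, draw one short and one long bootstrap subset
    of equal size, sampling indices with replacement."""
    rng = random.Random(seed)
    subsets: List[Dict[str, object]] = []
    for t in thresholds:
        short_idx, long_idx = split_by_length(docs, t)
        if not short_idx or not long_idx:
            continue  # this threshold leaves one group empty; skip it
        subsets.append({
            "threshold": t,
            "short": rng.choices(short_idx, k=subset_size),
            "long": rng.choices(long_idx, k=subset_size),
        })
    return subsets


def weighted_soft_vote(
    probas: Sequence[Sequence[float]],  # one probability vector per classifier
    weights: Sequence[float],           # one weight per classifier
) -> Tuple[int, List[float]]:
    """Weighted average of class probabilities; returns (label, averaged vector).
    Plain soft voting only: no category similarity term is applied here."""
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [
        sum(w * p[c] for p, w in zip(probas, weights)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: avg[c])
    return label, avg
```

In such a setup, each `short` subset would train a short-text base classifier and each `long` subset a long-text base classifier, and `weighted_soft_vote` would then combine their predicted class probabilities for a test document. The choice of weights (for example, validation accuracy) is likewise an assumption.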