International Journal of Software Engineering and Knowledge Engineering, volume 34, issue 12, pages 1949-1970

Boosting Commit Classification Based on Multivariate Mixed Features and Heterogeneous Classifier Selection

Yuhan Wu 1
Yingling Li 1
Ziao Wang 1
Qushan Tan 2
Jing Liu 3
Yuao Jiang 1
2 Sichuan Digital Transportation Technology, Co., Ltd., Chengdu, P. R. China
3 Chengdu Branch of Sichuan Chengmian, Cangba Expressway Co., Ltd., Chengdu 641400, P. R. China
Publication type: Journal Article
Publication date: 2024-09-30
Scimago quartile: Q3
SJR: 0.251
CiteScore: 1.9
Impact factor: 0.6
ISSN: 0218-1940, 1793-6403
Abstract

Commit classification plays a crucial role in software maintenance, as it helps developers make informed decisions about resource allocation and code review. Several approaches to automatic commit classification exist, yet they neither sufficiently explore commit-related features nor consider the advantages of ensemble models over individual models, which leaves room for improvement. In this paper, we propose MuheCC, a commit classification approach based on multivariate mixed features and heterogeneous classifier selection, to address these limitations. It consists of three main phases: (1) multivariate mixed feature extraction, which combines features extracted from commit messages and changed code with handcrafted features to construct comprehensive mixed features; (2) hyperparameter tuning based on a genetic algorithm, which optimizes the candidate traditional models and ensemble models; and (3) heterogeneous classifier selection, which selects the optimal combinations of traditional models and of ensemble models, respectively, to build a heterogeneous classifier for commit classification. To evaluate this approach, we extend an existing dataset with the code changes for each commit and compare MuheCC with three baselines on this real-world dataset. The results show that MuheCC outperforms all baselines, improving on the best baseline by 7.25% in accuracy, 6.88% in precision, 7.25% in recall and 7.06% in F1-score. Furthermore, the ablation experiments confirm that the performance advantage of MuheCC is mainly attributable to the multivariate features (e.g. a 12.55% contribution to accuracy) and the heterogeneous classifier (e.g. a 12.26% contribution to accuracy). We further discuss the impact of hyperparameter tuning and heterogeneous classifier selection on the performance of MuheCC. These results demonstrate the superiority and potential practical value of MuheCC.
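
The abstract does not say how the optimal combination of models is chosen in phase (3). As a rough illustration only, and not MuheCC's actual procedure, the sketch below assumes an exhaustive search over subsets of candidate models, scored by the accuracy of a plurality vote on held-out labels. The function names, the model names (svm, knn, rf), the class labels and the toy predictions are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_vote(predictions):
    """Merge several per-model label lists into one list by plurality vote."""
    voted = []
    for sample_preds in zip(*predictions):
        # On a tie, Counter.most_common keeps first-seen order,
        # so the model listed earlier wins.
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_combination(candidates, y_true, min_size=2):
    """Try every subset of at least min_size models; return (names, accuracy)."""
    best_names, best_acc = None, -1.0
    names = sorted(candidates)
    for size in range(min_size, len(names) + 1):
        for combo in combinations(names, size):
            voted = majority_vote([candidates[n] for n in combo])
            acc = accuracy(y_true, voted)
            if acc > best_acc:  # strict '>' keeps the smaller subset on ties
                best_names, best_acc = combo, acc
    return best_names, best_acc


# Hypothetical validation labels and per-model predictions (toy data).
y_true = ["corrective", "adaptive", "perfective",
          "corrective", "adaptive", "perfective"]
candidates = {
    "svm": ["corrective", "adaptive", "perfective",
            "corrective", "perfective", "corrective"],
    "knn": ["corrective", "adaptive", "adaptive",
            "adaptive", "adaptive", "perfective"],
    "rf":  ["adaptive", "perfective", "perfective",
            "corrective", "adaptive", "perfective"],
}

best_names, best_acc = select_combination(candidates, y_true)
print(best_names, best_acc)
```

In this toy data each model makes its errors on different samples, so the three-model vote corrects all of them while no pair does. To mirror the abstract's wording, the same search could be run once over a pool of traditional models and once over a pool of ensemble models. Exhaustive search grows exponentially with the number of candidates, so it is only practical for small pools.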
