Feature Learning via Correlation Analysis for Effective Duplicate Detection
With the growing reliance on software, the frequency of software bugs has increased significantly. Users and developers report these bugs by submitting bug reports, which developers then analyze and resolve. However, many submitted reports duplicate previously reported issues, creating inefficiencies in the bug resolution process. An automatic method for detecting duplicate bug reports is therefore essential to developer productivity. In this study, we present a novel approach for classifying bug reports as duplicate or nonduplicate using feature learning through correlation analysis. Our method uses bug report features, including product and component information, extracted from bug repositories. The bug reports are first preprocessed to ensure data quality. Next, a feature selection algorithm identifies relevant features, which are then used to train a machine learning model based on bidirectional encoder representations from transformers (BERT). We evaluated the proposed model on seven datasets: Apache, JDT, Platform, KDE, Core, Firefox, and Thunderbird. It achieved detection accuracies of 91.41%, 88.66%, 86.08%, 92.94%, 90.68%, 88.25%, and 91.62%, respectively. These results represent an improvement of 32% to 41% over baseline models, including convolutional neural networks (CNNs), long short-term memory networks (LSTMs), convolutional LSTMs (CNN-LSTMs), Naive Bayes classifiers, and random forest classifiers. Our findings indicate that the proposed model is highly effective for duplicate bug report prediction and has the potential to streamline bug management processes and improve overall software development efficiency.
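To make the feature selection step more concrete, the following is a minimal sketch of correlation-based selection and should not be read as the exact algorithm used in this study. It assumes each training example describes a pair of bug reports with binary match features, scores every feature by its Pearson correlation with the duplicate label, and keeps those above a cutoff. The names `same_product`, `same_component`, `same_priority`, `select_features`, the toy data, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(rows, labels, threshold=0.5):
    """Keep features whose absolute correlation with the label meets the threshold."""
    scores = {}
    for name in rows[0]:
        column = [row[name] for row in rows]
        scores[name] = pearson(column, labels)
    selected = [name for name, r in scores.items() if abs(r) >= threshold]
    return selected, scores


# Hypothetical toy data: each row describes a PAIR of bug reports.
rows = [
    {"same_product": 1, "same_component": 1, "same_priority": 1},
    {"same_product": 1, "same_component": 1, "same_priority": 0},
    {"same_product": 1, "same_component": 0, "same_priority": 1},
    {"same_product": 0, "same_component": 0, "same_priority": 0},
    {"same_product": 0, "same_component": 0, "same_priority": 1},
    {"same_product": 1, "same_component": 1, "same_priority": 0},
]
labels = [1, 1, 0, 0, 0, 1]  # 1 = duplicate pair, 0 = nonduplicate pair

selected, scores = select_features(rows, labels)
print(selected)
```

On this toy data the product and component features pass the cutoff while the weakly correlated priority feature is dropped. In a pipeline like the one described above, the retained fields would presumably be combined with the report text before being passed to the BERT-based classifier, but how the study does this is not specified here.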