Open Access
EURASIP Journal on Audio, Speech, and Music Processing, volume 2025, issue 1, article number 9

Enhancing Speaker Recognition with the CRET Model: A Fusion of Conv2D, ResNet, and ECAPA-TDNN

Publication type: Journal Article
Publication date: 2025-02-14
Scimago quartile: Q2
SJR: 0.414
CiteScore: 4.1
Impact factor: 1.7
ISSN: 1687-4714, 1687-4722
Abstract
Speaker recognition plays an increasingly important role in today’s society, and neural networks are now widely employed to extract speaker features. Although the Emphasized Channel Attention, Propagation, and Aggregation in Time Delay Neural Network (ECAPA-TDNN) model can, to some extent, capture temporal context through dilated convolution, it falls short of acquiring fully comprehensive speech features. To further improve accuracy, better capture temporal context, and make ECAPA-TDNN robust to small offsets in the frequency domain, we combine a two-dimensional convolutional network (Conv2D), a residual network (ResNet), and ECAPA-TDNN into a novel CRET model. In this study, two CRET models are proposed and compared with the baseline models Multi-Scale Backbone Architecture (Res2Net) and ECAPA-TDNN across different channel widths and different datasets. The experimental findings indicate that the proposed models perform strongly on both training and test sets, even when the network is deep. Our best model, trained on the VoxCeleb2 dataset with 1024 channels, achieves an accuracy of 0.97828 and, on the VoxCeleb1-O test set, an equal error rate (EER) of 0.03612 and a minimum detection cost function (MinDCF) of 0.43967. This technology can improve public safety and service efficiency in smart city construction, benefit finance, education, and other fields, and bring more convenience to people's lives.
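As a rough illustration of the fusion the abstract describes, the following PyTorch sketch places a Conv2D stem and ResNet-style residual blocks in front of a dilated 1-D convolution stage with attentive statistics pooling, loosely in the spirit of ECAPA-TDNN. This is a minimal sketch, not the authors' CRET implementation: the class names, all layer sizes, and the 80-mel input shape are assumptions made for illustration.

```python
# Minimal sketch of a CRET-style embedding extractor (illustrative only).
import torch
import torch.nn as nn


class ResBlock2D(nn.Module):
    """ResNet-style 2-D residual block over (freq, time) feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))  # identity shortcut


class CRETSketch(nn.Module):
    """Conv2D stem + ResNet blocks + dilated 1-D stage (assumed sizes)."""
    def __init__(self, n_mels=80, channels=1024, emb_dim=192):
        super().__init__()
        # Conv2D stem: 2-D filters give robustness to small frequency shifts.
        self.stem = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            ResBlock2D(32), ResBlock2D(32),
        )
        # Flatten (channel, freq) into one axis for the 1-D stage.
        self.proj = nn.Conv1d(32 * n_mels, channels, 1)
        # Dilated 1-D convolutions widen the temporal context (ECAPA-style).
        self.tdnn = nn.Sequential(
            nn.Conv1d(channels, channels, 3, dilation=2, padding=2),
            nn.ReLU(), nn.BatchNorm1d(channels),
            nn.Conv1d(channels, channels, 3, dilation=3, padding=3),
            nn.ReLU(), nn.BatchNorm1d(channels),
        )
        # Attention weights over time for attentive statistics pooling.
        self.attn = nn.Sequential(
            nn.Conv1d(channels, 128, 1), nn.Tanh(),
            nn.Conv1d(128, channels, 1), nn.Softmax(dim=2),
        )
        self.fc = nn.Linear(2 * channels, emb_dim)

    def forward(self, x):  # x: (batch, n_mels, time) log-mel features
        h = self.stem(x.unsqueeze(1))        # (B, 32, n_mels, T)
        h = self.proj(h.flatten(1, 2))       # (B, channels, T)
        h = self.tdnn(h)
        w = self.attn(h)                     # attention weights over time
        mu = (h * w).sum(dim=2)              # weighted mean
        sigma = ((h ** 2 * w).sum(dim=2) - mu ** 2).clamp(min=1e-6).sqrt()
        return self.fc(torch.cat([mu, sigma], dim=1))  # speaker embedding


emb = CRETSketch()(torch.randn(2, 80, 200))  # -> (2, 192) embeddings
```

In a verification setup of this kind, embeddings from two utterances would typically be compared by cosine similarity, with EER and MinDCF computed over the resulting trial scores.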