Open Access

Eurasip Journal on Audio, Speech, and Music Processing, volume 2025, issue 1, publication number 6

A speech recognition method with enhanced transformer decoder

Hengbo Hu ¹

Tong Niu ²

Zhenhua He ¹


¹ Research and Development Department 1 - Intelligent Speech Technology Team, Zhengzhou Xinda Institute of Advanced Technology, Zhengzhou, China

² School of Information Systems Engineering, University of Information Engineering, Zhengzhou, China

Publication type: Journal Article

Publication date: 2025-02-05

Springer Nature


scimago Q2

wos Q2

SJR: 0.417

CiteScore: 4.5

Impact factor: 1.9

ISSN: 1687-4714 (Print), 1687-4722 (Electronic)

DOI: 10.1186/s13636-025-00394-6


Abstract

The Transformer decoder struggles to capture the local features needed for monotonic alignment in speech recognition. To address this, and to incorporate language model cold fusion training into the decoder, this paper investigates a speech recognition model built on an enhanced decoder. The enhanced decoder separates the two attention mechanisms of the Transformer decoder and regroups them into cross-attention layers and a self-attention language model module. The cross-attention layers capture local features from the encoder output more efficiently, while the self-attention language model module is pre-trained on additional domain-related text and then trained with cold fusion. Experiments on the Mandarin Aishell-1 dataset show that, with a Conformer encoder, the enhanced decoder reduces the character error rate by 16.1% compared with the Transformer decoder. When the language model is pre-trained on suitable text data, the performance of the cold-fusion-trained model improves further.
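The split described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names (`attend`, `gated_fusion`, `enhanced_decoder_step`) are hypothetical, the use of token embeddings as cross-attention queries is an assumption, and learned projections, multiple heads, and feed-forward layers are omitted. The gate loosely follows the commonly described cold fusion form, in which a sigmoid gate scales the language model state before it is concatenated with the acoustic state; a fixed element-wise sum stands in for the learned affine map.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attend(queries, keys, values, causal=False):
    # Scaled dot-product attention with identity projections (no learned weights).
    # With causal=True, query i only sees keys 0..i, as in a language model.
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys[:limit]]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs

def gated_fusion(acoustic_states, lm_states):
    # Cold-fusion-style gating (assumed form): gate = sigmoid(f(acoustic, lm)),
    # fused = [acoustic ; gate * lm]. Here f is a fixed element-wise sum,
    # standing in for the learned affine map of a trained model.
    fused = []
    for a_vec, l_vec in zip(acoustic_states, lm_states):
        gate = [sigmoid(a + l) for a, l in zip(a_vec, l_vec)]
        fused.append(list(a_vec) + [g * l for g, l in zip(gate, l_vec)])
    return fused

def enhanced_decoder_step(token_embeddings, encoder_output):
    # Self-attention language model module: text only, causal.
    lm_states = attend(token_embeddings, token_embeddings, token_embeddings, causal=True)
    # Cross-attention layers: queries attend to the encoder output only.
    acoustic_states = attend(token_embeddings, encoder_output, encoder_output)
    return gated_fusion(acoustic_states, lm_states)

# Toy inputs (hypothetical values, chosen only to exercise the code).
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_embeddings = [[0.5, 0.1], [0.2, 0.9]]
fused = enhanced_decoder_step(token_embeddings, encoder_output)
```

Because the language model branch in this sketch never touches the encoder output, it could in principle be pre-trained on text alone before the fusion step, which is the property the abstract relies on.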


GOST citation

Hu H., Niu T., He Z. A speech recognition method with enhanced transformer decoder // Eurasip Journal on Audio, Speech, and Music Processing. 2025. Vol. 2025. No. 1. 6

RIS citation

TY  - JOUR
DO  - 10.1186/s13636-025-00394-6
UR  - https://asmp-eurasipjournals.springeropen.com/articles/10.1186/s13636-025-00394-6
TI  - A speech recognition method with enhanced transformer decoder
T2  - Eurasip Journal on Audio, Speech, and Music Processing
AU  - Hu, Hengbo
AU  - Niu, Tong
AU  - He, Zhenhua
PY  - 2025
DA  - 2025/02/05
PB  - Springer Nature
IS  - 1
VL  - 2025
SN  - 1687-4714
SN  - 1687-4722
ER  -

BibTeX citation

@article{2025_Hu,
  author = {Hengbo Hu and Tong Niu and Zhenhua He},
  title = {A speech recognition method with enhanced transformer decoder},
  journal = {Eurasip Journal on Audio, Speech, and Music Processing},
  year = {2025},
  volume = {2025},
  publisher = {Springer Nature},
  month = {feb},
  url = {https://asmp-eurasipjournals.springeropen.com/articles/10.1186/s13636-025-00394-6},
  number = {1},
  pages = {6},
  doi = {10.1186/s13636-025-00394-6}
}
