Open Access
Journal of Low Power Electronics and Applications, volume 15, issue 1, article 15

Hardware/Software Co-Design Optimization for Training Recurrent Neural Networks at the Edge

Yicheng Zhang 1, 2
Bojian Yin 2
Manil Dev Gomony 2
Henk Corporaal 2
Carsten Trinitis 1
Federico Corradi 2
1 Computer Architecture and Operating Systems (CAOS), Technical University of Munich (TUM), Bildungscampus 2, 74076 Heilbronn, Germany
2 Electronic Systems, Eindhoven University of Technology (TU/e), Flux, Groene Loper 19, 5612 AP Eindhoven, The Netherlands
Publication type: Journal Article
Publication date: 2025-03-11
Scimago quartile: Q3
SJR: 0.375
CiteScore: 3.6
Impact factor: 1.6
ISSN: 2079-9268
Abstract

Edge devices execute pre-trained Artificial Intelligence (AI) models optimized on large Graphics Processing Units (GPUs); however, they frequently require fine-tuning when deployed in the real world. This fine-tuning, referred to as edge learning, is essential for personalized tasks such as speech and gesture recognition, which often necessitate the use of recurrent neural networks (RNNs). However, training RNNs on edge devices presents major challenges due to limited memory and computing resources. In this study, we propose a system for RNN training through sequence partitioning using the Forward Propagation Through Time (FPTT) training method, thereby enabling edge learning. Our optimized hardware/software co-design for FPTT represents a novel contribution in this domain. This research demonstrates the viability of FPTT for fine-tuning real-world applications by implementing a complete computational framework for training Long Short-Term Memory (LSTM) networks utilizing FPTT. Moreover, this work incorporates the optimization and exploration of a scalable digital hardware architecture using an open-source hardware-design framework named Chipyard, and its implementation on a Field-Programmable Gate Array (FPGA) for cycle-accurate verification. The empirical results demonstrate that partitioned training on the proposed architecture enables an 8.2-fold reduction in memory usage with only a 0.2× increase in latency for small-batch sequential MNIST (S-MNIST) compared to traditional non-partitioned training.
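To make the memory argument concrete, the sketch below illustrates the general idea of FPTT-style partitioned training of an LSTM classifier: the input sequence is split into K partitions, the hidden state is detached between partitions so only one partition's activations must be buffered for backpropagation, and each partial update is stabilized by a proximal term toward a running weight average. This is a minimal illustrative sketch assuming PyTorch; the partition count, regularization strength, running-average rule, and toy data are placeholders and simplifications, not the paper's hardware/software co-design or its exact FPTT update.

```python
# Hedged sketch of FPTT-style partitioned LSTM training (assumed PyTorch API).
# K, alpha, the running-average rule, and the toy batch are illustrative only.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, in_dim=1, hidden=128, classes=10):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, classes)

    def forward(self, x, state=None):
        out, state = self.lstm(x, state)          # out: (batch, time, hidden)
        return self.head(out[:, -1]), state       # classify from last step

def fptt_partitioned_step(model, opt, x, y, K=8, alpha=0.5):
    """Train on one batch by splitting the sequence into K partitions.

    After each partition the hidden state is detached, so backpropagation only
    spans that partition's activations (the source of the memory savings), and
    the loss carries an FPTT-style proximal term toward a running weight
    average to keep the sequence of partial updates consistent.
    """
    loss_fn = nn.CrossEntropyLoss()
    running = [p.detach().clone() for p in model.parameters()]
    state = None
    for chunk in torch.chunk(x, K, dim=1):        # split along the time axis
        opt.zero_grad()
        logits, state = model(chunk, state)
        reg = sum(((p - r) ** 2).sum()
                  for p, r in zip(model.parameters(), running))
        loss = loss_fn(logits, y) + 0.5 * alpha * reg
        loss.backward()
        opt.step()
        state = tuple(s.detach() for s in state)  # truncate the graph here
        with torch.no_grad():                     # simplified running average
            for r, p in zip(running, model.parameters()):
                r.mul_(0.5).add_(0.5 * p)
    return loss.item()

if __name__ == "__main__":
    # Toy stand-in for sequential MNIST: 784-step, 1-feature sequences.
    model = LSTMClassifier()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(16, 784, 1)
    y = torch.randint(0, 10, (16,))
    print(fptt_partitioned_step(model, opt, x, y))
```

With K = 8 partitions, each backward pass stores activations for roughly 1/8 of the 784 time steps, which mirrors the memory-versus-latency trade-off reported in the abstract.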
