Open Access
Journal of Low Power Electronics and Applications, volume 15, issue 1, article 15

Hardware/Software Co-Design Optimization for Training Recurrent Neural Networks at the Edge

Yicheng Zhang 1, 2
Bojian Yin 2
Manil Dev Gomony 2
Henk Corporaal 2
Carsten Trinitis 1
Federico Corradi 2
1 Computer Architecture and Operating Systems (CAOS), Technical University of Munich (TUM), Bildungscampus 2, 74076 Heilbronn, Germany
2 Electronic Systems, Eindhoven University of Technology (TU/e), Flux, Groene Loper 19, 5612 AP Eindhoven, The Netherlands
Publication type: Journal Article
Publication date: 2025-03-11
Scimago quartile: Q3
SJR: 0.375
CiteScore: 3.6
Impact factor: 1.6
ISSN: 2079-9268
Abstract

Edge devices execute pre-trained Artificial Intelligence (AI) models optimized on large Graphics Processing Units (GPUs); however, they frequently require fine-tuning when deployed in the real world. This fine-tuning, referred to as edge learning, is essential for personalized tasks such as speech and gesture recognition, which often necessitate the use of recurrent neural networks (RNNs). However, training RNNs on edge devices presents major challenges due to limited memory and computing resources. In this study, we propose a system for RNN training through sequence partitioning using the Forward Propagation Through Time (FPTT) training method, thereby enabling edge learning. Our optimized hardware/software co-design for FPTT represents a novel contribution in this domain. This research demonstrates the viability of FPTT for fine-tuning real-world applications by implementing a complete computational framework for training Long Short-Term Memory (LSTM) networks utilizing FPTT. Moreover, this work incorporates the optimization and exploration of a scalable digital hardware architecture using an open-source hardware-design framework named Chipyard, and its implementation on a Field-Programmable Gate Array (FPGA) for cycle-accurate verification. The empirical results demonstrate that partitioned training on the proposed architecture enables an 8.2-fold reduction in memory usage with only a 0.2× increase in latency for small-batch sequential MNIST (S-MNIST) compared to traditional non-partitioned training.
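To make the memory argument concrete, the sketch below illustrates the general idea of FPTT-style partitioned training of an LSTM classifier: the input sequence is split into K partitions, the hidden state is detached between partitions so only one partition's activations must be buffered for backpropagation, and each partial update is stabilized by a proximal term toward a running weight average. This is a minimal illustrative sketch assuming PyTorch; the partition count, regularization strength, running-average rule, and toy data are placeholders and simplifications, not the paper's hardware/software co-design or its exact FPTT update.

```python
# Hedged sketch of FPTT-style partitioned LSTM training (assumed PyTorch API).
# K, alpha, the running-average rule, and the toy batch are illustrative only.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, in_dim=1, hidden=128, classes=10):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, classes)

    def forward(self, x, state=None):
        out, state = self.lstm(x, state)          # out: (batch, time, hidden)
        return self.head(out[:, -1]), state       # classify from last step

def fptt_partitioned_step(model, opt, x, y, K=8, alpha=0.5):
    """Train on one batch by splitting the sequence into K partitions.

    After each partition the hidden state is detached, so backpropagation only
    spans that partition's activations (the source of the memory savings), and
    the loss carries an FPTT-style proximal term toward a running weight
    average to keep the sequence of partial updates consistent.
    """
    loss_fn = nn.CrossEntropyLoss()
    running = [p.detach().clone() for p in model.parameters()]
    state = None
    for chunk in torch.chunk(x, K, dim=1):        # split along the time axis
        opt.zero_grad()
        logits, state = model(chunk, state)
        reg = sum(((p - r) ** 2).sum()
                  for p, r in zip(model.parameters(), running))
        loss = loss_fn(logits, y) + 0.5 * alpha * reg
        loss.backward()
        opt.step()
        state = tuple(s.detach() for s in state)  # truncate the graph here
        with torch.no_grad():                     # simplified running average
            for r, p in zip(running, model.parameters()):
                r.mul_(0.5).add_(0.5 * p)
    return loss.item()

if __name__ == "__main__":
    # Toy stand-in for sequential MNIST: 784-step, 1-feature sequences.
    model = LSTMClassifier()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(16, 784, 1)
    y = torch.randint(0, 10, (16,))
    print(fptt_partitioned_step(model, opt, x, y))
```

With K = 8 partitions, each backward pass stores activations for roughly 1/8 of the 784 time steps, which mirrors the memory-versus-latency trade-off reported in the abstract.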
