Hardware/Software Co-Design Optimization for Training Recurrent Neural Networks at the Edge
Edge devices execute pre-trained Artificial Intelligence (AI) models optimized on large Graphics Processing Units (GPUs); however, these models frequently require fine-tuning once deployed in the real world. This fine-tuning, referred to as edge learning, is essential for personalized tasks such as speech and gesture recognition, which often rely on recurrent neural networks (RNNs). Training RNNs on edge devices, however, poses major challenges due to limited memory and computing resources. In this study, we propose a system that enables edge learning by training RNNs on partitioned sequences with the Forward Propagation Through Time (FPTT) training method. Our optimized hardware/software co-design for FPTT is, to our knowledge, a novel contribution in this domain. This research demonstrates the viability of FPTT for fine-tuning in real-world applications by implementing a complete computational framework for training Long Short-Term Memory (LSTM) networks with FPTT. Moreover, this work optimizes and explores a scalable digital hardware architecture using the open-source hardware-design framework Chipyard, and implements it on a Field-Programmable Gate Array (FPGA) for cycle-accurate verification. The empirical results demonstrate that, compared to traditional non-partitioned training, partitioned training on the proposed architecture reduces memory usage by a factor of 8.2 with only a 20% increase in latency on small-batch sequential MNIST (S-MNIST).
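As a concrete illustration of the partitioned-training idea (not the paper's actual implementation), the following is a minimal PyTorch sketch of FPTT-style LSTM training, assuming a generic sequence-classification setup. The names alpha and num_partitions, the toy data shapes, and the placement of the classification loss at the end of each partition are illustrative assumptions, and the running-average update shown is a simplified form of FPTT's dynamic regularizer.

    # Hypothetical sketch: FPTT-style partitioned training of an LSTM classifier.
    import torch
    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        def __init__(self, input_size=1, hidden_size=128, num_classes=10):
            super().__init__()
            self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, num_classes)

        def forward(self, x, state=None):
            out, state = self.lstm(x, state)
            return self.fc(out[:, -1]), state  # classify from the last time step

    def fptt_train_step(model, x, y, optimizer, alpha=0.1, num_partitions=10):
        """One FPTT-style step: split the sequence into K partitions and update
        the weights after each partition instead of once per full sequence, so
        activations only need to be stored for roughly T/K time steps."""
        criterion = nn.CrossEntropyLoss()
        # Running average of the parameters used by the dynamic regularizer.
        w_bar = [p.detach().clone() for p in model.parameters()]
        state = None
        for chunk in torch.chunk(x, num_partitions, dim=1):  # split time axis
            optimizer.zero_grad()
            logits, state = model(chunk, state)
            # Detach the state so backpropagation stays within this partition.
            state = tuple(s.detach() for s in state)
            task_loss = criterion(logits, y)
            # Regularizer keeping the weights close to their running average.
            reg = sum(((p - wb) ** 2).sum()
                      for p, wb in zip(model.parameters(), w_bar))
            loss = task_loss + 0.5 * alpha * reg
            loss.backward()
            optimizer.step()
            # Simplified running-average update:
            # w_bar <- (w_bar + W) / 2 - grad / (2 * alpha).
            with torch.no_grad():
                for p, wb in zip(model.parameters(), w_bar):
                    wb.mul_(0.5).add_(0.5 * p).sub_(p.grad / (2 * alpha))
        return task_loss.item()

    # Toy usage on random data shaped like sequential MNIST (784 time steps).
    model = LSTMClassifier()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(4, 784, 1)        # (batch, time, features)
    y = torch.randint(0, 10, (4,))    # class labels
    print(fptt_train_step(model, x, y, opt))

Because each backward pass in this sketch covers only one partition, peak activation memory scales with the partition length rather than the full sequence length, which is the property the proposed hardware architecture exploits.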