ACM Computing Surveys, volume 57, issue 8, pages 1-35

A Review on Edge Large Language Models: Design, Execution, and Applications

Publication type: Journal Article
Publication date: 2025-03-23
Scimago: Q1
SJR: 6.280
CiteScore: 33.2
Impact factor: 23.8
ISSN: 0360-0300, 1557-7341
Abstract

Large language models (LLMs) have revolutionized natural language processing with their exceptional understanding, synthesizing, and reasoning capabilities. However, deploying LLMs on resource-constrained edge devices presents significant challenges due to computational limitations, memory constraints, and edge hardware heterogeneity. This survey provides a comprehensive overview of recent advancements in edge LLMs, covering the entire lifecycle — from resource-efficient model design and pre-deployment strategies to runtime inference optimizations. It also explores on-device applications across various domains. By synthesizing state-of-the-art techniques and identifying future research directions, this survey bridges the gap between the immense potential of LLMs and the constraints of edge computing.
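
The pre-deployment compression strategies the survey covers can be made concrete with a small example. Below is a minimal, illustrative sketch (not taken from the paper) of symmetric per-channel INT8 weight quantization, a common way to shrink LLM weights roughly 4x before shipping them to an edge device; the function names and matrix sizes are assumptions for illustration only.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-output-channel INT8 quantization of a weight matrix.

    Returns the quantized weights and the per-channel scales needed to
    dequantize (w ~ q * scale) at inference time.
    """
    # One scale per output channel, chosen so the largest magnitude maps to 127.
    max_abs = np.abs(weights).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4096, 4096).astype(np.float32)  # one LLM projection matrix
    q, s = quantize_int8(w)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"size ratio vs FP32: {q.nbytes / w.nbytes:.2f}, mean abs error: {err:.5f}")
```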

Wang R., Jing Y., Gu C., He S., Chen J.
IEEE Internet of Things Journal scimago Q1 wos Q1
2025-02-15 citations by CoLab: 2
Sridharan S., Stevens J.R., Roy K., Raghunathan A.
2023-08-01 citations by CoLab: 14
Jiang S., Lin Z., Li Y., Shu Y., Liu Y.
2021-10-25 citations by CoLab: 66 Abstract  
Object detection is a fundamental building block of video analytics applications. While Neural Network (NN)-based object detection models have shown excellent accuracy on benchmark datasets, they are not well suited to high-resolution image inference on resource-constrained edge devices. Common approaches, including down-sampling inputs and scaling up neural networks, fall short of adapting to video content changes and varying latency requirements. This paper presents Remix, a flexible framework for high-resolution object detection on edge devices. Remix takes a latency budget as input and produces an image partition and model execution plan that runs off-the-shelf neural networks on non-uniformly partitioned image blocks. As a result, it maximizes overall detection accuracy by allocating different amounts of compute to different areas of an image. We evaluate Remix on a public dataset as well as real-world videos we collected. Experimental results show that Remix can either improve detection accuracy by 18%-120% for a given latency budget, or achieve up to an 8.1× inference speedup with accuracy on par with state-of-the-art NNs.
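
Remix's actual planner is more sophisticated, but the core idea of spending a latency budget non-uniformly across image blocks can be sketched as a greedy assignment. The code below is an illustrative approximation, not the paper's algorithm; the `Detector` profiles and per-block density estimates are hypothetical inputs that a real system would obtain by offline profiling and lightweight content analysis.

```python
from dataclasses import dataclass

@dataclass
class Detector:
    name: str
    latency_ms: float  # per-block inference cost, profiled offline
    accuracy: float    # relative accuracy on the target hardware

def plan_execution(blocks, detectors, budget_ms):
    """Greedy sketch of a Remix-style plan: denser image blocks get the
    slower, more accurate detectors first, while reserving enough budget
    so every remaining block still gets at least the cheapest detector."""
    by_accuracy = sorted(detectors, key=lambda d: d.accuracy, reverse=True)
    cheapest = min(detectors, key=lambda d: d.latency_ms)
    ordered = sorted(blocks, key=lambda b: b[1], reverse=True)
    plan, spent = {}, 0.0
    for i, (block_id, _) in enumerate(ordered):
        # Budget reserved to run the cheapest model on all remaining blocks.
        reserve = cheapest.latency_ms * (len(ordered) - i - 1)
        for det in by_accuracy:
            if spent + det.latency_ms + reserve <= budget_ms:
                plan[block_id] = det.name
                spent += det.latency_ms
                break
        else:
            plan[block_id] = cheapest.name  # over budget: degrade gracefully
            spent += cheapest.latency_ms
    return plan, spent

if __name__ == "__main__":
    detectors = [Detector("det-large", 40.0, 0.9), Detector("det-small", 10.0, 0.7)]
    blocks = [("top-left", 0.80), ("bottom-left", 0.50),
              ("top-right", 0.10), ("bottom-right", 0.05)]
    plan, spent = plan_execution(blocks, detectors, budget_ms=100.0)
    print(plan, f"-> {spent:.0f} ms")
```

Under this toy profile, the two densest blocks receive the large detector and the sparse ones fall back to the small detector, exactly exhausting the 100 ms budget.
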
Lalapura V.S., Amudha J., Satheesh H.S.
ACM Computing Surveys scimago Q1 wos Q1
2021-05-24 citations by CoLab: 54 Abstract  
Recurrent Neural Networks are ubiquitous and pervasive in many artificial intelligence applications such as speech recognition, predictive healthcare, and creative art. Although they provide accurate, superior solutions, they pose a massive training challenge. The current expansion of the IoT demands that intelligent models be deployed at the edge, precisely to handle increasing model sizes and complex network architectures. Design efforts aimed at greater performance have had adverse effects on portability to edge devices with real-time constraints on memory, latency, and energy. This article provides detailed insight into the compression techniques widely disseminated in the deep learning regime, which have become key to mapping powerful RNNs onto resource-constrained devices. While RNN compression is the survey's main focus, it also highlights challenges encountered during training, since the training procedure directly influences both model performance and compressibility. Recent advances that overcome these training challenges are discussed, along with their strengths and drawbacks. In short, the survey covers a three-step process: architecture selection, an efficient training procedure, and a compression technique suitable for a resource-constrained environment. It thus serves as a comprehensive guide a developer can adapt for a time-series problem and an RNN solution at the edge.
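
As a concrete instance of the compression techniques such surveys cover, here is a minimal magnitude-pruning sketch applied to an LSTM weight matrix. It illustrates the general technique only, not code from the article; the layer sizes are arbitrary.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until roughly `sparsity`
    fraction of the matrix is zero.

    Pruned matrices can then be stored in sparse formats (e.g. CSR) so
    that memory use and multiply-accumulate counts shrink with sparsity.
    """
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

if __name__ == "__main__":
    # Stacked hidden-to-hidden weights for the four LSTM gates (sizes illustrative).
    w_hh = np.random.randn(4 * 256, 256).astype(np.float32)
    p = magnitude_prune(w_hh, sparsity=0.9)
    print(f"nonzero fraction after pruning: {(p != 0).mean():.3f}")
```
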
Wang X., Tang Z., Guo J., Meng T., Wang C., Wang T., Jia W.
ACM Computing Surveys scimago Q1 wos Q1
2025-04-04 citations by CoLab: 0 Abstract  
The rapid advancement of artificial intelligence (AI) technologies has led to an increasing deployment of AI models on edge and terminal devices, driven by the proliferation of the Internet of Things (IoT) and the need for real-time data processing. This survey comprehensively explores the current state, technical challenges, and future trends of on-device AI models. We define on-device AI models as those designed to perform local data processing and inference, emphasizing their characteristics such as real-time performance, resource constraints, and enhanced data privacy. The survey is structured around key themes, including the fundamental concepts of AI models, application scenarios across various domains, and the technical challenges faced in edge environments. We also discuss optimization and implementation strategies, such as data preprocessing, model compression, and hardware acceleration, which are essential for effective deployment. Furthermore, we examine the impact of emerging technologies, including edge computing and foundation models, on the evolution of on-device AI models. By providing a structured overview of the challenges, solutions, and future directions, this survey aims to facilitate further research and application of on-device AI, ultimately contributing to the advancement of intelligent systems in everyday life.
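
One of the compression steps this survey describes, post-training quantization, is a one-line operation in common frameworks. The sketch below uses PyTorch's dynamic quantization on a toy stand-in model; the model itself is illustrative and not drawn from the survey.

```python
import torch
import torch.nn as nn

# A toy stand-in for a small on-device model; sizes are illustrative.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
).eval()

# Post-training dynamic quantization: Linear weights are stored as INT8
# and activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])
```
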
Giorgetti G., Pau D.P.
2025-03-06 citations by CoLab: 0 Abstract  
Generative AI (GenAI) models are designed to produce realistic and natural data, such as images, audio, or written text. Due to their high computational and memory demands, these models traditionally run on powerful remote compute servers. However, there is growing interest in deploying GenAI models at the edge, on resource-constrained embedded devices. Since 2018, the TinyML community has shown that running fixed-topology AI models on edge devices offers several benefits, including independence from internet connectivity, low-latency processing, and enhanced privacy. Nevertheless, deploying resource-hungry GenAI models on embedded devices is challenging, since such devices have limited computational, memory, and energy resources. This review evaluates the progress made to date in Edge GenAI, an emerging research area within the broader domain of EdgeAI that focuses on bringing GenAI to edge devices. Papers released between 2022 and 2024 that address the design and deployment of GenAI models on embedded devices are identified and described, and their approaches and results are compared. This manuscript contributes to understanding the ongoing transition from TinyML to Edge GenAI and provides the AI research community with valuable insights into this emerging, impactful, and under-explored field.
