ACM Transactions on Modeling and Performance Evaluation of Computing Systems, volume 3, issue 2, pages 1-26

RAPL in Action

Publication type: Journal Article
Publication date: 2018-03-22
Scimago: Q2
SJR: 0.525
CiteScore: 2.1
Impact factor: 0.7
ISSN: 2376-3639, 2376-3647
Subject areas: Computer Science (miscellaneous); Hardware and Architecture; Information Systems; Computer Networks and Communications; Software; Safety, Risk, Reliability and Quality; Media Technology
Abstract

To improve energy efficiency and comply with power budgets, it is important to be able to measure the power consumption of cloud computing servers. Intel’s Running Average Power Limit (RAPL) interface is a powerful tool for this purpose. RAPL provides power-limiting features and accurate energy readings for CPUs and DRAM, which are easily accessible through different interfaces on large distributed computing systems. Since its introduction, RAPL has been used extensively in power measurement and modeling. However, the advantages and disadvantages of RAPL have not been well investigated yet. To fill this gap, we conduct a series of experiments to disclose the underlying strengths and weaknesses of the RAPL interface, using both customized microbenchmarks and three well-known application-level benchmarks: Stream, Stress-ng, and ParFullCMS. Moreover, to make the analysis as realistic as possible, we leverage two production-level power measurement datasets from Taito, a supercomputing cluster of the Finnish Center of Scientific Computing, and also replicate our experiments on Amazon EC2. Our results illustrate different aspects of RAPL and document the findings through comprehensive analysis. Our observations reveal that RAPL readings are highly correlated with plug power, sufficiently accurate, and incur negligible performance overhead. Experimental results suggest RAPL can be a very useful tool for measuring and monitoring the energy consumption of servers without deploying any complex power meters. We also show that there are still some open issues, such as driver support, non-atomicity of register updates, and unpredictable timings, that might weaken the usability of RAPL in certain scenarios. For such scenarios, we pinpoint solutions and workarounds.
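
To make the energy-reading side of the interface concrete, the sketch below samples the RAPL package energy counter through the Linux powercap sysfs files, one of the access paths the abstract alludes to. It is a minimal sketch, assuming an Intel system with the intel_rapl driver loaded and root privileges; the zone path refers to package 0 and may differ on other machines.

import time

PKG_ZONE = "/sys/class/powercap/intel-rapl:0"          # package-0 RAPL domain
ENERGY_FILE = PKG_ZONE + "/energy_uj"                  # cumulative energy in microjoules
MAX_RANGE_FILE = PKG_ZONE + "/max_energy_range_uj"     # counter wrap-around point

def read_uj(path):
    with open(path) as f:
        return int(f.read().strip())

def package_power_watts(interval_s=1.0):
    """Average package power over interval_s, handling counter wrap-around."""
    max_range = read_uj(MAX_RANGE_FILE)
    before = read_uj(ENERGY_FILE)
    time.sleep(interval_s)
    after = read_uj(ENERGY_FILE)
    delta_uj = (after - before) % max_range   # RAPL counters wrap, so take the modulo
    return delta_uj / 1e6 / interval_s

if __name__ == "__main__":
    print(f"Package power: {package_power_watts():.1f} W")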

Desrochers S., Paradis C., Weaver V.M.
2016-10-03 citations by CoLab: 75 Abstract  
Recent Intel processors support the Running Average Power Limit (RAPL) interface, which among other things provides estimated energy measurements for the CPUs, integrated GPU, and DRAM. These measurements are easily accessible by the user and can be gathered by a wide variety of tools, including the Linux perf_event interface. This allows unprecedentedly easy access to energy information when designing and optimizing energy-aware code.
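
Since the abstract highlights the Linux perf_event path, here is a small hedged sketch that samples package energy by invoking perf from Python. It assumes a kernel whose perf build exposes the RAPL PMU as the power/energy-pkg/ event (visible via `perf list`) and sufficient privileges to run system-wide perf stat.

import subprocess

def rapl_energy_via_perf(seconds=1):
    # System-wide measurement of package energy while `sleep` runs.
    cmd = ["perf", "stat", "-a", "-e", "power/energy-pkg/", "sleep", str(seconds)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # perf prints its statistics on stderr; the energy line is reported in Joules.
    return result.stderr

if __name__ == "__main__":
    print(rapl_energy_via_perf())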
Khan K.N., Ou Z., Hirki M., Nurminen J.K., Niemi T.
2016-08-08 citations by CoLab: 10 Abstract  
Full-system power drawn from the wall socket is important for understanding and budgeting the power consumption of large-scale data centers. Measuring full-system power, however, requires extra instrumentation with external physical devices, which is not only cumbersome but also expensive and time consuming. To tackle this problem, in this paper we propose to model wall socket power from processor package power obtained from the Running Average Power Limit (RAPL) interface, which is available on the latest Intel processors. Our experimental results demonstrate a strong correlation between RAPL package power and wall socket power consumption. Based on these observations, we propose an empirical power model to predict the full-system power. We verify the model using multiple synthetic benchmarks (Stress-ng, STREAM), a high-energy physics benchmark (ParFullCMS), and non-trivial application benchmarks (Parsec). Experimental results show that the prediction model achieves good accuracy, with a maximum error rate of 5.6%.
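
The kind of empirical model the abstract describes can be illustrated with a simple linear fit of wall power against RAPL package power. The sketch below is purely illustrative: the sample values and the resulting coefficients are made up, not the authors' measurements.

import numpy as np

# (RAPL package power in W, measured wall power in W) -- hypothetical samples
pkg  = np.array([35.0, 60.0, 85.0, 110.0, 140.0])
wall = np.array([90.0, 118.0, 146.0, 175.0, 210.0])

slope, intercept = np.polyfit(pkg, wall, 1)   # least-squares linear fit

def predict_wall_power(pkg_power_w):
    return slope * pkg_power_w + intercept

print(f"wall ~ {slope:.2f} * pkg + {intercept:.1f}")
print(f"Predicted wall power at 100 W package: {predict_wall_power(100):.1f} W")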
Kelley J., Stewart C., Tiwari D., Gupta S.
2016-07-01 citations by CoLab: 8 Abstract  
State of the art schedulers use workload profiles to help determine which resources to allocate. Traditionally, threads execute on every available core, but increasingly, too much power is consumed by using every core. Because peak power can occur at any point in time during the workload, workloads are commonly profiled to completion multiple times in an offline architecture. In practice, this process is too time consuming for online profiling and alternate approaches are used, such as profiling for k% of the workload or predicting peak power from similar workloads. We studied the effectiveness of these methods for core scaling. Core scaling is a technique which executes threads on a subset of available cores, allowing unused cores to enter low-power operating modes. Schedulers can use core scaling to reduce peak power, but must have an accurate profile across potential settings for number of active cores in order to know when to make this decision. We devised an accurate, fast and adaptive approach to profile peak power under core scaling. Our approach uses short profiling runs to collect instantaneous power traces for a workload under each core scaling setting. The duration of profiling varies for each power trace and depends on the desired accuracy. Compared to k% profiling of peak power, our approach reduced the profiling duration by up to 93% while keeping accuracy within 3%.
Zhang H., Hoffmann H.
2016-03-25 citations by CoLab: 80 Abstract  
Power and thermal dissipation constrain multicore performance scaling. Modern processors are built such that they could sustain damaging levels of power dissipation, creating a need for systems that can implement processor power caps. A particular challenge is developing systems that can maximize performance within a power cap, and approaches have been proposed in both software and hardware. Software approaches are flexible, allowing multiple hardware resources to be coordinated for maximum performance, but software is slow, requiring a long time to converge to the power target. In contrast, hardware power capping quickly converges to the power cap, but only manages voltage and frequency, limiting its potential performance. In this work we propose PUPiL, a hybrid software/hardware power capping system. Unlike previous approaches, PUPiL combines hardware's fast reaction time with software's flexibility. We implement PUPiL on a real Linux/x86 platform and compare it to Intel's commercial hardware power capping system for both single- and multi-application workloads. We find PUPiL provides the same reaction time as Intel's hardware with significantly higher performance. On average, PUPiL outperforms hardware by 1.18-2.4x depending on workload and power target. Thus, PUPiL provides a promising way to enforce power caps with greater performance than current state-of-the-art hardware-only approaches.
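
The hardware power-capping building block that such systems rely on can be exercised directly from Linux. Below is a minimal sketch that writes a package power limit through the powercap sysfs interface; it assumes the intel_rapl driver is loaded, root privileges, and that constraint index 0 is the long-term limit (the usual layout, but not guaranteed on every system).

PKG_ZONE = "/sys/class/powercap/intel-rapl:0"

def set_package_power_cap(watts, time_window_s=1.0):
    with open(PKG_ZONE + "/constraint_0_power_limit_uw", "w") as f:
        f.write(str(int(watts * 1e6)))            # limit is specified in microwatts
    with open(PKG_ZONE + "/constraint_0_time_window_us", "w") as f:
        f.write(str(int(time_window_s * 1e6)))    # averaging window in microseconds (may be read-only on some platforms)
    with open(PKG_ZONE + "/enabled", "w") as f:
        f.write("1")                              # enable enforcement of the limit

if __name__ == "__main__":
    set_package_power_cap(50.0)   # cap package power at 50 W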
Ilsche T., Hackenberg D., Graul S., Schone R., Schuchart J.
2015-12-01 citations by CoLab: 21 Abstract  
Energy efficiency is a key optimization goal for software and hardware in the High Performance Computing (HPC) domain. This necessitates sophisticated power measurement capabilities that are characterized by the key criteria (i) high sampling rates, (ii) measurement of individual components, (iii) well-defined accuracy, and (iv) high scalability. In this paper, we tackle the first three of these goals and describe the instrumentation of two high-end compute nodes with three different current measurement techniques: (i) Hall effect sensors, (ii) measuring shunts in extension cables and riser cards, and (iii) tapping into the voltage regulators. The resulting measurement data for components such as sockets, PCIe cards, and DRAM DIMMs is digitized at sampling rates from 7 kSa/s up to 500 kSa/s, enabling a fine-grained correlation between power usage and application events. The accuracy of all elements in the measurement infrastructure is studied carefully. Moreover, potential pitfalls in building custom power instrumentation are discussed. We raise the awareness for the properties of power measurements, as disregarding existing inaccuracies can lead to invalid conclusions regarding energy efficiency.
Huang S., Lang M., Pakin S., Fu S.
2015-11-15 citations by CoLab: 11 Abstract  
The recently introduced Intel Haswell processors implement major changes compared to their predecessors, especially with respect to power management. Haswell processors are used in the new-generation DOE NNSA tri-lab supercomputer, Trinity, hosted at Los Alamos National Laboratory. In this paper we measure and analyze a number of power-related parameters of Haswell that are of great importance for the energy consumption of applications. We study three HPC benchmarks (HPL, STREAM, FIRESTARTER) and a hydrodynamics application (CLAMR), which are representative of workloads stressing different components of computers. Our experimental results show that real-time on-board power monitoring causes substantial power use if no optimization is performed; adapting P-states provides a cost-effective way to improve the power-performance of applications; enabling hyperthreading can save significant energy, up to 96.3% for compute-bound applications; and HPC applications should employ differentiated core-affinity strategies in order to achieve the maximum power-performance. Moreover, we study the imbalance between sockets on a server in their power and energy use, and propose approaches to mitigate such imbalance.
Hackenberg D., Schone R., Ilsche T., Molka D., Schuchart J., Geyer R.
2015-05-01 citations by CoLab: 149 Abstract  
The recently introduced Intel Xeon E5-1600 v3 and E5-2600 v3 series processors -- codenamed Haswell-EP -- implement major changes compared to their predecessors. Among these changes are integrated voltage regulators that enable individual voltages and frequencies for every core. In this paper we analyze a number of consequences of this development that are of utmost importance for energy efficiency optimization strategies such as dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT). This includes the enhanced RAPL implementation and its improved accuracy as it moves from modeling to actual measurement. Another fundamental change is that every clock speed above AVX frequency -- including nominal frequency -- is opportunistic and unreliable, which vastly decreases performance predictability with potential effects on scalability. Moreover, we characterize significantly changed p-state transition behavior, and determine crucial memory performance data.
Manousakis I., Zakkak F.S., Pratikakis P., Nikolopoulos D.S.
2015-03-01 citations by CoLab: 8 Abstract  
We present TProf, an energy profiling tool for OpenMP-like task-parallel programs. To compute the energy consumed by each task in a parallel application, TProf dynamically traces the parallel execution and uses a novel technique to estimate the per-task energy consumption. To achieve this estimation, TProf apportions the total processor energy among cores and overcomes the limitation of current works, which would otherwise make parallel accounting impossible to achieve. We demonstrate the value of TProf by characterizing a set of task-parallel programs, where we find that data locality, memory access patterns and task working sets are responsible for significant variance in energy consumption between seemingly homogeneous tasks. In addition, we identify opportunities for fine-grain energy optimization by applying per-task Dynamic Voltage and Frequency Scaling (DVFS).
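
One simple apportioning scheme in the spirit of what the abstract describes (not necessarily TProf's actual technique) is to split a package energy delta among tasks in proportion to the CPU time each task consumed during the measurement interval, as in this hedged sketch with invented numbers:

def apportion_energy(package_energy_j, task_cpu_times):
    """task_cpu_times: dict mapping task id -> CPU seconds spent in the interval."""
    total = sum(task_cpu_times.values())
    if total == 0:
        return {task: 0.0 for task in task_cpu_times}
    return {task: package_energy_j * t / total for task, t in task_cpu_times.items()}

# Hypothetical interval: 42 J consumed by the package while three tasks ran.
print(apportion_energy(42.0, {"task_a": 0.8, "task_b": 0.8, "task_c": 0.4}))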
Diouri M.E., Dolz M.F., Glück O., Lefèvre L., Alonso P., Catalán S., Mayo R., Quintana-Ortí E.S.
2014-06-01 citations by CoLab: 11 Abstract  
Large-scale distributed systems (e.g., datacenters, HPC systems, clouds, large-scale networks, etc.) consume and will consume enormous amounts of energy. Therefore, accurately monitoring the power dissipation and energy consumption of these systems has become unavoidable. The main novelty of this contribution is the analysis and evaluation of different external and internal power monitoring devices tested using two different computing systems, a server and a desktop machine. Furthermore, we provide experimental results for a variety of benchmarks which intensively exercise the main components (CPU, memory, HDDs, and NICs) of the target platforms to validate the accuracy of the equipment in terms of power dissipation and energy consumption. We also evaluate three different power measurement interfaces available on current architecture generations. Thanks to their high sampling rate and to the different measured lines, the internal wattmeters allow an improved visualization of some power fluctuations. However, a high sampling rate is not always necessary to understand the evolution of the power consumption during the execution of a benchmark.
Mobius C., Dargie W., Schill A.
2014-06-01 citations by CoLab: 119 Abstract  
The power consumption of presently available Internet servers and data centers is not proportional to the work they accomplish. The scientific community is attempting to address this problem in a number of ways, for example, by employing dynamic voltage and frequency scaling, selectively switching off idle or underutilized servers, and employing energy-aware task scheduling. Central to these approaches is the accurate estimation of the power consumption of the various subsystems of a server, particularly, the processor. We distinguish between power consumption measurement techniques and power consumption estimation models. The techniques refer to the art of instrumenting a system to measure its actual power consumption whereas the estimation models deal with indirect evidences (such as information pertaining to CPU utilization or events captured by hardware performance counters) to reason about the power consumption of a system under consideration. The paper provides a comprehensive survey of existing or proposed approaches to estimate the power consumption of single-core as well as multicore processors, virtual machines, and an entire server.
Servat H., Llort G., Giménez J., Labarta J.
2013-12-13 citations by CoLab: 4
Patki T., Lowenthal D.K., Rountree B., Schulz M., de Supinski B.R.
2013-06-10 citations by CoLab: 103 Abstract  
Most recent research in power-aware supercomputing has focused on making individual nodes more efficient and measuring the results in terms of flops per watt. While this work is vital in order to reach exascale computing at 20 megawatts, there has been a dearth of work that explores efficiency at the whole system level. Traditional approaches in supercomputer design use worst-case power provisioning: the total power allocated to the system is determined by the maximum power draw possible per node. In a world where power is plentiful and nodes are scarce, this solution is optimal. However, as power becomes the limiting factor in supercomputer design, worst-case provisioning becomes a drag on performance.
Subramaniam B., Feng W.
2013-04-21 citations by CoLab: 30 Abstract  
Massive data centers housing thousands of computing nodes have become commonplace in enterprise computing, and the power consumption of such data centers is growing at an unprecedented rate. Adding to the problem is the inability of the servers to exhibit energy proportionality, i.e., provide energy-efficient execution under all levels of utilization, which diminishes the overall energy efficiency of the data center. It is imperative that we realize effective strategies to control the power consumption of the server and improve the energy efficiency of data centers. With the advent of Intel Sandy Bridge processors, we have the ability to specify a limit on power consumption during runtime, which creates opportunities to design new power-management techniques for enterprise workloads and make the systems that they run on more energy proportional. In this paper, we investigate whether it is possible to achieve energy proportionality for an enterprise-class server workload, namely SPECpower_ssj2008 benchmark, by using Intel's Running Average Power Limit (RAPL) interfaces. First, we analyze the power consumption and characterize the instantaneous power profile of the SPECpower benchmark within different subsystems using the on-chip energy meters exposed via the RAPL interfaces. We then analyze the impact of RAPL power limiting on the performance, per-transaction response time, power consumption, and energy efficiency of the benchmark under different load levels. Our observations and results shed light on the efficacy of the RAPL interfaces and provide guidance for designing power-management techniques for enterprise-class workloads.
Hackenberg D., Ilsche T., Schone R., Molka D., Schmidt M., Nagel W.E.
2013-04-01 citations by CoLab: 97 Abstract  
Energy efficiency is of steadily growing importance in virtually all areas from mobile to high performance computing. Therefore, lots of research projects focus on this topic and strongly rely on power measurements from their test platforms. The need for finer-grained measurement data -- both in terms of temporal and spatial resolution (component breakdown) -- often collides with very rudimentary measurement setups that rely, e.g., on non-professional power meters, IPMI-based platform data, or model-based interfaces such as RAPL or APM. This paper presents an in-depth study of several different AC and DC measurement methodologies as well as model approaches on test systems with the latest processor generations from both Intel and AMD. We analyze the most important aspects such as signal quality, time resolution, accuracy, and measurement overhead, and use a calibrated, professional power analyzer as our reference.
Dongarra J., Ltaief H., Luszczek P., Weaver V.M.
2012-11-01 citations by CoLab: 34 Abstract  
We propose to study the impact on the energy footprint of two advanced algorithmic strategies in the context of high-performance dense linear algebra libraries: (1) mixed-precision algorithms with iterative refinement make it possible to run at the peak performance of single-precision floating-point arithmetic while achieving double-precision accuracy, and (2) the tree reduction technique exposes more parallelism when factorizing tall and skinny matrices for solving overdetermined systems of linear equations or calculating the singular value decomposition. Integrated within the PLASMA library using tile algorithms, which will eventually supersede the block algorithms from LAPACK, both strategies further excel in performance in the presence of a dynamic task scheduler while targeting multicore architectures. Energy consumption measurements are reported along with parallel performance numbers on a dual-socket quad-core Intel Xeon as well as a quad-socket quad-core Intel Sandy Bridge chip, both providing component-based energy monitoring at all levels of the system, through the PowerPack framework and the Running Average Power Limit model, respectively.
Natori K., Fujimoto K., Shiraga A.
2025-04-03 citations by CoLab: 0 Abstract  
Abstract Services that require low latency are increasing, and edge computing that processes workloads in servers located geographically close to the user is being researched. To offload workloads from user devices to edge servers with low latency, packets must be forwarded with low latency on a general-purpose server, and performance-oriented methods are widely used at the expense of higher power consumption, including a busy-polling method in receiving packets, such as the Data Plane Development Kit (DPDK). However, in today’s large-scale services with many servers, even a slight increase in power consumption on each server results in wasting tremendous power. In this paper, we design and implement a packet-processing system on a general-purpose server that can achieve power saving while maintaining low latency. To avoid wasting power caused by busy polling, a receiving thread in the proposed system can sleep when no packet arrives and be woken up without delays by a hardware interrupt context of packet incoming. In addition, to enhance the power-saving effect of sleep, we design and implement a CPU idle control method that enables CPU cores used by receiving threads to enter an appropriate C-state in accordance with traffic load. We evaluate the proposed system in an environment that simulates a virtualized Radio Access Network (vRAN) system, which has strict latency requirements of network processing on a general-purpose server. The evaluation results demonstrate that the proposed system can reduce power consumption compared with a busy-polling system and the average latency degradation was only a few microseconds.
Cagigas-Muñiz D., Diaz-del-Rio F., Sevillano-Ramos J.L., Guisado-Lizar J.
2025-04-01 citations by CoLab: 0
Kabra M., Nadig R., Gupta H., Bera R., Frouzakis M., Arulchelvan V., Liang Y., Mao H., Sadrosadati M., Mutlu O.
2025-03-30 citations by CoLab: 0
Mercat A., Sainio J., Moan S.L., Herglotz C.
2025-03-01 citations by CoLab: 0
Jiang Y., Roy R.B., Kanakagiri R., Tiwari D.
2025-02-28 citations by CoLab: 0
Herglotz C., Katsenou A., Wang X., Kränzler M., Schien D.
IEEE Access scimago Q1 wos Q2 Open Access
2025-02-26 citations by CoLab: 0
Castiglione A., Loia V., Volpe A.
2025-02-16 citations by CoLab: 0 Abstract  
As the Noisy Intermediate-Scale Quantum era progresses, the threat of cyberattacks using large-scale quantum computers to decrypt TLS communication becomes feasible. Fortunately, multiple contributions from the cybersecurity community to the National Institute of Standards and Technology’s open call ensure the standardization of post-quantum algorithms that non-quantum devices can use to defend against such attacks. Various hardware and software implementations have been explored at each phase of the open call to identify potential threats and evaluate key performance metrics, such as CPU usage and RAM footprint. In this context, we design and propose a Power Monitoring Framework that enables monitoring the power constraints of an infrastructure supporting NIST’s digital signature and key establishment mechanisms. The proposed framework enables benchmarking of power consumption and related metrics in both classical and post-quantum real-world infrastructures, contributing to the exploration of the post-quantum era’s performance, requirements, and constraints.
Alqurashi F., AL-Hashimi M., Saleh M., Abulnaja O.
Computers scimago Q2 wos Q2 Open Access
2025-02-16 citations by CoLab: 0 PDF Abstract  
Spectre variants 1 and 2 pose grave security threats to dynamic branch predictors in modern CPUs. While extensive research has focused on mitigating these attacks, little attention has been given to their energy and power implications. This study presents an empirical analysis of how compiler-based Spectre mitigation techniques influence energy consumption. We collect fine-grained energy readings from an HPC-class CPU via embedded sensors, allowing us to quantify the trade-offs between security and power efficiency. By utilizing a standard suite of microbenchmarks, we evaluate the impact of Spectre mitigations across three widely used compilers, comparing them to a no-mitigation baseline. The results show that energy consumption varies significantly depending on the compiler and workload characteristics. Loop unrolling influences power consumption by altering branch distribution, while speculative execution, when unrestricted, plays a role in conserving energy. Since Spectre mitigations inherently limit speculative execution, they should be applied selectively to vulnerable code patterns to optimize both security and power efficiency. Unlike previous studies that primarily focus on security effectiveness, this work uniquely evaluates the energy costs associated with Spectre mitigations at the compiler level, offering insights for power-efficient security strategies. Our findings underscore the importance of tailoring mitigation techniques to application needs, balancing performance, energy consumption, and security. The study provides practical recommendations for compiler developers to build more secure and energy-efficient software.
Ostapenco V., Lefèvre L., Orgerie A., Fichel B.
2025-02-14 citations by CoLab: 0 Abstract  
Data centers are very energy-intensive facilities whose power provision is challenging and constrained by power bounds. In modern data centers, servers account for a significant portion of the total power consumption. In this context, the ability to limit the instant power consumption of an individual computing node is an important requirement. There are several energy and power capping techniques that can be used to limit compute node power consumption, such as Intel RAPL. Although it is nowadays mainly utilized for energy measurement, Intel RAPL (Running Average Power Limit) was originally designed for power limitation purposes. Some works use Intel RAPL for power limitation in a limited context without full knowledge of the inner workings of this technology and what is done behind the scenes to enforce the power constraint. Furthermore, Intel has not revealed any details about its internal implementation. It is unclear exactly how Intel RAPL technology operates and what effects it has on application performance and power consumption. In this work, we conduct a thorough analysis of Intel RAPL technology as a power capping leverage on a variety of heterogeneous nodes for a selection of CPU and memory intensive workloads. For this purpose, we first validate Intel RAPL power capping mechanism using a high-precision external power meter and investigate properties such as accuracy, power limit granularity, and settling time. Then, we attempt to determine which mechanisms are employed by RAPL to adjust power consumption.
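
One of the experiments the abstract describes, observing how quickly consumption settles after a new cap is applied, can be sketched by combining the capping and sampling snippets shown earlier. The paths, the 0.1 s sampling period, and the 60 W cap below are assumptions for illustration, not the authors' setup.

import time

PKG_ZONE = "/sys/class/powercap/intel-rapl:0"

def read_uj(name):
    with open(f"{PKG_ZONE}/{name}") as f:
        return int(f.read().strip())

def sample_power_after_cap(cap_watts, duration_s=5.0, period_s=0.1):
    # Apply the cap, then record a short power trace from the RAPL energy counter.
    with open(f"{PKG_ZONE}/constraint_0_power_limit_uw", "w") as f:
        f.write(str(int(cap_watts * 1e6)))
    samples, prev = [], read_uj("energy_uj")
    max_range = read_uj("max_energy_range_uj")
    for _ in range(int(duration_s / period_s)):
        time.sleep(period_s)
        cur = read_uj("energy_uj")
        samples.append(((cur - prev) % max_range) / 1e6 / period_s)  # average watts in this period
        prev = cur
    return samples

if __name__ == "__main__":
    trace = sample_power_after_cap(60.0)
    print(" ".join(f"{p:.0f}" for p in trace))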
Ayyappan B., Santhoshkumar G.
2025-02-13 citations by CoLab: 0 Abstract  
This study introduces EdgeInsight, a robust system designed for effective performance monitoring in edge computing environments. The system architecture is illustrated, based on an edge computing platform with System on Chip (SoC) technology. The SoC is equipped with integrated performance counters, enabling comprehensive monitoring and collection of performance parameters during computational workload execution. Computational workloads, tailored to specific use cases, operate within compute containers to address challenges related to library dependencies and facilitate seamless installation and migration. The core component of EdgeInsight, the lightweight agent, is deployed to track and monitor hardware and software events of the computational workload in execution. Despite limitations imposed by SoC characteristics, the agent traces hardware and operating system events. Modern SoCs, featuring built-in sensors and performance counters, offer insights into internal system status from a power-and-performance perspective, covering energy consumption, temperature, and CPU component efficiency. With modern SoCs accommodating multiple cores on a single socket, EdgeInsight emerges as a versatile tool for modeling performance, energy usage, and temperature behavior in the dynamic realm of edge computing.
De Moor F., Mognol M., Deltel C., Drezen E., Legriel J., Lavenier D.
2024-12-03 citations by CoLab: 0
Katsaros G.N., Filo M., Tafazolli R., Nikitopoulos K.
2024-12-01 citations by CoLab: 3
Rajput S., Widmayer T., Shang Z., Kechagia M., Sarro F., Sharma T.
2024-11-30 citations by CoLab: 0 Abstract  
With the increasing usage, scale, and complexity of Deep Learning (DL) models, their rapidly growing energy consumption has become a critical concern. Promoting green development and energy awareness at different granularities is the need of the hour to limit the carbon emissions of DL systems. However, the lack of standard and repeatable tools to accurately measure and optimize energy consumption at fine granularity (e.g., at the API level) hinders progress in this area. This paper introduces FECoM (Fine-grained Energy Consumption Meter), a framework for fine-grained DL energy consumption measurement. FECoM enables researchers and developers to profile DL APIs from an energy perspective. FECoM addresses the challenges of fine-grained energy measurement using static instrumentation while considering factors such as computational load and temperature stability. We assess FECoM's capability for fine-grained energy measurement on one of the most popular open-source DL frameworks, namely TensorFlow. Using FECoM, we also investigate the impact of parameter size and execution time on energy consumption, enriching our understanding of TensorFlow APIs' energy profiles. Furthermore, we elaborate on the considerations and challenges in designing and implementing a fine-grained energy measurement tool. This work will facilitate further advances in DL energy measurement and the development of energy-aware practices for DL systems.
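
The general idea of API-level energy metering can be illustrated with a decorator that records the RAPL package energy consumed by each call. This is a hedged sketch of the concept, not FECoM's actual implementation; it reuses the powercap sysfs files introduced above and typically requires root, and the matmul_example function is a hypothetical stand-in for a DL API call.

import functools

ENERGY_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"
MAX_RANGE_FILE = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj"

def measure_energy(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with open(MAX_RANGE_FILE) as f:
            max_range = int(f.read())
        with open(ENERGY_FILE) as f:
            before = int(f.read())
        result = func(*args, **kwargs)
        with open(ENERGY_FILE) as f:
            after = int(f.read())
        joules = ((after - before) % max_range) / 1e6   # handle counter wrap-around
        print(f"{func.__name__}: {joules:.3f} J (package)")
        return result
    return wrapper

@measure_energy
def matmul_example(n=512):
    # Hypothetical stand-in for an instrumented DL API call.
    import numpy as np
    a = np.random.rand(n, n)
    return a @ a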
Xavier J.A., Muriedas J.P., Nassyr S., Sedona R., Götz M., Streit A., Riedel M., Cavallaro G.
IEEE Access scimago Q1 wos Q2 Open Access
2024-11-27 citations by CoLab: 0

Top-30

Journals [chart omitted]

Publishers [chart omitted]
