Transactions on Architecture and Code Optimization, volume 11, issue 3, pages 1-25

Hardware Fault Recovery for I/O Intensive Applications

Pradeep Ramachandran 1
Siva Kumar Sastry Hari 2
Manlap Li 3
Sarita V. Adve 4
1 Intel Corporation, Sarjapur Outer Ring Road, Bangalore
2 NVIDIA, San Tomas Expy, Santa Clara, CA
3 Latham and Watkins LLP, San Francisco, CA
Publication type: Journal Article
Publication date: 2014-10-27
Quartile: Q2
SJR: 0.628
CiteScore: 3.6
Impact factor: 1.5
ISSN: 1544-3566, 1544-3973
Subject areas: Hardware and Architecture; Information Systems; Software

Abstract

With continued process scaling, the rate of hardware failures in commodity systems is increasing. Because these commodity systems are highly sensitive to cost, traditional solutions that employ heavy redundancy to handle such failures are no longer acceptable owing to their high associated costs.

Detecting such faults by identifying anomalous software execution and recovering through checkpoint-and-replay is emerging as a viable low-cost alternative for future commodity systems. An important but commonly ignored aspect of such solutions is ensuring that external outputs to the system are fault-free. The outputs must be delayed until the detectors guarantee this, influencing fault-free performance. The overheads for resiliency must thus be evaluated while taking these delays into consideration; prior work has largely ignored this relationship.

This article concerns recovery for I/O intensive applications from in-core faults. We present a strategy to buffer external outputs using dedicated hardware and show that checkpoint intervals previously considered as acceptable incur exorbitant overheads when hardware buffering is considered. We then present two techniques to reduce the checkpoint interval and demonstrate a practical solution that provides high resiliency while incurring low overheads.
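
The output-buffering idea in the abstract can be illustrated with a minimal software sketch. This is not the authors' hardware design: the class `OutputBuffer`, its methods, and the `commit` sink are hypothetical names chosen for illustration, and the model assumes the detectors deliver a single verdict per checkpoint interval. Outputs are held until the interval is declared fault-free, and are discarded on rollback so that replay can regenerate them.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OutputBuffer:
    """Holds external outputs until their interval is declared fault-free.

    Hypothetical illustration only; the article describes dedicated hardware.
    """
    commit: Callable[[bytes], None]  # assumed sink, e.g. a device write
    pending: List[bytes] = field(default_factory=list)

    def write(self, data: bytes) -> None:
        # Outputs are never released directly; they wait in the buffer.
        self.pending.append(data)

    def interval_verified(self) -> int:
        # Detectors reported no fault: release everything buffered so far.
        released = len(self.pending)
        for data in self.pending:
            self.commit(data)
        self.pending.clear()
        return released

    def rollback(self) -> int:
        # A fault was detected: drop unverified outputs; replay regenerates them.
        dropped = len(self.pending)
        self.pending.clear()
        return dropped


def run_demo() -> List[bytes]:
    device: List[bytes] = []
    buf = OutputBuffer(commit=device.append)
    buf.write(b"a")
    buf.write(b"b")
    buf.interval_verified()   # interval 1 is clean, so both outputs are released
    buf.write(b"corrupt")
    buf.rollback()            # interval 2 is faulty, so its output never leaves
    buf.write(b"c")           # replayed execution produces the correct output
    buf.interval_verified()
    return device
```

In this sketch the delay seen by an output is bounded by the length of the checkpoint interval, which is one way to see why shorter intervals matter once buffering is taken into account.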

Top-30

Journals
Journal of Physics: Conference Series: 1 publication, 14.29%

Publishers
Institute of Electrical and Electronics Engineers (IEEE): 5 publications, 71.43%
Association for Computing Machinery (ACM): 1 publication, 14.29%
IOP Publishing: 1 publication, 14.29%
  • We do not take into account publications without a DOI.
  • Statistics recalculated only for publications connected to researchers, organizations and labs registered on the platform.
  • Statistics recalculated weekly.

Cite this
GOST
Ramachandran P. et al. Hardware Fault Recovery for I/O Intensive Applications // Transactions on Architecture and Code Optimization. 2014. Vol. 11. No. 3. pp. 1-25.
GOST (all authors, up to 50)
Ramachandran P., Hari S. K. S., Li M., Adve S. V. Hardware Fault Recovery for I/O Intensive Applications // Transactions on Architecture and Code Optimization. 2014. Vol. 11. No. 3. pp. 1-25.
RIS
TY - JOUR
DO - 10.1145/2656342
UR - https://doi.org/10.1145/2656342
TI - Hardware Fault Recovery for I/O Intensive Applications
T2 - Transactions on Architecture and Code Optimization
AU - Ramachandran, Pradeep
AU - Hari, Siva Kumar Sastry
AU - Li, Manlap
AU - Adve, Sarita V.
PY - 2014
DA - 2014/10/27
PB - Association for Computing Machinery (ACM)
SP - 1-25
IS - 3
VL - 11
SN - 1544-3566
SN - 1544-3973
ER -
BibTeX (up to 50 authors)
@article{2014_Ramachandran,
author = {Pradeep Ramachandran and Siva Kumar Sastry Hari and Manlap Li and Sarita V. Adve},
title = {Hardware Fault Recovery for I/O Intensive Applications},
journal = {Transactions on Architecture and Code Optimization},
year = {2014},
volume = {11},
publisher = {Association for Computing Machinery (ACM)},
month = {oct},
url = {https://doi.org/10.1145/2656342},
number = {3},
pages = {1--25},
doi = {10.1145/2656342}
}
MLA
Ramachandran, Pradeep, et al. “Hardware Fault Recovery for I/O Intensive Applications.” Transactions on Architecture and Code Optimization, vol. 11, no. 3, Oct. 2014, pp. 1-25. https://doi.org/10.1145/2656342.