Optimization of checkpointing-related I/O for high-performance parallel and distributed computing

Rajagopal Subramaniyan; Eric Grobelny; Scott Studham; Alan D. George

doi:10.1007/s11227-007-0162-0

Information Technology

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Checkpointing, the process of saving program/application state, usually to a stable storage, has been the most common fault-tolerance methodology for high-performance applications. The rate of checkpointing (how often) is primarily driven by the failure rate of the system. If the checkpointing rate is low, fewer resources are consumed but the chance of high computational loss is increased and vice versa if the checkpointing rate is high. It is important to strike a balance, and an optimum rate of checkpointing is required. In this paper, we analytically model the process of checkpointing in terms of mean-time-between-failure of the system, amount of memory being checkpointed, sustainable I/O bandwidth to the stable storage, and frequency of checkpointing. We identify the optimum frequency of checkpointing to be used on systems with given specifications thereby making way for efficient use of available resources and maximum performance of the system without compromising on the fault-tolerance aspects. Further, we develop discrete-event models simulating the checkpointing process to verify the analytical model for optimum checkpointing. Using the analytical model, we also investigate the optimum rate of checkpointing for systems of varying resource levels ranging from small embedded cluster systems to large supercomputers.
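The abstract names the model's inputs (mean time between failures, amount of memory checkpointed, sustainable I/O bandwidth, and checkpointing frequency) but does not state the model itself. As a rough illustration of the trade-off only, the sketch below substitutes Young's classical first-order approximation, interval ≈ sqrt(2 · C · MTBF), where C is the time to write one checkpoint (memory divided by bandwidth). This is a stand-in and not the paper's analytical model; the function names and the sample system figures are hypothetical.

```python
import math


def checkpoint_cost_s(memory_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to write one checkpoint: state size divided by sustained I/O bandwidth."""
    if memory_gb <= 0 or bandwidth_gb_per_s <= 0:
        raise ValueError("memory and bandwidth must be positive")
    return memory_gb / bandwidth_gb_per_s


def young_interval_s(cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints.

    Assumption: checkpoint cost is small relative to the MTBF. This is NOT
    the model derived in the paper, only a well-known approximation.
    """
    if cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("cost and MTBF must be positive")
    return math.sqrt(2.0 * cost_s * mtbf_s)


if __name__ == "__main__":
    # Hypothetical system: 1024 GB of state, 8 GB/s to stable storage, 24 h MTBF.
    cost = checkpoint_cost_s(1024.0, 8.0)
    interval = young_interval_s(cost, 24 * 3600.0)
    print(f"checkpoint cost: {cost:.0f} s, suggested interval: {interval:.0f} s")
```

Under this approximation, a longer MTBF or a faster storage path lengthens the suggested interval, which matches the qualitative balance the abstract describes between resource consumption and lost computation.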

Original language	English (US)
Pages (from-to)	150-180
Number of pages	31
Journal	Journal of Supercomputing
Volume	46
Issue number	2
DOIs	https://doi.org/10.1007/s11227-007-0162-0
State	Published - Nov 1 2008

Keywords

Checkpointing
Distributed computing
Fault tolerance
High-performance computing
Modeling
Parallel computing
Supercomputing
Technology growth

Cite this

@article{2d251ce552634e7787d6d75216039e2f,

title = "Optimization of checkpointing-related I/O for high-performance parallel and distributed computing",

abstract = "Checkpointing, the process of saving program/application state, usually to a stable storage, has been the most common fault-tolerance methodology for high-performance applications. The rate of checkpointing (how often) is primarily driven by the failure rate of the system. If the checkpointing rate is low, fewer resources are consumed but the chance of high computational loss is increased and vice versa if the checkpointing rate is high. It is important to strike a balance, and an optimum rate of checkpointing is required. In this paper, we analytically model the process of checkpointing in terms of mean-time-between-failure of the system, amount of memory being checkpointed, sustainable I/O bandwidth to the stable storage, and frequency of checkpointing. We identify the optimum frequency of checkpointing to be used on systems with given specifications thereby making way for efficient use of available resources and maximum performance of the system without compromising on the fault-tolerance aspects. Further, we develop discrete-event models simulating the checkpointing process to verify the analytical model for optimum checkpointing. Using the analytical model, we also investigate the optimum rate of checkpointing for systems of varying resource levels ranging from small embedded cluster systems to large supercomputers.",

keywords = "Checkpointing, Distributed computing, Fault tolerance, High-performance computing, Modeling, Parallel computing, Supercomputing, Technology growth",

author = "Subramaniyan, Rajagopal and Grobelny, Eric and Studham, Scott and George, Alan D.",

year = "2008",

month = nov,

day = "1",

doi = "10.1007/s11227-007-0162-0",

language = "English (US)",

volume = "46",

pages = "150--180",

journal = "Journal of Supercomputing",

issn = "0920-8542",

publisher = "Springer Netherlands",

number = "2",

}

TY - JOUR

T1 - Optimization of checkpointing-related I/O for high-performance parallel and distributed computing

AU - Subramaniyan, Rajagopal

AU - Grobelny, Eric

AU - Studham, Scott

AU - George, Alan D.

PY - 2008/11/1

Y1 - 2008/11/1

AB - Checkpointing, the process of saving program/application state, usually to a stable storage, has been the most common fault-tolerance methodology for high-performance applications. The rate of checkpointing (how often) is primarily driven by the failure rate of the system. If the checkpointing rate is low, fewer resources are consumed but the chance of high computational loss is increased and vice versa if the checkpointing rate is high. It is important to strike a balance, and an optimum rate of checkpointing is required. In this paper, we analytically model the process of checkpointing in terms of mean-time-between-failure of the system, amount of memory being checkpointed, sustainable I/O bandwidth to the stable storage, and frequency of checkpointing. We identify the optimum frequency of checkpointing to be used on systems with given specifications thereby making way for efficient use of available resources and maximum performance of the system without compromising on the fault-tolerance aspects. Further, we develop discrete-event models simulating the checkpointing process to verify the analytical model for optimum checkpointing. Using the analytical model, we also investigate the optimum rate of checkpointing for systems of varying resource levels ranging from small embedded cluster systems to large supercomputers.

KW - Checkpointing

KW - Distributed computing

KW - Fault tolerance

KW - High-performance computing

KW - Modeling

KW - Parallel computing

KW - Supercomputing

KW - Technology growth

UR - http://www.scopus.com/inward/record.url?scp=54149107334&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=54149107334&partnerID=8YFLogxK

U2 - 10.1007/s11227-007-0162-0

DO - 10.1007/s11227-007-0162-0

M3 - Article

AN - SCOPUS:54149107334

SN - 0920-8542

VL - 46

SP - 150

EP - 180

JO - Journal of Supercomputing

JF - Journal of Supercomputing

IS - 2

ER -
