Efficient and Fast Approximate Consensus with Epidemic Failure Detection at Extreme Scale

Amogh Katti, David J. Lilja

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper proposes a memory-efficient failure detection and consensus algorithm, based on epidemic protocols, for fail-stop process failures. It is suitable for extreme-scale systems with reliable networks (no message loss) and high failure frequency. Communication time dominates execution time at scale. The redundant failure detections and non-uniform information dissemination speed of epidemic algorithms make approximate epidemic-based consensus detection a useful way to trade a small loss of accuracy for lower communication overhead. An approximate technique for consensus detection is also proposed in this paper for faster consensus detection. Results show that the algorithm correctly detects consensus on failed processes with logarithmic scalability. The algorithm tolerates process failures both before and during execution, and the number of failures (occurring both before and during execution) has virtually no effect on the consensus detection time at scale. Comparison with a similar deterministic consensus detection technique shows that the algorithm detects consensus at the same time with high probability. Further, the benefits of the proposed approximate technique increase as system size increases. Compared to the non-approximate version, for a system size of 2^18 processes, the communication saved is 34% with an accuracy loss of the order of 10^-4 in consensus detection.
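The logarithmic scalability mentioned in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' algorithm: it assumes a simple push-gossip model in which one live process initially knows the full set of failed processes, every live process forwards what it knows to one uniformly random peer per cycle, and no message is lost. The function name `gossip_cycles` and its parameters are hypothetical, chosen for illustration only.

```python
import random


def gossip_cycles(n, failed, seed=0):
    """Count push-gossip cycles until every live process knows `failed`.

    Illustrative model only (an assumption, not the paper's protocol):
    each cycle, every live process sends its known-failed set to one
    uniformly random other process; the network never drops messages.
    """
    rng = random.Random(seed)
    failed = set(failed)
    alive = [p for p in range(n) if p not in failed]
    if not alive:
        raise ValueError("at least one process must be alive")

    # Assumption: a single live process has detected all failures at the start.
    known = {p: set() for p in alive}
    known[alive[0]] = set(failed)

    cycles = 0
    while any(known[p] != failed for p in alive):
        # Snapshot the state so data received this cycle is forwarded next cycle.
        snapshot = {p: set(s) for p, s in known.items()}
        for p in alive:
            # Pick a peer uniformly among the other n - 1 processes.
            target = rng.randrange(n - 1)
            if target >= p:
                target += 1
            # A message sent to a failed process is simply never processed.
            if target in known:
                known[target] |= snapshot[p]
        cycles += 1
    return cycles


if __name__ == "__main__":
    for exponent in (8, 10, 12):
        n = 2 ** exponent
        print(n, gossip_cycles(n, failed={3, 7}, seed=1))
```

Running the script for growing system sizes should show the cycle count rising slowly, roughly in proportion to log n, rather than in proportion to n. Because the number of informed processes can at most double per cycle in this model, the count can never drop below the base-2 logarithm of the number of live processes.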

Original language: English (US)
Title of host publication: Proceedings - 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018
Editors: Igor Kotenko, Ivan Merelli, Pietro Lio
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 267-272
Number of pages: 6
ISBN (Electronic): 9781538649756
State: Published - Jun 6 2018
Event: 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018 - Cambridge, United Kingdom
Duration: Mar 21 2018 → Mar 23 2018

Publication series

Name: Proceedings - 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018

Other

Other: 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018
Country/Territory: United Kingdom
City: Cambridge
Period: 3/21/18 → 3/23/18

Keywords

  • Approximate consensus
  • Communication reduction
  • Consensus
  • Failure detection
