Computing in the RAIN: A reliable array of independent nodes?

Vasken Bohossian, Charles C. Fan, Paul S. LeMahieu, Marc D. Riedel, Lihao Xu, Jehoshua Bruck

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to RAINfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures; 2) fault management techniques based on group membership; and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: highly available video and web servers, and a distributed checkpointing system.

Original language: English (US)
Title of host publication: Parallel and Distributed Processing - 15 IPDPS 2000 Workshops, Proceedings
Editors: Jose Rolim
Publisher: Springer Verlag
Pages: 1204-1213
Number of pages: 10
ISBN (Print): 354067442X, 9783540674429
State: Published - 2000
Event: 15 Workshops Held in Conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000 - Cancun, Mexico
Duration: May 1, 2000 to May 5, 2000

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1800 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 15 Workshops Held in Conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000
Country/Territory: Mexico
City: Cancun
Period: 5/1/00 to 5/5/00
