Scheduling strategies for parallel and distributed computing have mostly been oriented toward performance, while striving to achieve some notion of fairness. With the increase in size, complexity, and heterogeneity of today's computing environments, we argue that, in addition to performance metrics, scheduling algorithms should be designed for robustness. That is, they should have the ability to maintain performance under a wide variety of operating conditions. Although robustness is easy to define, there are no widely used metrics for this property. To this end, we present a methodology for characterizing and measuring the robustness of a system to a specific disturbance. The methodology is easily applied to many types of computing systems and does not require sophisticated mathematical models. To illustrate its use, we show three applications of our technique to job scheduling: one supporting a previous result with respect to backfilling, one examining overload control in a streaming video server, and one comparing two different scheduling strategies for a distributed network service. The last example also demonstrates how consideration of robustness leads to better system design, as we were able to devise a new and effective scheduling heuristic.
|Original language||English (US)|
|Number of pages||9|
|Journal||Proceedings of the IEEE International Symposium on High Performance Distributed Computing|
|State||Published - Nov 10 2005|
|Event||14th IEEE International Symposium on High Performance Distributed Computing, HPDC-14 - Research Triangle Park, NC, United States|
|Duration||Jul 24 2005 → Jul 27 2005|