Large quantities of data are generated continuously over time and from disparate sources such as users, devices, and sensors located around the globe. This results in the need for efficient geo-distributed streaming analytics to extract timely information. A typical analytics service in these settings uses a simple hub-and-spoke model, comprising a single central data warehouse and multiple edges connected by a wide-area network (WAN). A key decision for a geodistributed streaming service is how much of the computation should be performed at the edge versus the center. In this paper, we examine this question in the context of windowed grouped aggregation, an important and widely used primitive in streaming queries. Our work is focused on designing aggregation algorithms to optimize two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness (the delay in getting the result). Towards this end, we present a family of optimal offline algorithms that jointly minimize both staleness and traffic. Using this as a foundation, we develop practical online aggregation algorithms based on the observation that grouped aggregation can be modeled as a caching problem where the cache size varies over time. This key insight allows us to exploit well known caching techniques in our design of online aggregation algorithms. We demonstrate the practicality of these algorithms through an implementation in Apache Storm, deployed on the PlanetLab testbed. The results of our experiments, driven by workloads derived from anonymized traces of a popular web analytics service offered by a large commercial CDN, show that our online aggregation algorithms perform close to the optimal algorithms for a variety of system configurations, stream arrival rates, and query types.
|Original language||English (US)|
|Title of host publication||HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing|
|Publisher||Association for Computing Machinery, Inc|
|Number of pages||12|
|State||Published - Jun 15 2015|
|Event||24th ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015 - Portland, United States|
Duration: Jun 15 2015 → Jun 19 2015
|Name||HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing|
|Other||24th ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015|
|Period||6/15/15 → 6/19/15|
Bibliographical notePublisher Copyright:
© 2015 ACM.
- Geo-distributed systems
- Stream processing