TY - GEN
T1 - Optimizing grouped aggregation in geo-distributed streaming analytics
AU - Heintz, Benjamin
AU - Chandra, Abhishek
AU - Sitaraman, Ramesh K.
N1 - Publisher Copyright:
© 2015 ACM.
Copyright:
Copyright 2017 Elsevier B.V., All rights reserved.
PY - 2015/6/15
Y1 - 2015/6/15
N2 - Large quantities of data are generated continuously over time and from disparate sources such as users, devices, and sensors located around the globe. This results in the need for efficient geo-distributed streaming analytics to extract timely information. A typical analytics service in these settings uses a simple hub-and-spoke model, comprising a single central data warehouse and multiple edges connected by a wide-area network (WAN). A key decision for a geodistributed streaming service is how much of the computation should be performed at the edge versus the center. In this paper, we examine this question in the context of windowed grouped aggregation, an important and widely used primitive in streaming queries. Our work is focused on designing aggregation algorithms to optimize two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness (the delay in getting the result). Towards this end, we present a family of optimal offline algorithms that jointly minimize both staleness and traffic. Using this as a foundation, we develop practical online aggregation algorithms based on the observation that grouped aggregation can be modeled as a caching problem where the cache size varies over time. This key insight allows us to exploit well known caching techniques in our design of online aggregation algorithms. We demonstrate the practicality of these algorithms through an implementation in Apache Storm, deployed on the PlanetLab testbed. The results of our experiments, driven by workloads derived from anonymized traces of a popular web analytics service offered by a large commercial CDN, show that our online aggregation algorithms perform close to the optimal algorithms for a variety of system configurations, stream arrival rates, and query types.
AB - Large quantities of data are generated continuously over time and from disparate sources such as users, devices, and sensors located around the globe. This results in the need for efficient geo-distributed streaming analytics to extract timely information. A typical analytics service in these settings uses a simple hub-and-spoke model, comprising a single central data warehouse and multiple edges connected by a wide-area network (WAN). A key decision for a geodistributed streaming service is how much of the computation should be performed at the edge versus the center. In this paper, we examine this question in the context of windowed grouped aggregation, an important and widely used primitive in streaming queries. Our work is focused on designing aggregation algorithms to optimize two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness (the delay in getting the result). Towards this end, we present a family of optimal offline algorithms that jointly minimize both staleness and traffic. Using this as a foundation, we develop practical online aggregation algorithms based on the observation that grouped aggregation can be modeled as a caching problem where the cache size varies over time. This key insight allows us to exploit well known caching techniques in our design of online aggregation algorithms. We demonstrate the practicality of these algorithms through an implementation in Apache Storm, deployed on the PlanetLab testbed. The results of our experiments, driven by workloads derived from anonymized traces of a popular web analytics service offered by a large commercial CDN, show that our online aggregation algorithms perform close to the optimal algorithms for a variety of system configurations, stream arrival rates, and query types.
KW - Aggregation
KW - Geo-distributed systems
KW - Storm
KW - Stream processing
UR - http://www.scopus.com/inward/record.url?scp=84987704924&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84987704924&partnerID=8YFLogxK
U2 - 10.1145/2749246.2749276
DO - 10.1145/2749246.2749276
M3 - Conference contribution
AN - SCOPUS:84987704924
T3 - HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
SP - 133
EP - 144
BT - HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
T2 - 24th ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015
Y2 - 15 June 2015 through 19 June 2015
ER -