Many applications must ingest rapid data streams and produce analytics results in near-real-time. It is increasingly common for inputs to such applications to originate from geographically distributed sources. The typical infrastructure for processing such geo-distributed streams follows a hub-and-spoke model, where several edge servers perform partial computation before forwarding results over a wide-area network (WAN) to a central location for final processing. Due to limited WAN bandwidth, it is not always possible to produce exact results. In such cases, applications must either sacrifice timeliness by allowing delayed (i.e., stale) results, or sacrifice accuracy by allowing some error in final results. In this paper, we focus on windowed grouped aggregation, an important and widely used primitive in streaming analytics, and we study the tradeoff between staleness and error. We present optimal offline algorithms for minimizing staleness under an error constraint and for minimizing error under a staleness constraint. Using these offline algorithms as references, we present practical online algorithms for effectively trading off timeliness and accuracy under bandwidth limitations. Using a workload derived from an analytics service offered by a large commercial CDN, we demonstrate the effectiveness of our techniques through both trace-driven simulation and experiments on an Apache Storm-based implementation deployed on PlanetLab. Our experiments show that our proposed algorithms reduce staleness by 81.8% to 96.6%, and error by 83.4% to 99.1%, compared to a practical random sampling/batching-based aggregation algorithm across a diverse set of aggregation functions.
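For readers unfamiliar with the primitive, the following is a minimal sketch of windowed grouped aggregation in a hub-and-spoke layout, assuming tumbling windows and a sum aggregate. It is not taken from the paper: the function names (`edge_partial_sums`, `center_merge`), the record layout `(timestamp, key, value)`, and the window size are all hypothetical, and the sketch does not model WAN bandwidth limits or the staleness/error tradeoff the paper studies.

```python
from collections import defaultdict


def edge_partial_sums(records, window_size):
    """At an edge: bucket (timestamp, key, value) records into tumbling
    windows and keep one partial sum per (window, key) pair."""
    partials = defaultdict(float)
    for timestamp, key, value in records:
        window = int(timestamp // window_size)  # index of the tumbling window
        partials[(window, key)] += value
    return dict(partials)


def center_merge(partials_from_edges):
    """At the center: combine the partial sums received from every edge
    into the final per-window, per-key totals."""
    totals = defaultdict(float)
    for partials in partials_from_edges:
        for window_key, partial in partials.items():
            totals[window_key] += partial
    return dict(totals)


# Two hypothetical edges, each seeing part of the stream.
edge_a = edge_partial_sums(
    [(0.5, "us", 2.0), (1.2, "us", 3.0), (10.1, "eu", 1.0)], window_size=10
)
edge_b = edge_partial_sums([(3.0, "us", 4.0), (12.0, "eu", 5.0)], window_size=10)

result = center_merge([edge_a, edge_b])
print(result)
```

In this sketch each edge sends one partial sum per (window, key) pair instead of every raw record, which is the kind of partial computation the hub-and-spoke model relies on to reduce WAN traffic.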
|Original language||English (US)|
|Title of host publication||Proceedings of the 7th ACM Symposium on Cloud Computing, SoCC 2016|
|Editors||Marcos K. Aguilera, Brian Cooper, Yanlei Diao|
|Publisher||Association for Computing Machinery, Inc|
|Number of pages||13|
|State||Published - Oct 5 2016|
|Event||7th ACM Symposium on Cloud Computing, SoCC 2016 - Santa Clara, United States|
Duration: Oct 5 2016 → Oct 7 2016
|Name||Proceedings of the 7th ACM Symposium on Cloud Computing, SoCC 2016|
|Other||7th ACM Symposium on Cloud Computing, SoCC 2016|
|Period||10/5/16 → 10/7/16|
Bibliographical note
Funding Information:
The authors would like to acknowledge NSF Grant CNS-1413998, and an IBM Faculty Award, which supported this research.
- Geo-distributed systems
- Stream processing