ROME: All Overlays Lead to Aggregation, but Some Are Faster than Others

Blocher, M.; Coppa, E.; Kleber, P.; Eugster, P.; Culhane, W.; Ardekani, M. S.

doi:10.1145/3516430

Aggregation is common in data analytics and crucial to distilling information from large datasets, but current data analytics frameworks do not fully exploit the potential for optimization in such phases. The lack of optimization is particularly notable in current "online" approaches that store data in main memory across nodes, shifting the bottleneck away from disk I/O toward network and compute resources, thus increasing the relative performance impact of distributed aggregation phases. We present ROME, an aggregation system for use within data analytics frameworks or in isolation. ROME uses a set of novel heuristics based primarily on basic knowledge of aggregation functions combined with deployment constraints to efficiently aggregate results from computations performed on individual data subsets across nodes (e.g., merging sorted lists resulting from top-k). The user can either provide minimal information that allows our heuristics to be applied directly, or ROME can autodetect the relevant information at little cost. We integrated ROME as a subsystem into the Spark and Flink data analytics frameworks. We use real-world data to experimentally demonstrate speedups of up to 3× over single-level aggregation overlays, up to 21% over other multi-level overlays, and 50% for iterative algorithms like gradient descent at 100 iterations.
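To make the abstract's top-k example concrete, the following is a minimal sketch of merging sorted partial results through either a single-level or a multi-level overlay. It is an illustration only and does not reproduce ROME's heuristics: the function names `merge_top_k` and `aggregate_tree` and the `fanout` parameter are assumptions introduced here, and each list merge stands in for the work done by one hypothetical aggregator node.

```python
import heapq
from itertools import islice
from typing import List


def merge_top_k(partials: List[List[int]], k: int) -> List[int]:
    """Merge descending-sorted partial lists and keep the k largest values."""
    # heapq.merge with reverse=True expects every input sorted largest-first.
    merged = heapq.merge(*partials, reverse=True)
    return list(islice(merged, k))


def aggregate_tree(partials: List[List[int]], k: int, fanout: int) -> List[int]:
    """Aggregate level by level; each group of `fanout` lists is merged
    by one (hypothetical) aggregator node until a single list remains."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = partials
    while len(level) > 1:
        level = [
            merge_top_k(level[i:i + fanout], k)
            for i in range(0, len(level), fanout)
        ]
    return level[0] if level else []


# Four nodes, each holding a local top-3 list sorted in descending order.
partials = [[9, 7, 1], [8, 6, 2], [10, 3, 0], [5, 4, 4]]

single_level = aggregate_tree(partials, k=3, fanout=len(partials))  # one root merges all
two_level = aggregate_tree(partials, k=3, fanout=2)                 # binary tree overlay
```

Both overlays return the same top-k result; they differ only in how the merge work and the data transfer are spread across levels, which is the kind of trade-off an overlay choice has to weigh.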

ROME: All Overlays Lead to Aggregation, but Some Are Faster than Others / Blocher, M., Coppa, E., Kleber, P., Eugster, P., Culhane, W., Ardekani, M.S. - In: ACM TRANSACTIONS ON COMPUTER SYSTEMS. - ISSN 0734-2071. - 39:1-4(2022), pp. 1-33. [10.1145/3516430]

Blocher M.; Coppa E. (co-first author); Kleber P.; Eugster P.; Culhane W.; Ardekani M. S.

2022

Publication year: 2022
Keywords: big data aggregation overlay; distributed algorithms
Type: 01 Journal publication::01a Journal article
Belongs to type: 01a Journal article

Files attached to this product

File: Blocher_ROME_2022.pdf
Access: open access
Note: https://doi.org/10.1145/3516430
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 1.96 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1659871
