Catalogo dei prodotti della ricerca

Nearest-neighbor searching systems are an integral part of many online applications, including but not limited to pattern recognition, plagiarism detection and recommender systems. With increasingly larger data sets, scalability has become an important issue. Many of the most space and running time efficient algorithms are based on locality sensitive hashing. The de facto standard approach to quickly answer nearest-neighbor queries on such a data set is usually a form of min-hashing. Not only is min-hashing very fast, but it is also space efficient and can be implemented in many computational models aimed at dealing with large data sets such as MapReduce and streaming. A significant drawback is that minhashing and related methods are only able to handle insertions to user profiles and tend to perform poorly when items may be removed. We initiate the study of scalable locality sensitive hashing (LSH) for dynamic data-streams. Specifically, using the Jaccard index as similarity measure, we design (1) a nearest-neighbor datastructure maintainable in dynamic data streams and (2) a sketching algorithm for similarity estimation. Our algorithms have little overhead in terms of running time compared to previous LSH approaches for the insertion streams, and drastically outperform previous algorithms in case of deletions

Similarity Search for Dynamic Data Streams / Bury, Marc; Schwiegelshohn, Chris; Sorella, Mara. - In: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. - ISSN 1041-4347. - 32:11(2020), pp. 2241-2253. [10.1109/TKDE.2019.2916858]

Similarity Search for Dynamic Data Streams

Bury, Marc;Schwiegelshohn, Chris;Sorella, Mara

2020

Abstract

Nearest-neighbor searching systems are an integral part of many online applications, including but not limited to pattern recognition, plagiarism detection and recommender systems. With increasingly larger data sets, scalability has become an important issue. Many of the most space and running time efficient algorithms are based on locality sensitive hashing. The de facto standard approach to quickly answer nearest-neighbor queries on such a data set is usually a form of min-hashing. Not only is min-hashing very fast, but it is also space efficient and can be implemented in many computational models aimed at dealing with large data sets such as MapReduce and streaming. A significant drawback is that minhashing and related methods are only able to handle insertions to user profiles and tend to perform poorly when items may be removed. We initiate the study of scalable locality sensitive hashing (LSH) for dynamic data-streams. Specifically, using the Jaccard index as similarity measure, we design (1) a nearest-neighbor datastructure maintainable in dynamic data streams and (2) a sketching algorithm for similarity estimation. Our algorithms have little overhead in terms of running time compared to previous LSH approaches for the insertion streams, and drastically outperform previous algorithms in case of deletions

Scheda breve

Scheda completa

	Anno di pubblicazione
	
				2020
			
	Parole chiave
	
				Dynamic Streaming; Locality-Sensitive Hashing; Nearest Neighbor Searching
			
	Tipologia
	
				01 Pubblicazione su rivista::01a Articolo in rivista
			
	Citazione
	
				Similarity Search for Dynamic Data Streams / Bury, Marc; Schwiegelshohn, Chris; Sorella, Mara. - In: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. - ISSN 1041-4347. - 32:11(2020), pp. 2241-2253. [10.1109/TKDE.2019.2916858]
			
	Appartiene alla tipologia:
	
				01a Articolo in rivista

File allegati a questo prodotto

File	Dimensione	Formato
Bury_postprint_Similarity_2019.pdf accesso aperto Note: https://ieeexplore.ieee.org/document/8713878 Tipologia: Documento in Post-print (versione successiva alla peer review e accettata per la pubblicazione) Licenza: Tutti i diritti riservati (All rights reserved) Dimensione 4.44 MB Formato Adobe PDF	4.44 MB	Adobe PDF
Bury_Similarity_2020.pdf solo gestori archivio Tipologia: Versione editoriale (versione pubblicata con il layout dell'editore) Licenza: Tutti i diritti riservati (All rights reserved) Dimensione 1.41 MB Formato Adobe PDF Contatta l'autore	1.41 MB	Adobe PDF	Contatta l'autore

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/1399064

Citazioni

ND

5

5

social impact