L2AP: Fast cosine similarity search with prefix L-2 norm bounds

David C. Anastasiu; George Karypis

doi:10.1109/ICDE.2014.6816700

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

36 Scopus citations

Abstract

The All-Pairs similarity search, or self-similarity join problem, finds all pairs of vectors in a high dimensional sparse dataset with a similarity value higher than a given threshold. The problem has been classically solved using a dynamically built inverted index. The search time is reduced by early pruning of candidates using size and value-based bounds on the similarity. In the context of cosine similarity and weighted vectors, leveraging the Cauchy-Schwarz inequality, we propose new ℓ2-norm bounds for reducing the inverted index size, candidate pool size, and the number of full dot-product computations. We tighten previous candidate generation and verification bounds and introduce several new ones to further improve our algorithm's performance. Our new pruning strategies enable significant speedups over baseline approaches, most times outperforming even approximate solutions. We perform an extensive evaluation of our algorithm, L2AP, and compare against state-of-the-art exact and approximate methods, AllPairs, MMJoin, and BayesLSH, across a variety of real-world datasets and similarity thresholds.
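The index-reduction idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' L2AP implementation; it is a simplified illustration that assumes nonzero sparse vectors stored as Python dicts, a threshold in (0, 1], and a fixed feature order (sorted feature names). For unit-length vectors x and y, if y' denotes a prefix of y, the Cauchy-Schwarz inequality gives dot(x, y') <= ||x|| * ||y'|| = ||y'||. Features of y can therefore be left out of the inverted index for as long as the prefix ℓ2 norm stays below the threshold, because a vector that shares features with y only in that prefix cannot reach the threshold. The function names `normalize`, `dot`, and `all_pairs_l2_prefix` are hypothetical and chosen only for this example.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: feature -> weight) to unit l2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()}


def dot(a, b):
    """Sparse dot product; iterates over the shorter of the two vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[f] for f, w in a.items() if f in b)


def all_pairs_l2_prefix(vectors, threshold):
    """Return {(i, j): similarity} for all i < j with cosine >= threshold.

    Assumes nonzero vectors and 0 < threshold <= 1.
    """
    unit = [normalize(v) for v in vectors]
    index = defaultdict(list)  # feature -> ids of vectors indexed under it
    result = {}
    for j, x in enumerate(unit):
        # Candidate generation: earlier vectors sharing an indexed feature.
        candidates = set()
        for f in x:
            candidates.update(index.get(f, ()))
        # Verification: exact dot product on unit vectors is the cosine.
        for i in candidates:
            sim = dot(unit[i], x)
            if sim >= threshold:
                result[(i, j)] = sim
        # Indexing: skip the leading features while the prefix l2 norm
        # (including the current feature) is still below the threshold.
        prefix_sq = 0.0
        for f in sorted(x):
            prefix_sq += x[f] ** 2
            if math.sqrt(prefix_sq) >= threshold:
                index[f].append(j)
    return result


if __name__ == "__main__":
    docs = [
        {"a": 1, "b": 2, "c": 1},
        {"a": 1, "b": 2, "c": 2},
        {"d": 3, "e": 1},
        {"c": 1, "d": 3, "e": 1},
        {"a": 5},
    ]
    for pair, sim in sorted(all_pairs_l2_prefix(docs, 0.8).items()):
        print(pair, round(sim, 4))
```

The sketch covers only the index-size reduction; the candidate-pool and dot-product pruning bounds described in the abstract are not modeled here, and every candidate is verified with a full dot product.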

Original language	English (US)
Title of host publication	2014 IEEE 30th International Conference on Data Engineering, ICDE 2014
Publisher	IEEE Computer Society
Pages	784-795
Number of pages	12
ISBN (Print)	9781479925544
DOIs	https://doi.org/10.1109/ICDE.2014.6816700
State	Published - 2014
Event	30th IEEE International Conference on Data Engineering, ICDE 2014 - Chicago, IL, United States
Duration	Mar 31 2014 → Apr 4 2014

Publication series

Name	Proceedings - International Conference on Data Engineering
ISSN (Print)	1084-4627

Other

Other	30th IEEE International Conference on Data Engineering, ICDE 2014
Country/Territory	United States
City	Chicago, IL
Period	3/31/14 → 4/4/14

Cite this

L2AP: Fast cosine similarity search with prefix L-2 norm bounds. / Anastasiu, David C.; Karypis, George.
2014 IEEE 30th International Conference on Data Engineering, ICDE 2014. IEEE Computer Society, 2014. p. 784-795 6816700 (Proceedings - International Conference on Data Engineering).

Anastasiu, DC & Karypis, G 2014, L2AP: Fast cosine similarity search with prefix L-2 norm bounds. in 2014 IEEE 30th International Conference on Data Engineering, ICDE 2014., 6816700, Proceedings - International Conference on Data Engineering, IEEE Computer Society, pp. 784-795, 30th IEEE International Conference on Data Engineering, ICDE 2014, Chicago, IL, United States, 3/31/14. https://doi.org/10.1109/ICDE.2014.6816700

@inproceedings{993c66a97cc14d609318d6fd104cb6c2,

title = "L2AP: Fast cosine similarity search with prefix L-2 norm bounds",

abstract = "The All-Pairs similarity search, or self-similarity join problem, finds all pairs of vectors in a high dimensional sparse dataset with a similarity value higher than a given threshold. The problem has been classically solved using a dynamically built inverted index. The search time is reduced by early pruning of candidates using size and value-based bounds on the similarity. In the context of cosine similarity and weighted vectors, leveraging the Cauchy-Schwarz inequality, we propose new ℓ2-norm bounds for reducing the inverted index size, candidate pool size, and the number of full dot-product computations. We tighten previous candidate generation and verification bounds and introduce several new ones to further improve our algorithm's performance. Our new pruning strategies enable significant speedups over baseline approaches, most times outperforming even approximate solutions. We perform an extensive evaluation of our algorithm, L2AP, and compare against state-of-the-art exact and approximate methods, AllPairs, MMJoin, and BayesLSH, across a variety of real-world datasets and similarity thresholds.",

author = "Anastasiu, {David C.} and George Karypis",

year = "2014",

doi = "10.1109/ICDE.2014.6816700",

language = "English (US)",

isbn = "9781479925544",

series = "Proceedings - International Conference on Data Engineering",

publisher = "IEEE Computer Society",

pages = "784--795",

booktitle = "2014 IEEE 30th International Conference on Data Engineering, ICDE 2014",

}

TY - GEN

T1 - L2AP

T2 - 30th IEEE International Conference on Data Engineering, ICDE 2014

AU - Anastasiu, David C.

AU - Karypis, George

PY - 2014

Y1 - 2014

AB - The All-Pairs similarity search, or self-similarity join problem, finds all pairs of vectors in a high dimensional sparse dataset with a similarity value higher than a given threshold. The problem has been classically solved using a dynamically built inverted index. The search time is reduced by early pruning of candidates using size and value-based bounds on the similarity. In the context of cosine similarity and weighted vectors, leveraging the Cauchy-Schwarz inequality, we propose new ℓ2-norm bounds for reducing the inverted index size, candidate pool size, and the number of full dot-product computations. We tighten previous candidate generation and verification bounds and introduce several new ones to further improve our algorithm's performance. Our new pruning strategies enable significant speedups over baseline approaches, most times outperforming even approximate solutions. We perform an extensive evaluation of our algorithm, L2AP, and compare against state-of-the-art exact and approximate methods, AllPairs, MMJoin, and BayesLSH, across a variety of real-world datasets and similarity thresholds.

UR - http://www.scopus.com/inward/record.url?scp=84901785372&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84901785372&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2014.6816700

DO - 10.1109/ICDE.2014.6816700

M3 - Conference contribution

AN - SCOPUS:84901785372

SN - 9781479925544

T3 - Proceedings - International Conference on Data Engineering

SP - 784

EP - 795

BT - 2014 IEEE 30th International Conference on Data Engineering, ICDE 2014

PB - IEEE Computer Society

Y2 - 31 March 2014 through 4 April 2014

ER -
