Scalable parallel data mining for association rules

Eui Hong Han; George Karypis; Vipin Kumar

doi:10.1109/69.846289

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

140 Scopus citations

Abstract

In this paper, we propose two new parallel formulations of the Apriori algorithm that is used for computing association rules. These new formulations, IDD and HD, address the shortcomings of two previously proposed parallel formulations CD and DD. Unlike the CD algorithm, the IDD algorithm partitions the candidate set intelligently among processors to efficiently parallelize the step of building the hash tree. The IDD algorithm also eliminates the redundant work inherent in DD, and requires substantially smaller communication overhead than DD. But IDD suffers from the added cost due to communication of transactions among processors. HD is a hybrid algorithm that combines the advantages of CD and DD. Experimental results on a 128-processor Cray T3E show that HD scales just as well as the CD algorithm with respect to the number of transactions, and scales as well as IDD with respect to increasing candidate set size.

Original language	English (US)
Pages (from-to)	337-352
Number of pages	16
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	12
Issue number	3
DOIs	https://doi.org/10.1109/69.846289
State	Published - 2000

Bibliographical note

Funding Information:
This work was supported by National Science Foundation (NSF) grant ASC-9634719, Army Research Office contract DA/DAAH04-95-1-0538, the Army High-Performance Computing Research Center under auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-008, Cray Research Inc. Fellowship, and IBM partnership award, the content of which does not necessarily reflect the policy of the government and no official endorsement should be inferred. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute, Cray Research Inc., and NSF grant CDA-9414015.

Cite this

@article{1f54a6a5e73d486796b5591b6d3165f4,
  title = "Scalable parallel data mining for association rules",
  abstract = "In this paper, we propose two new parallel formulations of the Apriori algorithm that is used for computing association rules. These new formulations, IDD and HD, address the shortcomings of two previously proposed parallel formulations CD and DD. Unlike the CD algorithm, the IDD algorithm partitions the candidate set intelligently among processors to efficiently parallelize the step of building the hash tree. The IDD algorithm also eliminates the redundant work inherent in DD, and requires substantially smaller communication overhead than DD. But IDD suffers from the added cost due to communication of transactions among processors. HD is a hybrid algorithm that combines the advantages of CD and DD. Experimental results on a 128-processor Cray T3E show that HD scales just as well as the CD algorithm with respect to the number of transactions, and scales as well as IDD with respect to increasing candidate set size.",
  author = "Han, {Eui Hong} and George Karypis and Vipin Kumar",
  note = "Funding Information: This work was supported by National Science Foundation (NSF) grant ASC-9634719, Army Research Office contract DA/DAAH04-95-1-0538, the Army High-Performance Computing Research Center under auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-008, Cray Research Inc. Fellowship, and IBM partnership award, the content of which does not necessarily reflect the policy of the government and no official endorsement should be inferred. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute, Cray Research Inc., and NSF grant CDA-9414015.",
  year = "2000",
  doi = "10.1109/69.846289",
  language = "English (US)",
  volume = "12",
  pages = "337--352",
  journal = "IEEE Transactions on Knowledge and Data Engineering",
  issn = "1041-4347",
  publisher = "IEEE Computer Society",
  number = "3",
}

TY - JOUR

T1 - Scalable parallel data mining for association rules

AU - Han, Eui Hong

AU - Karypis, George

AU - Kumar, Vipin

N1 - Funding Information: This work was supported by National Science Foundation (NSF) grant ASC-9634719, Army Research Office contract DA/DAAH04-95-1-0538, the Army High-Performance Computing Research Center under auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-008, Cray Research Inc. Fellowship, and IBM partnership award, the content of which does not necessarily reflect the policy of the government and no official endorsement should be inferred. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute, Cray Research Inc., and NSF grant CDA-9414015.

PY - 2000

Y1 - 2000

N2 - In this paper, we propose two new parallel formulations of the Apriori algorithm that is used for computing association rules. These new formulations, IDD and HD, address the shortcomings of two previously proposed parallel formulations CD and DD. Unlike the CD algorithm, the IDD algorithm partitions the candidate set intelligently among processors to efficiently parallelize the step of building the hash tree. The IDD algorithm also eliminates the redundant work inherent in DD, and requires substantially smaller communication overhead than DD. But IDD suffers from the added cost due to communication of transactions among processors. HD is a hybrid algorithm that combines the advantages of CD and DD. Experimental results on a 128-processor Cray T3E show that HD scales just as well as the CD algorithm with respect to the number of transactions, and scales as well as IDD with respect to increasing candidate set size.

AB - In this paper, we propose two new parallel formulations of the Apriori algorithm that is used for computing association rules. These new formulations, IDD and HD, address the shortcomings of two previously proposed parallel formulations CD and DD. Unlike the CD algorithm, the IDD algorithm partitions the candidate set intelligently among processors to efficiently parallelize the step of building the hash tree. The IDD algorithm also eliminates the redundant work inherent in DD, and requires substantially smaller communication overhead than DD. But IDD suffers from the added cost due to communication of transactions among processors. HD is a hybrid algorithm that combines the advantages of CD and DD. Experimental results on a 128-processor Cray T3E show that HD scales just as well as the CD algorithm with respect to the number of transactions, and scales as well as IDD with respect to increasing candidate set size.

UR - http://www.scopus.com/inward/record.url?scp=0033685260&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0033685260&partnerID=8YFLogxK

U2 - 10.1109/69.846289

DO - 10.1109/69.846289

M3 - Article

AN - SCOPUS:0033685260

SN - 1041-4347

VL - 12

SP - 337

EP - 352

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

IS - 3

ER -
