Scalable Parallel Data Mining for Association Rules

Eui Hong Han; George Karypis; Vipin Kumar

doi:10.1145/253262.253330

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

164 Scopus citations

Abstract

One of the important problems in data mining is discovering association rules from databases of transactions where each transaction consists of a set of items. The most time-consuming operation in this discovery process is the computation of the frequency of the occurrences of interesting subsets of items (called candidates) in the database of transactions. To prune the exponentially large space of candidates, most existing algorithms consider only those candidates that have a user-defined minimum support. Even with the pruning, the task of finding all association rules requires a lot of computation power and time. Parallel computers offer a potential solution to the computation requirement of this task, provided efficient and scalable parallel algorithms can be designed. In this paper, we present two new parallel algorithms for mining association rules. The Intelligent Data Distribution algorithm efficiently uses the aggregate memory of the parallel computer by employing an intelligent candidate partitioning scheme and uses an efficient communication mechanism to move data among the processors. The Hybrid Distribution algorithm further improves upon the Intelligent Data Distribution algorithm by dynamically partitioning the candidate set to maintain good load balance. The experimental results on a Cray T3D parallel computer show that the Hybrid Distribution algorithm scales linearly, exploits the aggregate memory better, and can generate more association rules with a single scan of the database per pass.
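The support-counting step that the abstract identifies as the main cost can be illustrated with a minimal serial sketch. This is not the paper's Intelligent Data Distribution or Hybrid Distribution algorithm; it only shows how candidate occurrences are counted and how a minimum support threshold prunes them. The function names `count_support` and `prune`, and the sample transactions, are assumptions made for illustration.

```python
def count_support(transactions, candidates):
    """Count, for each candidate itemset, the transactions that contain it."""
    counts = {frozenset(c): 0 for c in candidates}
    for transaction in transactions:
        items = set(transaction)
        for candidate in counts:
            # A candidate occurs in a transaction if it is a subset of its items.
            if candidate <= items:
                counts[candidate] += 1
    return counts


def prune(counts, min_support):
    """Keep only candidates whose count meets the user-defined minimum support."""
    return {c: n for c, n in counts.items() if n >= min_support}


# Hypothetical toy database of four transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]
candidates = [("bread", "milk"), ("beer", "diapers"), ("beer", "milk")]

counts = count_support(transactions, candidates)
frequent = prune(counts, min_support=2)
```

Every transaction is checked against every candidate, so the work grows with both the database size and the number of candidates. That cost is what motivates spreading the candidate set and the data across the processors of a parallel computer.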

Original language	English (US)
Pages (from-to)	277-288
Number of pages	12
Journal	SIGMOD Record (ACM Special Interest Group on Management of Data)
Volume	26
Issue number	2
DOIs	https://doi.org/10.1145/253262.253330
State	Published - Jun 1997

Cite this

@article{75495ac2447144459363eb979d88a05e,

title = "Scalable Parallel Data Mining for Association Rules",

abstract = "One of the important problems in data mining is discovering association rules from databases of transactions where each transaction consists of a set of items. The most time-consuming operation in this discovery process is the computation of the frequency of the occurrences of interesting subsets of items (called candidates) in the database of transactions. To prune the exponentially large space of candidates, most existing algorithms consider only those candidates that have a user-defined minimum support. Even with the pruning, the task of finding all association rules requires a lot of computation power and time. Parallel computers offer a potential solution to the computation requirement of this task, provided efficient and scalable parallel algorithms can be designed. In this paper, we present two new parallel algorithms for mining association rules. The Intelligent Data Distribution algorithm efficiently uses the aggregate memory of the parallel computer by employing an intelligent candidate partitioning scheme and uses an efficient communication mechanism to move data among the processors. The Hybrid Distribution algorithm further improves upon the Intelligent Data Distribution algorithm by dynamically partitioning the candidate set to maintain good load balance. The experimental results on a Cray T3D parallel computer show that the Hybrid Distribution algorithm scales linearly, exploits the aggregate memory better, and can generate more association rules with a single scan of the database per pass.",

author = "Han, {Eui Hong} and George Karypis and Vipin Kumar",

year = "1997",

month = jun,

doi = "10.1145/253262.253330",

language = "English (US)",

volume = "26",

pages = "277--288",

journal = "SIGMOD Record (ACM Special Interest Group on Management of Data)",

issn = "0163-5808",

publisher = "Association for Computing Machinery (ACM)",

number = "2",

}

TY - JOUR

T1 - Scalable Parallel Data Mining for Association Rules

AU - Han, Eui Hong

AU - Karypis, George

AU - Kumar, Vipin

PY - 1997/6

Y1 - 1997/6

N2 - One of the important problems in data mining is discovering association rules from databases of transactions where each transaction consists of a set of items. The most time-consuming operation in this discovery process is the computation of the frequency of the occurrences of interesting subsets of items (called candidates) in the database of transactions. To prune the exponentially large space of candidates, most existing algorithms consider only those candidates that have a user-defined minimum support. Even with the pruning, the task of finding all association rules requires a lot of computation power and time. Parallel computers offer a potential solution to the computation requirement of this task, provided efficient and scalable parallel algorithms can be designed. In this paper, we present two new parallel algorithms for mining association rules. The Intelligent Data Distribution algorithm efficiently uses the aggregate memory of the parallel computer by employing an intelligent candidate partitioning scheme and uses an efficient communication mechanism to move data among the processors. The Hybrid Distribution algorithm further improves upon the Intelligent Data Distribution algorithm by dynamically partitioning the candidate set to maintain good load balance. The experimental results on a Cray T3D parallel computer show that the Hybrid Distribution algorithm scales linearly, exploits the aggregate memory better, and can generate more association rules with a single scan of the database per pass.

AB - One of the important problems in data mining is discovering association rules from databases of transactions where each transaction consists of a set of items. The most time-consuming operation in this discovery process is the computation of the frequency of the occurrences of interesting subsets of items (called candidates) in the database of transactions. To prune the exponentially large space of candidates, most existing algorithms consider only those candidates that have a user-defined minimum support. Even with the pruning, the task of finding all association rules requires a lot of computation power and time. Parallel computers offer a potential solution to the computation requirement of this task, provided efficient and scalable parallel algorithms can be designed. In this paper, we present two new parallel algorithms for mining association rules. The Intelligent Data Distribution algorithm efficiently uses the aggregate memory of the parallel computer by employing an intelligent candidate partitioning scheme and uses an efficient communication mechanism to move data among the processors. The Hybrid Distribution algorithm further improves upon the Intelligent Data Distribution algorithm by dynamically partitioning the candidate set to maintain good load balance. The experimental results on a Cray T3D parallel computer show that the Hybrid Distribution algorithm scales linearly, exploits the aggregate memory better, and can generate more association rules with a single scan of the database per pass.

UR - http://www.scopus.com/inward/record.url?scp=0031165409&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0031165409&partnerID=8YFLogxK

U2 - 10.1145/253262.253330

DO - 10.1145/253262.253330

M3 - Article

AN - SCOPUS:0031165409

SN - 0163-5808

VL - 26

SP - 277

EP - 288

JO - SIGMOD Record (ACM Special Interest Group on Management of Data)

JF - SIGMOD Record (ACM Special Interest Group on Management of Data)

IS - 2

ER -
