Fast attribute-based table clustering using Predicate-Trees: A vertical data mining approach

Arjun G. Roy; Arijit Chatterjee; Mohammad K. Hossain; William Perrizo

doi:10.3233/JCM-2012-0444

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

With technological advancements, massive amounts of data are being collected in various domains. For instance, since the advent of digital image technology and remote sensing imagery (RSI), NASA and the U.S. Geological Survey, through the Landsat Data Continuity Mission, have been capturing images of Earth at resolutions down to 15 meters. Likewise, on the Internet, the growth of social media, blog Web sites, and similar sources generates an exponentially growing amount of textual data every day. Since clustering of data is time-consuming, much of this data is archived before it is properly analyzed. In this paper, we propose two novel and extremely fast algorithms: imgFAUST (Fast Attribute-based Unsupervised and Supervised Table Clustering) for images, and a variation called docFAUST for textual data. Both algorithms are based on Predicate-Trees, which are compressed, lossless, data-mining-ready data structures. Without compromising much on accuracy, our algorithms are fast and can be used effectively for high-speed analysis of image data and documents.

Original language	English (US)
Pages (from-to)	139-146
Number of pages	8
Journal	Journal of Computational Methods in Sciences and Engineering
Volume	12
Issue number	SUPPL. 1
DOIs	https://doi.org/10.3233/JCM-2012-0444
State	Published - 2012
Externally published	Yes

Keywords

Data mining
Predicate Trees
big data
vertical data processing

Cite this

@article{5657c646d2c145759f2b95aba0e7be6f,

title = "Fast attribute-based table clustering using {Predicate-Trees}: A vertical data mining approach",

abstract = "With technological advancements, massive amounts of data are being collected in various domains. For instance, since the advent of digital image technology and remote sensing imagery (RSI), NASA and the U.S. Geological Survey, through the Landsat Data Continuity Mission, have been capturing images of Earth at resolutions down to 15 meters. Likewise, on the Internet, the growth of social media, blog Web sites, and similar sources generates an exponentially growing amount of textual data every day. Since clustering of data is time-consuming, much of this data is archived before it is properly analyzed. In this paper, we propose two novel and extremely fast algorithms: imgFAUST (Fast Attribute-based Unsupervised and Supervised Table Clustering) for images, and a variation called docFAUST for textual data. Both algorithms are based on Predicate-Trees, which are compressed, lossless, data-mining-ready data structures. Without compromising much on accuracy, our algorithms are fast and can be used effectively for high-speed analysis of image data and documents.",

keywords = "Data mining, Predicate Trees, big data, vertical data processing",

author = "Roy, Arjun G. and Chatterjee, Arijit and Hossain, Mohammad K. and Perrizo, William",

year = "2012",

doi = "10.3233/JCM-2012-0444",

language = "English (US)",

volume = "12",

pages = "139--146",

journal = "Journal of Computational Methods in Sciences and Engineering",

issn = "1472-7978",

publisher = "IOS Press",

number = "SUPPL. 1",

}

TY - JOUR

T1 - Fast attribute-based table clustering using Predicate-Trees

T2 - A vertical data mining approach

AU - Roy, Arjun G.

AU - Chatterjee, Arijit

AU - Hossain, Mohammad K.

AU - Perrizo, William

PY - 2012

Y1 - 2012

N2 - With technological advancements, massive amounts of data are being collected in various domains. For instance, since the advent of digital image technology and remote sensing imagery (RSI), NASA and the U.S. Geological Survey, through the Landsat Data Continuity Mission, have been capturing images of Earth at resolutions down to 15 meters. Likewise, on the Internet, the growth of social media, blog Web sites, and similar sources generates an exponentially growing amount of textual data every day. Since clustering of data is time-consuming, much of this data is archived before it is properly analyzed. In this paper, we propose two novel and extremely fast algorithms: imgFAUST (Fast Attribute-based Unsupervised and Supervised Table Clustering) for images, and a variation called docFAUST for textual data. Both algorithms are based on Predicate-Trees, which are compressed, lossless, data-mining-ready data structures. Without compromising much on accuracy, our algorithms are fast and can be used effectively for high-speed analysis of image data and documents.

KW - Data mining

KW - Predicate Trees

KW - big data

KW - vertical data processing

UR - http://www.scopus.com/inward/record.url?scp=84872355324&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84872355324&partnerID=8YFLogxK

U2 - 10.3233/JCM-2012-0444

DO - 10.3233/JCM-2012-0444

M3 - Article

AN - SCOPUS:84872355324

SN - 1472-7978

VL - 12

SP - 139

EP - 146

JO - Journal of Computational Methods in Sciences and Engineering

JF - Journal of Computational Methods in Sciences and Engineering

IS - SUPPL. 1

ER -
