TY - JOUR
T1 - Fast attribute-based table clustering using Predicate-Trees
T2 - A vertical data mining approach
AU - Roy, Arjun G.
AU - Chatterjee, Arijit
AU - Hossain, Mohammad K.
AU - Perrizo, William
PY - 2012
Y1 - 2012
N2 - With technological advancements, massive amounts of data are being collected in various domains. For instance, since the advent of digital image technology and remote sensing imagery (RSI), NASA and the U.S. Geological Survey, through the Landsat Data Continuity Mission, have been capturing images of Earth at resolutions down to 15 meters. Likewise, on the Internet, the growth of social media, blogs, and other Web sites generates exponential amounts of textual data on a daily basis. Because clustering data is time-consuming, much of this data is archived before it is properly analyzed. In this paper, we propose two novel and extremely fast algorithms: imgFAUST (Fast Attribute-based Unsupervised and Supervised Table Clustering) for images, and a variation called docFAUST for textual data. Both algorithms are based on Predicate-Trees, which are compressed, lossless, data-mining-ready data structures. Without compromising much on accuracy, our algorithms are fast and can be used effectively for high-speed analysis of image data and documents.
AB - With technological advancements, massive amounts of data are being collected in various domains. For instance, since the advent of digital image technology and remote sensing imagery (RSI), NASA and the U.S. Geological Survey, through the Landsat Data Continuity Mission, have been capturing images of Earth at resolutions down to 15 meters. Likewise, on the Internet, the growth of social media, blogs, and other Web sites generates exponential amounts of textual data on a daily basis. Because clustering data is time-consuming, much of this data is archived before it is properly analyzed. In this paper, we propose two novel and extremely fast algorithms: imgFAUST (Fast Attribute-based Unsupervised and Supervised Table Clustering) for images, and a variation called docFAUST for textual data. Both algorithms are based on Predicate-Trees, which are compressed, lossless, data-mining-ready data structures. Without compromising much on accuracy, our algorithms are fast and can be used effectively for high-speed analysis of image data and documents.
KW - Data mining
KW - Predicate Trees
KW - big data
KW - vertical data processing
UR - http://www.scopus.com/inward/record.url?scp=84872355324&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84872355324&partnerID=8YFLogxK
U2 - 10.3233/JCM-2012-0444
DO - 10.3233/JCM-2012-0444
M3 - Article
AN - SCOPUS:84872355324
SN - 1472-7978
VL - 12
SP - 139
EP - 146
JO - Journal of Computational Methods in Sciences and Engineering
JF - Journal of Computational Methods in Sciences and Engineering
IS - SUPPL. 1
ER -