Mining low-support discriminative patterns from dense and high-dimensional data

Gang Fang; Gaurav Pandey; Wen Wang; Manish Gupta; Michael Steinbach; Vipin Kumar

doi:10.1109/TKDE.2010.241

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

36 Scopus citations

Abstract

Discriminative patterns can provide valuable insights into data sets with class labels that may not be available from the individual features or the predictive models built using them. Most existing approaches work efficiently for sparse or low-dimensional data sets. However, for dense and high-dimensional data sets, they have to use high thresholds to produce the complete results within limited time, and thus may miss interesting low-support patterns. In this paper, we address the necessity of trading off the completeness of discriminative pattern discovery with the efficient discovery of low-support discriminative patterns from such data sets. We propose a family of antimonotonic measures named SupMaxK that organize the set of discriminative patterns into nested layers of subsets, which are progressively more complete in their coverage, but require increasingly more computation. In particular, the member of SupMaxK with K = 2, named SupMaxPair, is suitable for dense and high-dimensional data sets. Experiments on both synthetic data sets and a cancer gene expression data set demonstrate that there are low-support patterns that can be discovered using SupMaxPair but not by existing approaches. Furthermore, we show that the low-support discriminative patterns that are only discovered using SupMaxPair from the cancer gene expression data set are statistically significant and biologically relevant. This illustrates the complementarity of SupMaxPair to existing approaches for discriminative pattern discovery. The code and data set for this paper are available at http://vk.cs.umn.edu/SMP/.
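The abstract names the SupMaxK family but does not state its formula. As a rough illustration only, the Python sketch below assumes that SupMaxK of a pattern is its relative support in the positive class minus the largest negative-class relative support over all size-K subsets of the pattern, and that the usual support difference (DiffSup) is the reference measure. The function names, the toy transactions POS and NEG, and the formula itself are assumptions made for this example, not taken from this record.

```python
from itertools import combinations

def rel_support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def diff_sup(itemset, pos, neg):
    """Support difference between the positive and negative class."""
    return rel_support(itemset, pos) - rel_support(itemset, neg)

def sup_max_k(itemset, pos, neg, k=2):
    """Assumed form of SupMaxK: positive-class support of the pattern minus
    the largest negative-class support among its size-k subsets."""
    items = sorted(itemset)
    if len(items) < k:
        raise ValueError("pattern must contain at least k items")
    max_neg = max(rel_support(sub, neg) for sub in combinations(items, k))
    return rel_support(items, pos) - max_neg

# Hypothetical toy data: each transaction is the set of items present.
POS = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]
NEG = [{"a", "b"}, {"c"}, {"a"}, {"b"}]

pair_score = sup_max_k({"a", "b"}, POS, NEG, k=2)        # K = 2 on a pair
triple_score = sup_max_k({"a", "b", "c"}, POS, NEG, k=2)  # K = 2 on a superset
triple_k3 = sup_max_k({"a", "b", "c"}, POS, NEG, k=3)     # K = 3 on the same pattern
triple_diff = diff_sup({"a", "b", "c"}, POS, NEG)
```

Under this assumed form, the score of {a, b, c} with K = 2 cannot exceed that of its subset {a, b} (the antimonotonic behavior the abstract mentions), and it stays at or below DiffSup. Raising K to 3 moves the score for {a, b, c} up to DiffSup on this toy data, which is one way to read the "nested layers" of coverage at a higher computational cost.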

Original language	English (US)
Article number	5645630
Pages (from-to)	279-294
Number of pages	16
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	24
Issue number	2
DOIs	https://doi.org/10.1109/TKDE.2010.241
State	Published - 2012

Bibliographical note

Funding Information:
The authors would like to thank the anonymous reviewers for their constructive comments. This work was supported by US National Science Foundation (NSF) grants #IIS0916439 and #CRI-0551551 and by a University of Minnesota Rochester Biomedical Informatics and Computational Biology Program Traineeship Award. Access to computing facilities was provided by the Minnesota Supercomputing Institute.

Keywords

Association analysis
biomarker discovery
discriminative pattern mining
permutation test

Cite this

@article{3f034bd03a8e4607b0d3d93b42c38091,

title = "Mining low-support discriminative patterns from dense and high-dimensional data",

abstract = "Discriminative patterns can provide valuable insights into data sets with class labels that may not be available from the individual features or the predictive models built using them. Most existing approaches work efficiently for sparse or low-dimensional data sets. However, for dense and high-dimensional data sets, they have to use high thresholds to produce the complete results within limited time, and thus may miss interesting low-support patterns. In this paper, we address the necessity of trading off the completeness of discriminative pattern discovery with the efficient discovery of low-support discriminative patterns from such data sets. We propose a family of antimonotonic measures named SupMaxK that organize the set of discriminative patterns into nested layers of subsets, which are progressively more complete in their coverage, but require increasingly more computation. In particular, the member of SupMaxK with K = 2, named SupMaxPair, is suitable for dense and high-dimensional data sets. Experiments on both synthetic data sets and a cancer gene expression data set demonstrate that there are low-support patterns that can be discovered using SupMaxPair but not by existing approaches. Furthermore, we show that the low-support discriminative patterns that are only discovered using SupMaxPair from the cancer gene expression data set are statistically significant and biologically relevant. This illustrates the complementarity of SupMaxPair to existing approaches for discriminative pattern discovery. The code and data set for this paper are available at http://vk.cs.umn.edu/SMP/.",

keywords = "Association analysis, biomarker discovery, discriminative pattern mining, permutation test",

author = "Gang Fang and Gaurav Pandey and Wen Wang and Manish Gupta and Michael Steinbach and Vipin Kumar",

note = "Funding Information: The authors would like to thank the anonymous reviewers for their constructive comments. This work was supported by US National Science Foundation (NSF) grants \#IIS0916439 and \#CRI-0551551 and by a University of Minnesota Rochester Biomedical Informatics and Computational Biology Program Traineeship Award. Access to computing facilities was provided by the Minnesota Supercomputing Institute.",

year = "2012",

doi = "10.1109/TKDE.2010.241",

language = "English (US)",

volume = "24",

pages = "279--294",

journal = "IEEE Transactions on Knowledge and Data Engineering",

issn = "1041-4347",

publisher = "IEEE Computer Society",

number = "2",

}

TY - JOUR

T1 - Mining low-support discriminative patterns from dense and high-dimensional data

AU - Fang, Gang

AU - Pandey, Gaurav

AU - Wang, Wen

AU - Gupta, Manish

AU - Steinbach, Michael

AU - Kumar, Vipin

N1 - Funding Information: The authors would like to thank the anonymous reviewers for their constructive comments. This work was supported by US National Science Foundation (NSF) grants #IIS0916439 and #CRI-0551551 and by a University of Minnesota Rochester Biomedical Informatics and Computational Biology Program Traineeship Award. Access to computing facilities was provided by the Minnesota Supercomputing Institute.

PY - 2012

Y1 - 2012

N2 - Discriminative patterns can provide valuable insights into data sets with class labels that may not be available from the individual features or the predictive models built using them. Most existing approaches work efficiently for sparse or low-dimensional data sets. However, for dense and high-dimensional data sets, they have to use high thresholds to produce the complete results within limited time, and thus may miss interesting low-support patterns. In this paper, we address the necessity of trading off the completeness of discriminative pattern discovery with the efficient discovery of low-support discriminative patterns from such data sets. We propose a family of antimonotonic measures named SupMaxK that organize the set of discriminative patterns into nested layers of subsets, which are progressively more complete in their coverage, but require increasingly more computation. In particular, the member of SupMaxK with K = 2, named SupMaxPair, is suitable for dense and high-dimensional data sets. Experiments on both synthetic data sets and a cancer gene expression data set demonstrate that there are low-support patterns that can be discovered using SupMaxPair but not by existing approaches. Furthermore, we show that the low-support discriminative patterns that are only discovered using SupMaxPair from the cancer gene expression data set are statistically significant and biologically relevant. This illustrates the complementarity of SupMaxPair to existing approaches for discriminative pattern discovery. The code and data set for this paper are available at http://vk.cs.umn.edu/SMP/.

AB - Discriminative patterns can provide valuable insights into data sets with class labels that may not be available from the individual features or the predictive models built using them. Most existing approaches work efficiently for sparse or low-dimensional data sets. However, for dense and high-dimensional data sets, they have to use high thresholds to produce the complete results within limited time, and thus may miss interesting low-support patterns. In this paper, we address the necessity of trading off the completeness of discriminative pattern discovery with the efficient discovery of low-support discriminative patterns from such data sets. We propose a family of antimonotonic measures named SupMaxK that organize the set of discriminative patterns into nested layers of subsets, which are progressively more complete in their coverage, but require increasingly more computation. In particular, the member of SupMaxK with K = 2, named SupMaxPair, is suitable for dense and high-dimensional data sets. Experiments on both synthetic data sets and a cancer gene expression data set demonstrate that there are low-support patterns that can be discovered using SupMaxPair but not by existing approaches. Furthermore, we show that the low-support discriminative patterns that are only discovered using SupMaxPair from the cancer gene expression data set are statistically significant and biologically relevant. This illustrates the complementarity of SupMaxPair to existing approaches for discriminative pattern discovery. The code and data set for this paper are available at http://vk.cs.umn.edu/SMP/.

KW - Association analysis

KW - biomarker discovery

KW - discriminative pattern mining

KW - permutation test

UR - http://www.scopus.com/inward/record.url?scp=84555170179&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84555170179&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2010.241

DO - 10.1109/TKDE.2010.241

M3 - Article

AN - SCOPUS:84555170179

SN - 1041-4347

VL - 24

SP - 279

EP - 294

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

IS - 2

M1 - 5645630

ER -
