Enhancing data analysis with noise removal

Hui Xiong, Gaurav Pandey, Michael S Steinbach, Vipin Kumar

Research output: Contribution to journal › Article › peer-review

175 Scopus citations

Abstract

Removing objects that are noise is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the product of low-level data errors that result from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered as noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amounts of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of these methods are based on traditional outlier detection techniques: distance-based, clustering-based, and an approach based on the Local Outlier Factor (LOF) of an object. The other technique, which is a new method that we are proposing, is a hyperclique-based data cleaner (HCleaner). These techniques are evaluated in terms of their impact on the subsequent data analysis, specifically, clustering and association analysis. Our experimental results show that all of these methods can provide better clustering performance and higher quality association patterns as the amount of noise being removed increases, although HCleaner generally leads to better clustering performance and higher quality associations than the other three methods for binary data.
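As an illustration of the distance-based family mentioned in the abstract, the following is a minimal sketch and not the paper's implementation. It assumes that an object's noise score is its Euclidean distance to its k-th nearest neighbor, and that a user-chosen fraction of the highest-scoring objects is discarded. The function names, the parameters `k` and `fraction`, and the toy data are all hypothetical.

```python
import math


def knn_distance_scores(points, k):
    """Score each point by its distance to its k-th nearest neighbor.

    Assumption: a larger k-NN distance means the point is more likely noise.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def remove_noise(points, k=2, fraction=0.25):
    """Drop the given fraction of points with the largest k-NN distance."""
    scores = knn_distance_scores(points, k)
    n_remove = int(len(points) * fraction)
    # Rank indices from most to least outlying.
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_remove])
    return [p for i, p in enumerate(points) if i not in dropped]


# Hypothetical toy data: two tight groups plus one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (9.0, -4.0)]
cleaned = remove_noise(data, k=2, fraction=0.125)
```

Raising `fraction` discards more of the data, which mirrors the abstract's point that a cleaner may need to remove a potentially large share of the objects. The brute-force neighbor search is quadratic in the number of points and is only meant for small examples.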

Original language: English (US)
Pages (from-to): 304-319
Number of pages: 16
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 18
Issue number: 3
State: Published - Mar 2006

Bibliographical note

Funding Information:
This work was partially supported by US NSF grant # IIS-0308264, US NSF grant # ACI-0325949, and by Army High Performance Computing Research Center under the auspices of the US Department of the Army, Army Research Laboratory cooperative agreement number DAAD19-01-2-0014. The content of this work does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputing Institute.

Keywords

  • Data cleaning
  • Hyperclique pattern discovery
  • Local outlier factor (LOF)
  • Noise removal
  • Very noisy data
