Mining needles in a haystack: Classifying rare classes via two-phase rule induction

Mahesh V. Joshi, Ramesh C. Agarwal, Vipin Kumar

Research output: Contribution to journal › Article › peer-review

60 Scopus citations

Abstract

Learning models to classify rarely occurring target classes is an important problem with applications in network intrusion detection, fraud detection, or deviation detection in general. In this paper, we analyze our previously proposed two-phase rule induction method in the context of learning complete and precise signatures of rare classes. The key feature of our method is that it separately conquers the objectives of achieving high recall and high precision for the given target class. The first phase of the method aims for high recall by inducing rules with high support and a reasonable level of accuracy. The second phase then tries to improve the precision by learning rules to remove false positives in the collection of the records covered by the first phase rules. Existing sequential covering techniques try to achieve high precision for each individual disjunct learned. In this paper, we claim that such an approach is inadequate for rare classes, because of two problems: splintered false positives and error-prone small disjuncts. Motivated by the strengths of our two-phase design, we design various synthetic data models to identify and analyze the situations in which two state-of-the-art methods, RIPPER and C4.5rules, either fail to learn a model or learn a very poor model. In all these situations, our two-phase approach learns a model with significantly better recall and precision levels. We also present a comparison of the three methods on a challenging real-life network intrusion detection dataset. Our method is significantly better than or comparable to the best competitor in terms of achieving better balance between recall and precision.
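
The two-phase idea in the abstract can be illustrated with a toy sketch. This is not the authors' induction algorithm: the rules below are hand-written rather than learned, and the field names (`failed_logins`, `bytes`), the thresholds, and the eight records are all hypothetical. It only shows how a first set of rules can be tuned for recall while a second set removes false positives from the records the first set covers.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Rule = Callable[[Record], bool]


def recall_precision(records: List[Record], labels: List[bool],
                     predict: Callable[[Record], bool]) -> Tuple[float, float]:
    """Return (recall, precision) of `predict` for the target class."""
    tp = fp = fn = 0
    for record, is_target in zip(records, labels):
        predicted = predict(record)
        if predicted and is_target:
            tp += 1
        elif predicted and not is_target:
            fp += 1
        elif not predicted and is_target:
            fn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def two_phase_predict(record: Record, p_rules: List[Rule],
                      n_rules: List[Rule]) -> bool:
    """Phase 1 rules claim the record; phase 2 rules may veto the claim."""
    if not any(rule(record) for rule in p_rules):
        return False  # not covered by any first-phase rule
    return not any(rule(record) for rule in n_rules)


# Hypothetical toy data: three rare-class records among eight.
records = [
    {"failed_logins": 5, "bytes": 100},
    {"failed_logins": 6, "bytes": 120},
    {"failed_logins": 4, "bytes": 90},
    {"failed_logins": 5, "bytes": 5000},
    {"failed_logins": 7, "bytes": 8000},
    {"failed_logins": 0, "bytes": 100},
    {"failed_logins": 1, "bytes": 200},
    {"failed_logins": 0, "bytes": 7000},
]
labels = [True, True, True, False, False, False, False, False]

# Phase 1 (hand-written, assumed): broad rule with high support.
p_rules: List[Rule] = [lambda r: r["failed_logins"] >= 4]
# Phase 2 (hand-written, assumed): rule that flags false positives
# among the records covered by phase 1.
n_rules: List[Rule] = [lambda r: r["bytes"] > 1000]

phase1 = recall_precision(records, labels,
                          lambda r: any(rule(r) for rule in p_rules))
phase2 = recall_precision(records, labels,
                          lambda r: two_phase_predict(r, p_rules, n_rules))
print("phase 1 only (recall, precision):", phase1)
print("two-phase    (recall, precision):", phase2)
```

In this contrived data the first-phase rule alone covers every target record but also two non-target records, and the second-phase rule removes those two without losing any target record. Real data would rarely separate this cleanly.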

Original language: English (US)
Pages (from-to): 91-102
Number of pages: 12
Journal: SIGMOD Record (ACM Special Interest Group on Management of Data)
Volume: 30
Issue number: 2
State: Published - Jun 2001
Event: 2001 ACM SIGMOD International Conference on Management of Data - Santa Barbara, CA, United States
Duration: May 21 2001 → May 24 2001
