Estimating false negatives for classification problems with cluster structure

György J. Simon; Vipin Kumar; Zhi Li Zhang

doi:10.1137/1.9781611972771.68

Estimating false negatives for classification problems with cluster structure

György J. Simon, Vipin Kumar, Zhi Li Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Estimating the number of false negatives for a classifier when the true outcome of the classification is ascertained only for a limited number of instances is an important problem, with a wide range of applications from epidemiology to computer/network security. The frequently applied method is random sampling. However, when the target (positive) class of the classification is rare, which is often the case with network intrusions and diseases, this simple method results in excessive sampling. In this paper, we propose an approach that exploits the cluster structure of the data to significantly reduce the amount of sampling needed while guaranteeing an estimation accuracy specified by the user. The basic idea is to cluster the data and divide the clusters into a set of "strata", such that the proportion of positive instances in the stratum is very low, very high or in between, respectively. By taking advantage of the different characteristics of the strata, more efficient estimation strategies can be applied, thereby significantly reducing the amount of required sampling. We also develop a computationally efficient clustering algorithm - referred to as class-focused partitioning - which uses the (imperfect) labels predicted by the classifier as additional guidance. We evaluated our method on the KDDCup network intrusion data set. Our method achieved better precision and accuracy with a 5% sample than the best trial of simple random sampling with 40% samples.

Original language	English (US)
Title of host publication	Proceedings of the 7th SIAM International Conference on Data Mining
Publisher	Society for Industrial and Applied Mathematics Publications
Pages	599-604
Number of pages	6
ISBN (Print)	9780898716306
DOIs	https://doi.org/10.1137/1.9781611972771.68
State	Published - 2007
Event	7th SIAM International Conference on Data Mining - Minneapolis, MN, United States Duration: Apr 26 2007 → Apr 28 2007

Publication series

Name	Proceedings of the 7th SIAM International Conference on Data Mining

Other

Other	7th SIAM International Conference on Data Mining
Country/Territory	United States
City	Minneapolis, MN
Period	4/26/07 → 4/28/07

Keywords

Clustering
False negatives estimation
Random sampling

Access

10.1137/1.9781611972771.68

OpenUrl availability

Full text

Cite this

Simon, G. J., Kumar, V., & Zhang, Z. L. (2007). Estimating false negatives for classification problems with cluster structure. In Proceedings of the 7th SIAM International Conference on Data Mining (pp. 599-604). (Proceedings of the 7th SIAM International Conference on Data Mining). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611972771.68

Estimating false negatives for classification problems with cluster structure. / Simon, György J.; Kumar, Vipin ; Zhang, Zhi Li.
Proceedings of the 7th SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics Publications, 2007. p. 599-604 (Proceedings of the 7th SIAM International Conference on Data Mining).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Simon, GJ , Kumar, V & Zhang, ZL 2007, Estimating false negatives for classification problems with cluster structure. in Proceedings of the 7th SIAM International Conference on Data Mining. Proceedings of the 7th SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics Publications, pp. 599-604, 7th SIAM International Conference on Data Mining, Minneapolis, MN, United States, 4/26/07. https://doi.org/10.1137/1.9781611972771.68

@inproceedings{4d96960d142d453a806f3f96171013b2,

title = "Estimating false negatives for classification problems with cluster structure",

abstract = "Estimating the number of false negatives for a classifier when the true outcome of the classification is ascertained only for a limited number of instances is an important problem, with a wide range of applications from epidemiology to computer/network security. The frequently applied method is random sampling. However, when the target (positive) class of the classification is rare, which is often the case with network intrusions and diseases, this simple method results in excessive sampling. In this paper, we propose an approach that exploits the cluster structure of the data to significantly reduce the amount of sampling needed while guaranteeing an estimation accuracy specified by the user. The basic idea is to cluster the data and divide the clusters into a set of {"}strata{"}, such that the proportion of positive instances in the stratum is very low, very high or in between, respectively. By taking advantage of the different characteristics of the strata, more efficient estimation strategies can be applied, thereby significantly reducing the amount of required sampling. We also develop a computationally efficient clustering algorithm - referred to as class-focused partitioning - which uses the (imperfect) labels predicted by the classifier as additional guidance. We evaluated our method on the KDDCup network intrusion data set. Our method achieved better precision and accuracy with a 5% sample than the best trial of simple random sampling with 40% samples.",

keywords = "Clustering, False negatives estimation, Random sampling",

author = "Simon, {Gy{\"o}rgy J.} and Vipin Kumar and Zhang, {Zhi Li}",

year = "2007",

doi = "10.1137/1.9781611972771.68",

language = "English (US)",

isbn = "9780898716306",

series = "Proceedings of the 7th SIAM International Conference on Data Mining",

publisher = "Society for Industrial and Applied Mathematics Publications",

pages = "599--604",

booktitle = "Proceedings of the 7th SIAM International Conference on Data Mining",

note = "7th SIAM International Conference on Data Mining ; Conference date: 26-04-2007 Through 28-04-2007",

}

TY - GEN

T1 - Estimating false negatives for classification problems with cluster structure

AU - Simon, György J.

AU - Kumar, Vipin

AU - Zhang, Zhi Li

PY - 2007

Y1 - 2007

N2 - Estimating the number of false negatives for a classifier when the true outcome of the classification is ascertained only for a limited number of instances is an important problem, with a wide range of applications from epidemiology to computer/network security. The frequently applied method is random sampling. However, when the target (positive) class of the classification is rare, which is often the case with network intrusions and diseases, this simple method results in excessive sampling. In this paper, we propose an approach that exploits the cluster structure of the data to significantly reduce the amount of sampling needed while guaranteeing an estimation accuracy specified by the user. The basic idea is to cluster the data and divide the clusters into a set of "strata", such that the proportion of positive instances in the stratum is very low, very high or in between, respectively. By taking advantage of the different characteristics of the strata, more efficient estimation strategies can be applied, thereby significantly reducing the amount of required sampling. We also develop a computationally efficient clustering algorithm - referred to as class-focused partitioning - which uses the (imperfect) labels predicted by the classifier as additional guidance. We evaluated our method on the KDDCup network intrusion data set. Our method achieved better precision and accuracy with a 5% sample than the best trial of simple random sampling with 40% samples.

AB - Estimating the number of false negatives for a classifier when the true outcome of the classification is ascertained only for a limited number of instances is an important problem, with a wide range of applications from epidemiology to computer/network security. The frequently applied method is random sampling. However, when the target (positive) class of the classification is rare, which is often the case with network intrusions and diseases, this simple method results in excessive sampling. In this paper, we propose an approach that exploits the cluster structure of the data to significantly reduce the amount of sampling needed while guaranteeing an estimation accuracy specified by the user. The basic idea is to cluster the data and divide the clusters into a set of "strata", such that the proportion of positive instances in the stratum is very low, very high or in between, respectively. By taking advantage of the different characteristics of the strata, more efficient estimation strategies can be applied, thereby significantly reducing the amount of required sampling. We also develop a computationally efficient clustering algorithm - referred to as class-focused partitioning - which uses the (imperfect) labels predicted by the classifier as additional guidance. We evaluated our method on the KDDCup network intrusion data set. Our method achieved better precision and accuracy with a 5% sample than the best trial of simple random sampling with 40% samples.

KW - Clustering

KW - False negatives estimation

KW - Random sampling

UR - http://www.scopus.com/inward/record.url?scp=70449106940&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70449106940&partnerID=8YFLogxK

U2 - 10.1137/1.9781611972771.68

DO - 10.1137/1.9781611972771.68

M3 - Conference contribution

AN - SCOPUS:70449106940

SN - 9780898716306

T3 - Proceedings of the 7th SIAM International Conference on Data Mining

SP - 599

EP - 604

BT - Proceedings of the 7th SIAM International Conference on Data Mining

PB - Society for Industrial and Applied Mathematics Publications

T2 - 7th SIAM International Conference on Data Mining

Y2 - 26 April 2007 through 28 April 2007

ER -

Estimating false negatives for classification problems with cluster structure

Abstract

Publication series

Other

Keywords

Access

OpenUrl availability

Other files and links

Fingerprint

Cite this