TY - JOUR
T1 - The feasible solution algorithm for the minimum covariance determinant estimator in multivariate data
AU - Hawkins, Douglas M.
PY - 1994/2
Y1 - 1994/2
N2 - Outliers in multivariate data present a particularly intractable problem. There are various criteria for their identification, and for the related problem of high breakdown robust estimation of multivariate location and scale, but to date no reliable algorithm has been proposed for fitting by these criteria. The most widely used multivariate high breakdown method is the minimum volume ellipsoid (MVE) approach which seeks the ellipsoid of minimum volume containing at least half of the data. The minimum covariance determinant (MCD) approach uses instead the sample mean vector and covariance matrix of a fixed-size subset of the data, the cases in the subset being chosen to minimize the determinant. The MCD approach has better theoretical properties than the MVE, but appears not to be widely used at present, partly no doubt for the lack of an accepted algorithm for finding it. In this paper, we present an algorithm for obtaining the MCD estimator for a given data set. The algorithm is probabilistic; it involves taking random starting 'trial solutions' and refining each to a local optimum satisfying the necessary condition for the MCD optimum. The method's convergence is guaranteed only asymptotically as the number of random starting values goes to infinity. We demonstrate however using simulated data and examples of real data from the literature that its practical performance is excellent: that in these data sets it reaches the global optimum with high probability using a very modest number of random starting points. This gives reason to expect that it will handle other real data sets equally effectively, providing a easily computed high breakdown multivariate estimator for use, both in its own right, and as a starting point for other robust procedures.
AB - Outliers in multivariate data present a particularly intractable problem. There are various criteria for their identification, and for the related problem of high breakdown robust estimation of multivariate location and scale, but to date no reliable algorithm has been proposed for fitting by these criteria. The most widely used multivariate high breakdown method is the minimum volume ellipsoid (MVE) approach which seeks the ellipsoid of minimum volume containing at least half of the data. The minimum covariance determinant (MCD) approach uses instead the sample mean vector and covariance matrix of a fixed-size subset of the data, the cases in the subset being chosen to minimize the determinant. The MCD approach has better theoretical properties than the MVE, but appears not to be widely used at present, partly no doubt for the lack of an accepted algorithm for finding it. In this paper, we present an algorithm for obtaining the MCD estimator for a given data set. The algorithm is probabilistic; it involves taking random starting 'trial solutions' and refining each to a local optimum satisfying the necessary condition for the MCD optimum. The method's convergence is guaranteed only asymptotically as the number of random starting values goes to infinity. We demonstrate however using simulated data and examples of real data from the literature that its practical performance is excellent: that in these data sets it reaches the global optimum with high probability using a very modest number of random starting points. This gives reason to expect that it will handle other real data sets equally effectively, providing a easily computed high breakdown multivariate estimator for use, both in its own right, and as a starting point for other robust procedures.
KW - High breakdown estimation
KW - Multivariate
KW - Outliers
UR - http://www.scopus.com/inward/record.url?scp=0000559699&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0000559699&partnerID=8YFLogxK
U2 - 10.1016/0167-9473(92)00071-X
DO - 10.1016/0167-9473(92)00071-X
M3 - Article
AN - SCOPUS:0000559699
SN - 0167-9473
VL - 17
SP - 197
EP - 210
JO - Computational Statistics and Data Analysis
JF - Computational Statistics and Data Analysis
IS - 2
ER -