Outliers in multivariate data present a particularly intractable problem. There are various criteria for their identification, and for the related problem of high breakdown robust estimation of multivariate location and scale, but to date no reliable algorithm has been proposed for fitting by these criteria. The most widely used multivariate high breakdown method is the minimum volume ellipsoid (MVE) approach which seeks the ellipsoid of minimum volume containing at least half of the data. The minimum covariance determinant (MCD) approach uses instead the sample mean vector and covariance matrix of a fixed-size subset of the data, the cases in the subset being chosen to minimize the determinant. The MCD approach has better theoretical properties than the MVE, but appears not to be widely used at present, partly no doubt for the lack of an accepted algorithm for finding it. In this paper, we present an algorithm for obtaining the MCD estimator for a given data set. The algorithm is probabilistic; it involves taking random starting 'trial solutions' and refining each to a local optimum satisfying the necessary condition for the MCD optimum. The method's convergence is guaranteed only asymptotically as the number of random starting values goes to infinity. We demonstrate however using simulated data and examples of real data from the literature that its practical performance is excellent: that in these data sets it reaches the global optimum with high probability using a very modest number of random starting points. This gives reason to expect that it will handle other real data sets equally effectively, providing a easily computed high breakdown multivariate estimator for use, both in its own right, and as a starting point for other robust procedures.
- High breakdown estimation