Variance estimation by multivariate imputation methods in complex survey designs

Jong Min Kim; Kee Jae Lee; Wonkuk Kim

doi:10.3233/MAS-170394

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

In this paper, we consider variance estimation of the sample mean when the missing data have been imputed with multivariate imputation methods. Modern multivariate imputation methods for missing data are complicated and computationally expensive. These multivariate imputation methods do not require the normality assumption to impute the missing values. Under this assumption-free condition, we compare the variance estimation performance of six modern multivariate imputation methods, including copula imputation, random forest imputation, principal component analysis imputation, and k-nearest neighbors imputation, in complex sampling designs such as stratified sampling and cluster sampling, and with resampling approaches to variance estimation by jackknife and bootstrap methods in stratified sampling. We conducted simulation studies using National Health and Nutrition Survey data, considering 5% and 15% missing completely at random (MCAR) rates. Based on the mean squared errors of the sample mean from our simulation study with 500 resamples in complex survey designs, the random forest (RF) imputation method appears to outperform other imputation methods overall in terms of percent relative efficiency (RE(%)) when the data have high skewness at the 5% missing rate and when the data have high excess kurtosis at the 15% missing rate, whereas the principal component analysis (PCA) imputation method appears to outperform other imputation methods when the data have high skewness at the 5% and 15% missing rates. In particular, the multivariate imputation methods appear to be efficient in terms of RE(%) in the cluster sampling design when the data have high skewness or excess kurtosis at the 15% missing rate.
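
The resampling idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the data are synthetic lognormal draws rather than survey data, the stratum weights are assumed, and simple mean imputation stands in for the multivariate imputation methods the paper compares. All function names below are hypothetical. The sketch imposes a 15% MCAR pattern, imputes, and then estimates the variance of the stratified sample mean with a delete-one jackknife and with a bootstrap of 500 resamples drawn within strata.

```python
import random
import statistics


def stratified_mean(strata, weights):
    """Weighted sum of the per-stratum sample means."""
    return sum(w * statistics.fmean(ys) for ys, w in zip(strata, weights))


def jackknife_variance(strata, weights):
    """Delete-one stratified jackknife variance of the stratified mean.

    No finite population correction is applied (an assumption of this sketch).
    """
    total = 0.0
    for h, ys in enumerate(strata):
        n_h = len(ys)
        replicates = []
        for i in range(n_h):
            # Drop unit i from stratum h and leave the other strata unchanged.
            reduced = [ys[:i] + ys[i + 1:] if g == h else s
                       for g, s in enumerate(strata)]
            replicates.append(stratified_mean(reduced, weights))
        center = statistics.fmean(replicates)
        total += (n_h - 1) / n_h * sum((r - center) ** 2 for r in replicates)
    return total


def bootstrap_variance(strata, weights, n_boot=500, seed=0):
    """Variance of the stratified mean over within-stratum bootstrap resamples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resampled = [rng.choices(ys, k=len(ys)) for ys in strata]
        estimates.append(stratified_mean(resampled, weights))
    return statistics.variance(estimates)


def impose_mcar(values, rate, rng):
    """Set each value to None independently with probability `rate` (MCAR)."""
    return [None if rng.random() < rate else v for v in values]


def mean_impute(values):
    """Fill missing entries with the observed mean (a simple stand-in method)."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


rng = random.Random(42)
# Synthetic right-skewed data in two strata; NOT the survey data from the paper.
complete = [[rng.lognormvariate(0.0, 1.0) for _ in range(40)],
            [rng.lognormvariate(0.5, 1.0) for _ in range(60)]]
weights = [0.4, 0.6]  # assumed stratum population shares

imputed = [mean_impute(impose_mcar(ys, 0.15, rng)) for ys in complete]
v_jk = jackknife_variance(imputed, weights)
v_boot = bootstrap_variance(imputed, weights)
print(f"jackknife variance: {v_jk:.5f}")
print(f"bootstrap variance: {v_boot:.5f}")
```

For the stratified mean, this jackknife reduces algebraically to the textbook estimator, the sum over strata of W_h^2 s_h^2 / n_h, which gives a convenient check on the implementation. Note that treating imputed values as if they were observed, as this sketch does, generally understates the true variance.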

Original language	English (US)
Pages (from-to)	195-207
Number of pages	13
Journal	Model Assisted Statistics and Applications
Volume	12
Issue number	3
DOIs	https://doi.org/10.3233/MAS-170394
State	Published - 2017

Bibliographical note

Publisher Copyright:
© 2017 IOS Press and the authors.

Keywords

Missing at random (MAR)
bootstrap
copula imputation
jackknife

Cite this

@article{c35b0d7fe6f44aa9a88aa1941851e14c,

title = "Variance estimation by multivariate imputation methods in complex survey designs",

abstract = "In this paper, we consider variance estimation of the sample mean when the missing data have been imputed with multivariate imputation methods. Modern multivariate imputation methods for missing data are complicated and computationally expensive. These multivariate imputation methods do not require the normality assumption to impute the missing values. Under this assumption-free condition, we compare the variance estimation performance of six modern multivariate imputation methods, including copula imputation, random forest imputation, principal component analysis imputation, and k-nearest neighbors imputation, in complex sampling designs such as stratified sampling and cluster sampling, and with resampling approaches to variance estimation by jackknife and bootstrap methods in stratified sampling. We conducted simulation studies using National Health and Nutrition Survey data, considering 5% and 15% missing completely at random (MCAR) rates. Based on the mean squared errors of the sample mean from our simulation study with 500 resamples in complex survey designs, the random forest (RF) imputation method appears to outperform other imputation methods overall in terms of percent relative efficiency (RE(%)) when the data have high skewness at the 5% missing rate and when the data have high excess kurtosis at the 15% missing rate, whereas the principal component analysis (PCA) imputation method appears to outperform other imputation methods when the data have high skewness at the 5% and 15% missing rates. In particular, the multivariate imputation methods appear to be efficient in terms of RE(%) in the cluster sampling design when the data have high skewness or excess kurtosis at the 15% missing rate.",

keywords = "Missing at random (MAR), bootstrap, copula imputation, jackknife",

author = "Kim, {Jong Min} and Lee, {Kee Jae} and Kim, Wonkuk",

note = "Publisher Copyright: {\textcopyright} 2017 IOS Press and the authors.",

year = "2017",

doi = "10.3233/MAS-170394",

language = "English (US)",

volume = "12",

pages = "195--207",

journal = "Model Assisted Statistics and Applications",

issn = "1574-1699",

publisher = "IOS Press",

number = "3",

}

TY - JOUR

T1 - Variance estimation by multivariate imputation methods in complex survey designs

AU - Kim, Jong Min

AU - Lee, Kee Jae

AU - Kim, Wonkuk

PY - 2017

Y1 - 2017

N2 - In this paper, we consider variance estimation of the sample mean when the missing data have been imputed with multivariate imputation methods. Modern multivariate imputation methods for missing data are complicated and computationally expensive. These multivariate imputation methods do not require the normality assumption to impute the missing values. Under this assumption-free condition, we compare the variance estimation performance of six modern multivariate imputation methods, including copula imputation, random forest imputation, principal component analysis imputation, and k-nearest neighbors imputation, in complex sampling designs such as stratified sampling and cluster sampling, and with resampling approaches to variance estimation by jackknife and bootstrap methods in stratified sampling. We conducted simulation studies using National Health and Nutrition Survey data, considering 5% and 15% missing completely at random (MCAR) rates. Based on the mean squared errors of the sample mean from our simulation study with 500 resamples in complex survey designs, the random forest (RF) imputation method appears to outperform other imputation methods overall in terms of percent relative efficiency (RE(%)) when the data have high skewness at the 5% missing rate and when the data have high excess kurtosis at the 15% missing rate, whereas the principal component analysis (PCA) imputation method appears to outperform other imputation methods when the data have high skewness at the 5% and 15% missing rates. In particular, the multivariate imputation methods appear to be efficient in terms of RE(%) in the cluster sampling design when the data have high skewness or excess kurtosis at the 15% missing rate.

KW - Missing at random (MAR)

KW - bootstrap

KW - copula imputation

KW - jackknife

UR - http://www.scopus.com/inward/record.url?scp=85029451777&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85029451777&partnerID=8YFLogxK

U2 - 10.3233/MAS-170394

DO - 10.3233/MAS-170394

M3 - Article

AN - SCOPUS:85029451777

SN - 1574-1699

VL - 12

SP - 195

EP - 207

JO - Model Assisted Statistics and Applications

JF - Model Assisted Statistics and Applications

IS - 3

ER -
