BigSurv20 program



Advances in missing data imputation for large social surveys

Moderator: Kyle Lang

Friday 27th November, 10:00 - 11:30 (ET, GMT-5)
7:00 - 8:30 (PT, GMT-8)
16:00 - 17:30 (CET, GMT+1)

Multiple imputation for high-dimensional data: A resampling study comparing state-of-the-art methods

Mr Edoardo Costantini (Tilburg University) - Presenting Author
Dr Kyle Lang (Tilburg University)

Missing data afflict large social surveys just like any other data collection endeavor, and the severity of the issue is exacerbated by attrition. Hence, researchers working with these data need tools to correct for the bias introduced by nonresponse. One of the most popular principled missing data treatments, multiple imputation (MI), is challenged by computational limitations when applied to such datasets. The large number of items recorded, coupled with the longitudinal nature of these surveys and the need to preserve complex interactions and non-linear relations, easily produces high-dimensional (p > n) imputation problems.

We performed a thorough review of the high-dimensional prediction literature to find the most promising extant MI methods. We found that principal component regression, classification and regression trees, ensemble learning, and regularized regression have all been used for MI, in both frequentist and Bayesian versions. However, these MI algorithms have not yet been comprehensively compared in terms of their ability to support statistically valid analyses and inferences.

Using data from the European Values Study, we conduct a resampling study to investigate the performance of state-of-the-art methods for handling missing data in high-dimensional settings. Finally, we provide guidance on which methods social scientists should adopt when dealing with missing data in large social surveys, and outline future directions for development in the field.

Hierarchical linear modeling with missing data using Blimp

Dr Brian Keller (University of Texas) - Presenting Author

There are a variety of ways to handle the clustering in large-scale social science surveys, with hierarchical linear models (sometimes referred to as multilevel models) being one of them. While hierarchical linear models inherently handle missing data on the outcome, they require the predictors to be completely observed, which often leads researchers to use listwise deletion. Appropriately handling missingness on the predictors poses a variety of challenges. For example, predictors are often a mix of different response types (e.g., continuous, binary, ordinal, or nominal variables), observed at different levels of the hierarchical linear model (e.g., at level-1, at level-2, etc.), and related to the outcome in a nonlinear fashion (e.g., random coefficients, interactions, etc.). While most software packages struggle to solve all these issues, Blimp is a Bayesian hierarchical linear modeling and imputation package that can address them. The goal of Blimp is to provide free software that requires minimal specification to assist researchers in the estimation and imputation of hierarchical linear models.

Blimp models the distribution of variables using a multivariate hierarchical linear model, maintaining the between-cluster association using latent variables to represent each cluster's mean. By deriving the appropriate conditional distributions, Blimp is able to generate multiple imputations for a wide variety of clustered data structures with normally distributed continuous, binary, ordinal, and nominal variables.

The real strength of Blimp lies in allowing users to specify substantive hierarchical linear models that include random coefficients, interactions, and non-linearities. By specifying an analysis model, users can employ Blimp as the primary analytic tool, which provides summaries of the Bayesian posterior distributions for the substantive parameters of interest. This talk will illustrate the effectiveness of Blimp as a modeling framework by analyzing cross-level interactions from a smoking cessation survey. This application will demonstrate the challenges of missing data with nominal variables, random effects, and interactions. The example highlights the important features of Blimp, including automatic dummy coding and latent formulation of categorical variables, graphical and numerical convergence diagnostics, Bayesian conditional effects analysis (e.g., simple slope analysis), standardized regression coefficients, and effect sizes for clustered data.

Mass imputation based on combined administrative and survey data: An application to the Dutch population census

Dr Sander Scholtus (Statistics Netherlands) - Presenting Author
Dr Jacco Daalmans (Statistics Netherlands)

Since 1981, the Dutch population census has been a virtual census, where the census tables are estimated by re-using data from existing sources rather than collecting data with a dedicated questionnaire. In the Netherlands, most variables required for the census are available from administrative data that cover the target population. An exception occurs for the variable educational attainment, which is available partly from incomplete registers and partly from the Labour Force Survey. In 2001 and 2011, a repeated weighting method was used to handle missing data during estimation of census tables. It is known that repeated weighting has certain practical limitations when estimating large numbers of high-dimensional frequency tables, and these limitations were indeed encountered during the 2011 census.

For the next virtual population census in 2021, a mass-imputation method has been proposed to predict all missing values of educational attainment in the population. An imputation model has been developed based on logistic regression, which takes into account the ordinal nature of educational attainment. After mass imputation, tables can be estimated by a straightforward aggregation of the imputed microdata.
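As a rough illustration of the workflow described above (impute every missing value from an ordinal logistic model, then aggregate), the following minimal Python sketch may help. It is not the authors' model: the category labels, the cut-points in `THRESHOLDS`, the single covariate with coefficient `BETA_AGE`, the toy `population` records, and the choice to impute by a random draw from the predicted distribution are all assumptions made for the example.

```python
import math
import random
from collections import Counter

# Hypothetical ordered categories and model parameters (illustrative only).
LEVELS = ["low", "medium", "high"]
THRESHOLDS = [-0.5, 1.0]   # assumed cut-points of a cumulative logit model
BETA_AGE = 0.03            # assumed effect of age on the latent scale


def category_probs(age):
    """Category probabilities under a proportional-odds (cumulative logit) model."""
    eta = BETA_AGE * age
    # P(Y <= k) = logistic(threshold_k - eta); the last category closes at 1.
    cum = [1.0 / (1.0 + math.exp(-(t - eta))) for t in THRESHOLDS] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]


def mass_impute(records, rng):
    """Fill each missing 'edu' value with a draw from the predicted distribution."""
    completed = []
    for rec in records:
        edu = rec["edu"]
        if edu is None:
            edu = rng.choices(LEVELS, weights=category_probs(rec["age"]), k=1)[0]
        completed.append({**rec, "edu": edu})
    return completed


def frequency_table(records):
    """Estimate a table by simple aggregation of the completed microdata."""
    return Counter((rec["region"], rec["edu"]) for rec in records)


# Toy population: observed values are kept, missing ones (None) are imputed.
population = [
    {"region": "north", "age": 30, "edu": "high"},
    {"region": "north", "age": 55, "edu": None},
    {"region": "south", "age": 42, "edu": None},
    {"region": "south", "age": 67, "edu": "low"},
]
completed = mass_impute(population, random.Random(2021))
table = frequency_table(completed)
```

Because the imputed values are predictions, any table built this way carries estimation uncertainty, which is why a variance assessment is needed before publication.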

Since mass imputation involves predicting the missing values, the resulting tables have some estimation uncertainty. Before publication, the accuracy of the estimated values in the census tables needs to be assessed. We discuss two methods for evaluating the variance of estimated frequency tables based on mass imputation: an analytical variance approximation and a finite-population bootstrap method. The two approaches are compared in a simulation study on artificial data and in an application to real data from the 2011 Dutch census.

Sequential imputation with integrated model selection: A novel approach to missing value imputation in high-dimensional (survey) data

Mr Micha Fischer (University of Michigan) - Presenting Author

Incomplete observations due to various causes (i.e. item nonresponse, unit nonresponse, failure to link records, and panel attrition) are an inevitable problem in survey data sets. Multiple sequential imputation is often used to impute those missing values. However, in data sets where many variables are affected by missing values, appropriate specification of sequential regression models can be burdensome and time-consuming, as a separate model needs to be developed by a human imputer for each incomplete variable. This task is even more complex because survey data typically consist of many different kinds of variables (i.e. continuous, binary, and multi-categorical) with possibly nontrivial and non-linear relationships. Available software packages for automated imputation procedures (e.g. MICE, IVEware) require model specifications for each variable containing missing values. Additionally, default models in this software can lead to bias in imputed values, for example when variables are non-normally distributed.

This research aims to automate the process of sequential imputation of missing values in high-dimensional data sets consisting of potentially non-normally distributed variables and potentially complex and non-linear interactions. To achieve this goal, we propose modifying the sequential imputation procedure. First, model specification via an automated variable selection procedure (e.g. adaptive LASSO, elastic net) is performed. Second, the process carries out model selection from a pool of several parametric and non-parametric models in each step of the sequential imputation procedure. The model selection is based on prediction accuracy and the similarity between imputed and observed values conditional on the response propensity score for the outcome variable.
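One step of such a procedure might look roughly like the Python sketch below. It is a simplified illustration rather than the proposed method: the pool holds only two hypothetical candidates (`fit_mean`, `fit_linear`), selection uses cross-validated squared error alone, the propensity-based similarity criterion and the variable selection stage are omitted, and missing values are filled with deterministic predictions where a proper multiple imputation would draw them with noise.

```python
def fit_mean(xs, ys):
    """Candidate 1: intercept-only model."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Candidate 2: simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


# Hypothetical model pool; a real pool would also hold non-parametric models.
CANDIDATES = {"mean": fit_mean, "linear": fit_linear}


def cv_error(fit, xs, ys, folds=3):
    """Mean squared prediction error from k-fold cross-validation."""
    total = 0.0
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        test = [i for i in range(len(xs)) if i % folds == f]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in test)
    return total / len(xs)


def impute_step(xs, ys):
    """Select the best candidate on observed cases, then fill the missing y values."""
    obs = [i for i, y in enumerate(ys) if y is not None]
    x_obs, y_obs = [xs[i] for i in obs], [ys[i] for i in obs]
    best = min(CANDIDATES, key=lambda name: cv_error(CANDIDATES[name], x_obs, y_obs))
    predict = CANDIDATES[best](x_obs, y_obs)
    return best, [y if y is not None else predict(x) for x, y in zip(xs, ys)]


# Toy data: y is roughly 2 * x, with two missing values (None).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
best_model, completed_ys = impute_step(xs, ys)
```

In a full sequential procedure this step would be repeated for each incomplete variable in turn, using the other (currently completed) variables as predictors, and the cycle would be iterated.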

A simulation study based on a survey data set (NHANES) investigates in which situations this automated procedure can outperform approaches implemented in the currently available software (MICE, IVEware). Evaluation of the proposed method focuses on two aspects: (1) accuracy, compared through differences in the quantitative properties of hypothetical models of interest fit to the imputed data, and (2) run time.