BigSurv18 program


Applying Machine Learning and Automation to Improve Imputation - Replicate I

Chair: Dr Steven Cohen (RTI International)
Time: Saturday 27th October, 14:00 - 15:30
Room: 40.010

The Enigma of Survey Research in the Digital Age - A Paradigm

Dr Mansour Fahimi (GfK) - Presenting Author

The times, they are a-changing. Traditional methods of survey sampling require a number of fundamental, yet increasingly unattainable, assumptions. For example, without a complete sampling frame and a data collection protocol that can secure high response rates, the available inferential machinery can break down. On the other hand, with the proliferation of digital and big data, researchers are compelled to supplement structured surveys with data from nontraditional sources to address clients’ evolving informational needs. These include point-of-purchase behavioral data that can be passively traced in real time, as well as layers of ancillary and modeled data from commercial sources that can be fused with survey data. The allure of these alternatives grows further when one weighs not only the limitations of the traditional methods of data collection but also their escalating costs and time requirements.

However, among the most pressing shortfalls of data obtained from the emerging sources are their undeterminable representational properties and overall quality. On the one hand, digital tracing cannot be implemented in all cases, and on the other, fusion of ancillary data oftentimes involves presumptuous extrapolations at high levels of aggregation using a limited number of linkage variables. Without remedial measures that can render such data representative of a known target population, or at least gauge their coverage issues, even sophisticated data mining and analytical techniques can fail to extract measurable inferences from large heaps of data.

Based on decades of hands-on experience with survey and market research projects, as well as ongoing collaborations with private and public sector clients, the author will provide an overview of the growing divide between their current informational needs and what traditional methods of survey research can offer. He will then highlight some of the hazards associated with inferences based on untreated data with unknown coverage properties. Moreover, he will discuss robust weighting and calibration methodologies that can help reduce some of the inherent biases associated with blending survey data with those from other sources. This presentation will conclude with some contemplations about the new directions in survey research in the digital age.


Mass Imputation Combining Information From Big Data

Dr Shu Yang (North Carolina State University)
Professor Jae Kwang Kim (Iowa State University) - Presenting Author

Multiple data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we consider an imputation approach to combining a probability sample with big observational data. Unlike the usual imputation for missing data analysis, we create imputed values for all elements in the probability sample. Such mass imputation is very attractive in the context of data integration (Kim and Rao, 2012). We extend mass imputation as a tool for data integration with big data. The theory and methods for mass imputation are presented. The matching estimator of Rivers (2007) is also covered as a special case. Variance estimation with mass-imputed data is discussed. The simulation results demonstrate that the proposed estimator outperforms existing competitors in terms of robustness and efficiency.
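
As a rough illustration of mass imputation, and not the authors' estimator, the sketch below assumes nearest-neighbor matching on a single covariate x, with the big observational data serving as donors. The abstract does not specify an imputation model, so the matching rule, the names `sample_a`, `big_b`, and `mass_impute_nn`, and all values are hypothetical.

```python
def mass_impute_nn(sample_a, big_b):
    """Impute y for every unit of the probability sample.

    sample_a: list of (x, design_weight) pairs; y is not observed here.
    big_b:    list of (x, y) pairs from the big observational source.
    Assumption: the donor is the big-data unit closest on x.
    """
    imputed = []
    for x_a, _weight in sample_a:
        _x_donor, y_donor = min(big_b, key=lambda unit: abs(unit[0] - x_a))
        imputed.append(y_donor)
    return imputed


def weighted_mean(values, weights):
    """Design-weighted mean of the imputed values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data.
sample_a = [(1.0, 100.0), (2.0, 100.0), (3.0, 200.0)]
big_b = [(0.9, 10.0), (2.2, 20.0), (2.9, 30.0), (5.0, 50.0)]

y_imputed = mass_impute_nn(sample_a, big_b)
estimate = weighted_mean(y_imputed, [w for _x, w in sample_a])
print(y_imputed, estimate)
```

Every unit in the probability sample receives an imputed value, and the design weights then carry the imputed sample up to the population.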


An Imputation Solution for Differentiating Between Unreported Attitudes and Genuine Nonattitudes in Survey Data

Dr Natalie Jackson (JUST Capital) - Presenting Author
Dr Jeff Gill (American University)

Missing data in surveys threaten the representativeness of a survey, as well as its ability to be combined with other data sources. Yet the problem of item nonresponse in survey work is often addressed by treating “don’t know” or nonattitude responses as missing values and dropping them from analysis with casewise (listwise) deletion. There are two problems with this approach. (1) Casewise deletion is the wrong way to deal with unrecorded data unless they are missing completely at random (not conditional on other data, observed or unobserved). Otherwise, statistical principles dictate that we should use some form of imputation. Imputation, though, implies that these respondents actually have attitudes on the questions but have declined to state them, which leads to the second issue. (2) We do not know whether non-substantive responses are true nonattitudes or whether the respondent is choosing not to reveal an existing attitude. In this work we demonstrate first that nonattitudes and “don’t know” responses are not random, but rather come from a distinct group of survey respondents. This is shown by modeling relevant missingness as a dichotomous outcome variable explained by various characteristics, including demographic attributes, other attitudinal questions, and group-level contexts. This model allows us to produce an imputation model to predict missingness due to ignorance versus intransigence. We use these newly generated “data” as part of the survey analysis, using the appropriate statistical treatment of the coefficient variability, to produce estimates that are not plagued by casewise deletion or fictitious attitudes generated by imputation. Our results demonstrate that this approach is useful for a wide range of survey research, including pre-election polls and non-political surveys, and constitutes a substantial improvement in the treatment of item nonresponse.
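
One way to picture the first modeling step is a logistic regression of a “don’t know” indicator on a respondent characteristic. The abstract does not name the model, so the logistic form, the single predictor (`interest`, a hypothetical political-interest score from 0 to 4), the gradient-ascent fit, and the toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent
    on the average log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            residual = y - sigmoid(b0 + b1 * x)
            g0 += residual
            g1 += residual * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Hypothetical data: dk = 1 if the respondent answered "don't know".
interest = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
dk = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]

b0, b1 = fit_logistic(interest, dk)
p_low = sigmoid(b0 + b1 * 0)   # predicted "don't know" rate, lowest interest
p_high = sigmoid(b0 + b1 * 4)  # predicted "don't know" rate, highest interest
print(round(b1, 3), round(p_low, 3), round(p_high, 3))
```

A slope that differs clearly from zero would indicate that “don’t know” responses depend on respondent characteristics rather than occurring at random, which is the pattern the abstract reports.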


Can Missing Patterns in Covariates Improve Imputation for Missing Data?

Mr Micha Fischer (University of Michigan) - Presenting Author
Ms Felicitas Mittereder (University of Michigan)

Item nonresponse, especially on sensitive questions like income, is a common problem in survey research. It is usually tackled with techniques such as complete-case analysis or imputation. In this paper, we explore a new approach using tree-based methods to impute missing values in survey data.

When imputing for item nonresponse, survey researchers usually assume complete covariates or impute the missing values within those covariates first (i.e., sequential imputation steps). This method can be problematic in two ways. First, if the imputation models in the sequential imputation steps are oversimplified, we might introduce imputation bias into survey estimates. Second, we lose the information that the respondent chose not to answer the question. If the data are not missing at random (NMAR), this information by itself might be important for the imputation model. Thus, the item-missing pattern of respondents can be informative for predicting the variable of interest. By including ‘missingness’ as its own category within the imputation covariates, we could improve imputation accuracy. Tree-based methods can incorporate this additional information and account for complex interactions in the covariates at the same time. In a simulation study, we investigate in which situations this procedure can outperform the usual approaches.

Furthermore, we apply this method to survey data with validation data from administrative records and investigate how much the new technique can improve on standard imputation methods.
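
The idea of treating missingness as its own covariate category can be sketched with the simplest possible stand-in for a tree: a one-level partition on a single categorical covariate, where each category, including an explicit "missing" level, forms a leaf and the leaf mean of the donors is used as the imputed value. This is an illustration under those assumptions, not the authors' procedure; the field names `employment` and `income` and all values are hypothetical.

```python
from collections import defaultdict

MISSING = "missing"  # explicit category for an unanswered covariate


def leaf_key(record):
    """Return the covariate category, treating nonresponse as its own level."""
    value = record["employment"]
    return MISSING if value is None else value


def impute_income(records):
    """Fill missing income with the donor mean of the record's leaf."""
    donors = defaultdict(list)
    for r in records:
        if r["income"] is not None:
            donors[leaf_key(r)].append(r["income"])

    observed = [v for values in donors.values() for v in values]
    overall_mean = sum(observed) / len(observed)

    completed = []
    for r in records:
        if r["income"] is not None:
            completed.append(r["income"])
            continue
        leaf = donors.get(leaf_key(r))
        # Fall back to the overall mean if the leaf has no donors.
        completed.append(sum(leaf) / len(leaf) if leaf else overall_mean)
    return completed


# Hypothetical data: respondents who skipped "employment" report higher income.
records = [
    {"employment": "employed", "income": 50.0},
    {"employment": "employed", "income": 60.0},
    {"employment": None, "income": 90.0},
    {"employment": None, "income": 110.0},
    {"employment": "employed", "income": None},
    {"employment": None, "income": None},
]

completed = impute_income(records)
print(completed)
```

In this toy data, the respondent who skipped both items is imputed from other covariate nonrespondents rather than from the pooled sample, so the information carried by the skip is retained.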