BigSurv18 program







Methods to Improve Survey Representativeness Using High Dimensional Data

Chair: Professor Michael Elliott (University of Michigan)
Time: Friday 26th October, 14:15 - 15:45
Room: 40.002

The explosion of "big data" in the 21st century has led some researchers to declare the end of the traditional probability sample paradigm. We propose three talks that refute that belief and discuss methods to combine the strengths of high dimensional data, from either probability or nonprobability samples, with the strengths of more traditional survey data.

Calibrating Big Data for Population Inference: Applying Quasi-Randomization Approach to Naturalistic Driving Data Using Bayesian Additive Regression Trees

Professor Michael Elliott (University of Michigan) - Presenting Author
Mr Ali Rafei (University of Michigan)
Professor Carol Flannagan (University of Michigan)


Although probability sampling has been the "gold standard" of population inference, rising costs and downward trends in response rates have led to a growing interest in non-probability samples. Non-probability samples, however, can suffer from selection bias. Here we develop "quasi-randomization" weights to improve the representativeness of non-probability samples. This method assumes the non-probability sample has an unknown probability sampling mechanism that can be estimated using a reference probability survey. We apply the proposed method to improve the representativeness of the University of Michigan Transportation Research Institute Safety Pilot Study, a convenience sample of over 3,000 vehicles that were instrumented and followed for an average of one year, using the National Household Transportation Survey as our probability sample of drivers.
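
As a rough illustration of the quasi-randomization idea, the sketch below combines a hypothetical convenience sample (np_sample) with a hypothetical reference probability survey (ref_sample), models membership in the convenience sample, and inverts the estimated propensity to obtain pseudo-weights. A logistic model stands in for the Bayesian additive regression trees used in this talk, and the weighting variant shown is one common formulation, not the authors' implementation.

```python
# Minimal quasi-randomization pseudo-weighting sketch (hypothetical data frames).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def quasi_randomization_weights(np_sample, ref_sample, covariates, ref_weight_col):
    """Pseudo-weights for a nonprobability sample via a reference probability survey."""
    # Stack the two samples on the shared covariates.
    stacked = pd.concat([np_sample[covariates], ref_sample[covariates]], ignore_index=True)
    in_np = np.r_[np.ones(len(np_sample)), np.zeros(len(ref_sample))]
    # Nonprobability units get weight 1; reference units carry their design
    # weights so that the reference sample stands in for the full population.
    fit_weights = np.r_[np.ones(len(np_sample)), ref_sample[ref_weight_col].to_numpy()]

    model = LogisticRegression(max_iter=1000)   # stand-in for the BART propensity model
    model.fit(stacked, in_np, sample_weight=fit_weights)

    # Estimated propensity that a population unit appears in the nonprobability sample.
    p_np = model.predict_proba(np_sample[covariates])[:, 1]
    return 1.0 / np.clip(p_np, 1e-6, 1.0)       # pseudo-weights (one common variant)
```

Weighted estimates for the nonprobability sample (e.g., weighted means of driving outcomes) would then use these pseudo-weights in place of design weights.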


Bayesian Inference for Sample Surveys in the Presence of High-Dimensional Auxiliary Information

Mr Yutao Liu (Columbia University)
Professor Andrew Gelman (Columbia University)
Professor Qixuan Chen (Columbia University) - Presenting Author

The National Drug Abuse Treatment System Survey (NDATSS) is a panel survey of substance abuse treatment programs in the United States. In 2013, the NDATSS conducted its first wave of a panel survey on residential non-opioid treatment programs (non-OTPs). A random sample of programs was selected from a sampling frame constructed using the 2010 National Survey of Substance Abuse Treatment Services (N-SSATS), the latest available annual census of all substance abuse treatment programs in the United States when the NDATSS was planned. From 2010 to 2013, the population of residential non-OTPs changed, with new programs opening and some old programs closing. To account for the change in the population as well as potential response bias, we propose a Bayesian multilevel model to improve the survey inference of population quantities in the NDATSS using the newly released 2013 N-SSATS data, which contains a rich profile of up-to-date information about all residential non-OTPs in the nation. In the first level of the model, we regress each survey outcome on the propensity that a program was included in the NDATSS and auxiliary variables in the 2013 N-SSATS that were associated with the survey outcomes of interest. To allow a flexible association between the survey outcomes and the inclusion propensities, we use a penalized spline. In the second level of the model, we further model the propensity of inclusion in the NDATSS using Bayesian classification trees, which naturally handle the high-dimensional auxiliary variables in the 2013 N-SSATS data, interactions and nonlinearities, and uncertainty in the estimation of the inclusion propensities for all programs in the population. We then predict each survey outcome for the non-sampled residential non-OTPs using the inclusion propensities and auxiliary variables associated with those programs. We compute the posterior distributions of the population means and proportions via Markov chain Monte Carlo simulation using the Bayesian inference engine Stan.
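
The sketch below gives a simplified, non-Bayesian illustration of the prediction logic described in this abstract, assuming a hypothetical frame of all residential non-OTPs with auxiliary columns, a flag for NDATSS membership, and an outcome observed only for sampled programs. Off-the-shelf gradient boosting stands in for the paper's Bayesian classification trees and penalized-spline multilevel model fit in Stan, so this is an outline of the estimation steps rather than the authors' model.

```python
# Simplified prediction-based population-mean sketch (hypothetical frame and columns).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

X_COLS = ["aux1", "aux2", "aux3"]   # hypothetical auxiliary variables from the frame

def predictive_population_mean(frame):
    """frame: one row per program in the population; outcome "y" observed only where sampled."""
    X = frame[X_COLS].to_numpy()
    sampled = frame["sampled"].to_numpy(dtype=bool)

    # Step 1: estimate each program's propensity of inclusion in the sample.
    prop_model = GradientBoostingClassifier().fit(X, sampled)
    propensity = prop_model.predict_proba(X)[:, 1]

    # Step 2: regress the outcome on the inclusion propensity and the auxiliaries,
    # using the sampled programs only.
    features = np.column_stack([propensity, X])
    out_model = GradientBoostingRegressor().fit(features[sampled], frame.loc[sampled, "y"])

    # Step 3: predict the outcome for non-sampled programs and combine with the
    # observed values to estimate the population mean.
    y_full = np.where(sampled, frame["y"].to_numpy(), out_model.predict(features))
    return y_full.mean()
```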


How Non-Ignorable is the Selection Bias in Nonprobability Samples? An Illustration of New Measures Using a Large Genetic Study on Facebook

Professor Brady West (University of Michigan) - Presenting Author
Professor Phil Boonstra (University of Michigan)
Professor Roderick Little (University of Michigan)
Mr Jingwei Hu (University of Michigan)


Many survey researchers are currently evaluating the utility of "big data" that are not selected by probability sampling. Measures of the degree of potential bias from non-random selection of cases from a given population are therefore sorely needed. Existing indices of the degree of departure from a representative probability sample, such as the R-Indicator, are functions of the propensity of inclusion, obtained by modeling the inclusion probability as a function of auxiliary variables. These methods are agnostic about the relationship between the inclusion probability and the survey outcomes, which is a crucial feature of the problem. We will first describe and empirically evaluate (via simulation) simple indices of the degree of departure from ignorable selection for estimates of means that correct this deficiency, called unadjusted and adjusted potential absolute bias (PAB). The indices are based on normal pattern-mixture models applied to the problem of sample selection and are grounded in the model-based framework of non-ignorable selection first proposed in the context of nonresponse by Rubin (1976, Biometrika). The methodology also provides for sensitivity analyses to adjust inferences for departures from ignorable selection, before and after adjustment for auxiliary variables.
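
The PAB indices themselves are defined in the authors' work; as a rough illustration of the kind of pattern-mixture sensitivity analysis described here, the sketch below implements the proxy pattern-mixture mean estimator of Andridge and Little (2011), on which this framework builds. The inputs (y_s, x_s, x_mean_pop) are hypothetical: outcomes for the selected sample, a proxy for the outcome built from auxiliary variables, and the known population mean of that proxy.

```python
# Proxy pattern-mixture sensitivity sketch (Andridge & Little, 2011 style).
import numpy as np

def ppm_mean(y_s, x_s, x_mean_pop, lam):
    """Population-mean estimate under non-ignorability parameter `lam`.

    lam = 0    : selection depends on the proxy only (ignorable given x)
    lam -> inf : selection depends on the outcome only
    """
    rho = np.corrcoef(x_s, y_s)[0, 1]
    slope = (lam + rho) / (lam * rho + 1.0) * y_s.std(ddof=1) / x_s.std(ddof=1)
    return y_s.mean() + slope * (x_mean_pop - x_s.mean())

# Sensitivity analysis: track how far the estimate moves as selection becomes
# increasingly non-ignorable, e.g.
# for lam in (0.0, 1.0, 100.0):
#     print(lam, ppm_mean(y_s, x_s, x_mean_pop, lam))
```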

We will then apply the proposed indices to data from the Genes for Good (GfG) project (https://genesforgood.sph.umich.edu/), which recruits a non-probability sample of study volunteers via Facebook for genetic profiling and also collects data on important risk factors for cancer (e.g., obesity) for predictive modeling purposes. Using matched genetic profiling data from the Health and Retirement Study (HRS, which is a probability sample; see http://hrsonline.isr.umich.edu/) as a population benchmark, we will use our proposed indices to evaluate the extent of bias in GfG estimates of the prevalence of particular risk factors for cancer for both males and females age 50 and above. We will conclude with recommendations for future work in this area.


Evaluating Doubly Robust Estimation for Online Opt-In Samples With Bayesian Additive Regression Trees

Mr Andrew Mercer (Pew Research Center/Joint Program in Survey Methodology) - Presenting Author

Correcting selection bias in estimates from online, opt-in survey samples requires the use of a statistical model that makes the outcome variable of interest conditionally independent of inclusion in the sample. Elliott and Valliant (2017) describe two broad approaches for addressing this problem: quasi-randomization, which involves fitting a model to predict inclusion in the sample, and superpopulation inference, where the model predicts the survey outcome. A third approach, less commonly discussed in the context of opt-in surveys, is doubly-robust estimation, in which both outcome regression and inclusion propensity models are fit. If either one is correctly specified, the estimates will be asymptotically consistent.

Which approach works best in practice? If researchers have a high degree of confidence in their ability to predict either survey participation or the outcome variable, logic would suggest they choose the approach that best fits with their prior knowledge. However, ignorable selection cannot be known to hold with certainty, and researchers may lack extensive knowledge of potential confounders, especially when there is little visibility into the recruitment and sampling process as is often the case with online, opt-in samples. Doubly-robust strategies could hedge against uncertainty but can have high variance and may be even less accurate if both the propensity and outcome models are incorrect. Given that researchers will often produce survey estimates under less than ideal circumstances, do any of these approaches tend to produce better results when ignorability assumptions are not met?

In this paper, we use Bayesian additive regression trees (BART) to compare each of these estimation strategies for opt-in samples. We compare four estimators: propensity weighting (PW), outcome regression (OR), and two doubly-robust estimators, namely outcome regression with a residual bias correction (OR-RBC) and outcome regression with the propensity score as a covariate (OR-PSC). BART is an accurate, flexible, Bayesian machine learning algorithm that has been shown to perform well in causal inference and missing data imputation settings. It readily accommodates interactions and nonlinearities, reducing the need for assumptions about functional form.
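
For readers who want to see the four estimators side by side, the sketch below computes them under simple assumptions: a hypothetical opt-in sample with covariate matrix X_optin and outcomes y_optin, and a reference probability sample X_ref with design weights w_ref standing in for the population. Gradient boosting is used as a stand-in for BART, and the residual-correction step follows one common doubly-robust form rather than the paper's exact implementation.

```python
# Sketch of PW, OR, OR-RBC, and OR-PSC estimators for an opt-in sample
# (numeric arrays assumed; gradient boosting as a BART stand-in).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def four_estimators(X_optin, y_optin, X_ref, w_ref):
    n_opt, n_ref = len(X_optin), len(X_ref)

    # Inclusion-propensity model: opt-in sample vs. (weighted) reference sample.
    X_stack = np.vstack([X_optin, X_ref])
    z = np.r_[np.ones(n_opt), np.zeros(n_ref)]
    fit_w = np.r_[np.ones(n_opt), w_ref]
    p_model = GradientBoostingClassifier().fit(X_stack, z, sample_weight=fit_w)
    p_opt = np.clip(p_model.predict_proba(X_optin)[:, 1], 1e-6, 1 - 1e-6)
    pw = 1.0 / p_opt                                     # pseudo-weights

    # Outcome model fit on the opt-in sample.
    o_model = GradientBoostingRegressor().fit(X_optin, y_optin)
    yhat_opt = o_model.predict(X_optin)
    yhat_ref = o_model.predict(X_ref)

    est = {}
    est["PW"] = np.average(y_optin, weights=pw)          # propensity weighting
    est["OR"] = np.average(yhat_ref, weights=w_ref)      # outcome regression
    # Doubly robust: outcome regression plus pseudo-weighted mean residual.
    est["OR-RBC"] = est["OR"] + np.average(y_optin - yhat_opt, weights=pw)

    # Doubly robust: outcome regression with the propensity score as a covariate.
    p_ref = np.clip(p_model.predict_proba(X_ref)[:, 1], 1e-6, 1 - 1e-6)
    o_model_ps = GradientBoostingRegressor().fit(
        np.column_stack([X_optin, p_opt]), y_optin)
    est["OR-PSC"] = np.average(
        o_model_ps.predict(np.column_stack([X_ref, p_ref])), weights=w_ref)
    return est
```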

Using 10 parallel survey datasets from different online opt-in sample vendors, we evaluate the performance of each estimator with respect to bias, variance, and root mean squared error on five estimated measures of civic engagement for which ignorability assumptions are clearly invalid. With few exceptions, we find that OR-RBC and PW produce very similar point estimates with the lowest bias, though OR-RBC exhibits lower variance and RMSE. In contrast, OR-PSC and OR also produce very similar point estimates with higher biases, but in this case the doubly-robust OR-PSC estimates display the highest variance and RMSE.