BigSurv20 program

Getting your estimates on point! Survey calibration approaches in the era of probability and nonprobability surveys

Moderator: Stas Kolenikov
Friday 20th November, 10:00 - 11:30 (ET, GMT-5)
7:00 - 8:30 (PT, GMT-8)
16:00 - 17:30 (CET, GMT+1)

Measures of selection bias in regression coefficients estimated from non-probability samples

Dr Brady West (Institute for Social Research) - Presenting Author
Dr Roderick Little (University of Michigan)
Dr Rebecca Andridge (Ohio State University)
Dr Phil Boonstra (University of Michigan)
Dr Erin Ware (Institute for Social Research)
Ms Anita Pandit (University of Michigan)
Ms Fernanda Alvarado-Leiton (Institute for Social Research)

Selection bias in survey estimates is a major concern, particularly for non-probability samples. Recent developments, including studies presented at BigSurv18, have provided survey researchers with model-based indices of the potential selection bias in estimates of means and proportions computed from non-probability samples that may be subject to non-ignorable selection mechanisms. To our knowledge, there are currently no systematic approaches for measuring selection bias for regression coefficients, a problem of great practical importance. Generalizing recent developments, we derive novel measures of selection bias for estimates of the coefficients in linear regression models. The measures arise from normal pattern-mixture models that allow analysts to examine the sensitivity of their inferences to assumptions about the extent of the non-ignorable selection, and they leverage auxiliary variables available for the population of interest that provide information about the variable being modeled when conditioning on the predictors of interest.

After reviewing conceptual and technical details related to these measures, we describe a simulation study that demonstrates the effectiveness of the measures across a variety of different scenarios. We then apply the measures to data from two real studies. In the first application, we analyze data from the Genes for Good project, which aims to predict health indicators with polygenic scores computed from a large volunteer sample that is recruited via Facebook and provides genetic information by mail. We demonstrate that our measures can effectively detect bias in estimates of regression model coefficients based on these data, using the U.S. Health and Retirement Study as a population benchmark. These findings have important implications for other studies analyzing the relationships of polygenic scores with outcomes of interest. In the second application, we analyze data from the U.S. National Survey of Family Growth, examining predictors of months worked in the past year among a hypothetical non-probability sample of smartphone users with lower education. Using the full NSFG sample as a population benchmark, we again demonstrate the ability of these measures to indicate potential selection bias. We conclude with recommendations for the use of these measures in practice.

A method to construct the control group for a non-probability sample in clinical trials

Ms Yiqing Wang (Southern Methodist University) - Presenting Author

Asthma is becoming one of the most prevalent chronic diseases in the United States, afflicting about 6.2 million children. Since long-term self-management is believed to be effective in controlling chronic asthma, many researchers have investigated the effect of technology-based interventions, including texts and phone calls, on asthma control. A common way to investigate this question is a prospective randomized controlled study that detects the change in some medical characteristic of interest between the intervention group and the control group. However, because of ethical regulations and limited research resources, the sample size is often small and the experimental period short, which reduces the statistical power to detect the change. In 2016, the Parkland Community Health Plan began to provide a mobile phone short message service (SMS), including health education and self-management instructions, to children with asthma in its pediatric population and to their guardians. However, instead of being assigned through a randomized controlled trial, patients were included in the intervention group only if they volunteered. Therefore, the resulting sample was similar to a non-probability sample. To make a valid inference, we had to construct a control group in which individuals were expected to be identical to those in the intervention group at baseline.

In this paper, we describe the method we used to construct a control group. We developed a propensity score to measure the similarity between patients in the intervention group and the large population of patients who were not receiving treatment. A logistic regression model predicting the probability of enrollment into the intervention group was constructed using demographic and medical characteristics of all asthma patients in the database, and we took the predicted value as the propensity score. Next, for each patient in the intervention group, we identified all untreated patients who matched that patient on a set of demographic characteristics. Among these, the patient whose propensity score was most similar to that of the intervention subject was selected into the control group. By employing this method, we selected 326 out of 36,500 patients as controls. We discuss the effects of the technology-based intervention on asthma control. We also discuss the differences in the medical characteristics of the two non-probability samples selected in this manner, and the effects we believe they have on the results.
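The matching procedure described above (logistic propensity model, exact match on demographics, then nearest propensity score) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names (`sex`, `age_group`, `er_visits`), the toy records, the hand-rolled gradient-ascent fit, and the choice to match without replacement are all hypothetical.

```python
import math

def predict(beta, xi):
    """Predicted enrollment probability, used as the propensity score."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit a logistic regression by batch gradient ascent.
    Returns [intercept, coef_1, ..., coef_k]."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = yi - predict(beta, xi)
            grad[0] += err
            for k, v in enumerate(xi):
                grad[k + 1] += err * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def match_controls(treated, pool, exact_keys):
    """For each treated record, pick the unused pool record that agrees on
    exact_keys and has the closest propensity score (key "ps").
    Assumption: matching is 1:1 without replacement."""
    used, pairs = set(), {}
    for t in treated:
        candidates = [c for c in pool
                      if c["id"] not in used
                      and all(c[k] == t[k] for k in exact_keys)]
        if not candidates:
            continue  # no exact demographic match; leave this patient unmatched
        best = min(candidates, key=lambda c: abs(c["ps"] - t["ps"]))
        used.add(best["id"])
        pairs[t["id"]] = best["id"]
    return pairs

def rec(pid, treated, sex, age_group, er_visits):
    return {"id": pid, "treated": treated, "sex": sex,
            "age_group": age_group, "er_visits": er_visits}

# Hypothetical toy records (not data from the study).
patients = [
    rec("T1", 1, "F", "5-9", 3), rec("T2", 1, "M", "10-14", 1),
    rec("T3", 1, "F", "5-9", 2),
    rec("C1", 0, "F", "5-9", 3), rec("C2", 0, "F", "5-9", 0),
    rec("C3", 0, "M", "10-14", 1), rec("C4", 0, "M", "10-14", 0),
    rec("C5", 0, "M", "5-9", 0), rec("C6", 0, "F", "5-9", 2),
]

def features(p):
    """Numeric covariates for the propensity model (hypothetical choice)."""
    return [float(p["er_visits"]), 1.0 if p["sex"] == "F" else 0.0]

# Fit on all patients, then score everyone with the fitted model.
beta = fit_logistic([features(p) for p in patients],
                    [p["treated"] for p in patients])
for p in patients:
    p["ps"] = predict(beta, features(p))

treated = [p for p in patients if p["treated"] == 1]
pool = [p for p in patients if p["treated"] == 0]
pairs = match_controls(treated, pool, exact_keys=("sex", "age_group"))
print(pairs)
```

In these toy records every intervention patient has a pool member with identical covariates, so the matches are T1 with C1, T2 with C3, and T3 with C6; C5 is never eligible because no intervention patient shares its demographic cell. Whether the study matched with or without replacement is not stated, so that choice should be revisited for real data.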

The National Center for Health Statistics' Research and Development Survey

Dr Jennifer Parker (NCHS/CDC) - Presenting Author
Dr Kristen Miller (NCHS/CDC)
Dr Yulei He (NCHS/CDC)
Dr Paul Scanlon (NCHS/CDC)
Dr Katherine Irimata (NCHS/CDC)
Dr Hee-Choon Shin (NCHS/CDC)
Mr Bill Cai (NCHS/CDC)
Dr Van Parsons (NCHS/CDC)
Dr Chris Moriarity (NCHS/CDC)

The National Center for Health Statistics is assessing the usefulness of recruited web panels in multiple research areas through its Research and Development Survey (RANDS) program. These areas include the use of closed-ended probe questions and split-panel experiments for evaluating question-response patterns, and the investigation and development of statistical methodology that leverages the strength of national survey data to evaluate and possibly improve health estimates from recruited panels. Recruited web panels, with their lower cost and faster production cycle, in combination with established population health surveys, may be useful for some purposes for statistical agencies. Our initial results indicate that web survey data from a recruited panel can be used for question evaluation studies without affecting other survey content. However, the success of these data in providing estimates that align with those from large national surveys will depend on many factors, including further understanding of design features of the recruited panel (e.g., coverage and mode effects), the statistical methods and covariates used to obtain the original and adjusted weights, and the health outcomes of interest. This presentation will provide an overview of the RANDS program, including descriptions of the first four rounds of data, highlights of specific measurement and estimation research projects, and information about data access for external users.

Boosted kernel weighting - Using statistical learning to improve inference from nonprobability samples

Mr Christoph Kern (University of Mannheim) - Presenting Author
Professor Yan Li (University of Maryland)
Ms Lingxiao Wang (University of Maryland)

Given the growing popularity of nonprobability samples and (big) web data as cost- and time-efficient alternatives to probability sampling, a variety of adjustment approaches have been proposed to correct biases, e.g., due to self-selection in non-random samples. Popular methods such as inverse propensity-score weighting (IPSW) and propensity score adjustment by subclassification (PSAS) utilize a probability sample as a reference to estimate pseudo-weights for the nonprobability sample based on propensity scores. A recent contribution, Kernel Weighting (KW), has been shown to be able to outperform IPSW and PSAS with respect to bias reduction and efficiency. However, the effectiveness of these methods critically depends on the ability of the underlying propensity model to reflect the true selection process, which is a challenging task with parametric regression. In this study, we combine the KW approach with propensity scores that are estimated with machine learning methods, which provide added flexibility over logistic regression. Specifically, we compare KW pseudo-weights that are based on model-based recursive partitioning, conditional random forests, gradient tree boosting, and model-based boosting in simulations as well as in a real data example. In this context, we propose to optimize covariate balance between the probability and nonprobability samples as the tuning criterion. Our results indicate that boosting methods in particular represent promising alternatives to logistic regression and result in KW estimates with lower bias in a variety of settings, without increasing variance. This is especially the case for complex selection scenarios where the true selection model includes, e.g., strong non-additive terms.
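To make the pseudo-weighting step concrete, the following is a minimal sketch of the general kernel-weighting idea: each probability-sample unit's survey weight is shared among the nonprobability-sample units in proportion to a kernel of the distance between their propensity scores. The abstract does not give implementation details, so the Gaussian kernel, the bandwidth of 0.1, the toy scores and weights, and the simple mean-difference balance measure are all assumptions made for illustration. Propensity scores are taken as given here; in practice they could come from any learner, including the boosting methods mentioned above.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel (assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_pseudo_weights(ps_nonprob, ps_prob, w_prob, bandwidth=0.1):
    """Distribute each probability-sample weight over nonprobability units,
    proportionally to the kernel of the propensity-score distance."""
    pseudo = [0.0] * len(ps_nonprob)
    for p_i, w_i in zip(ps_prob, w_prob):
        k = [gaussian_kernel((p_i - p_j) / bandwidth) for p_j in ps_nonprob]
        total = sum(k)
        if total == 0.0:
            continue  # kernel underflowed everywhere; this weight is not assigned
        for j, k_ij in enumerate(k):
            pseudo[j] += w_i * k_ij / total
    return pseudo

def weighted_mean(x, w):
    return sum(a * b for a, b in zip(x, w)) / sum(w)

def balance_gap(x_nonprob, w_nonprob, x_prob, w_prob):
    """Absolute difference in weighted covariate means between the samples.
    A simple stand-in for a covariate-balance tuning criterion."""
    return abs(weighted_mean(x_nonprob, w_nonprob) - weighted_mean(x_prob, w_prob))

# Hypothetical toy inputs: propensity scores, survey weights, one covariate.
ps_prob = [0.2, 0.5, 0.8]
w_prob = [100.0, 200.0, 300.0]
x_prob = [1.0, 2.0, 3.0]

ps_nonprob = [0.2, 0.2, 0.5, 0.8]
x_nonprob = [1.0, 1.0, 2.0, 3.0]

pseudo_w = kernel_pseudo_weights(ps_nonprob, ps_prob, w_prob, bandwidth=0.1)

# Compare covariate balance before (equal weights) and after pseudo-weighting.
gap_unweighted = balance_gap(x_nonprob, [1.0] * len(x_nonprob), x_prob, w_prob)
gap_weighted = balance_gap(x_nonprob, pseudo_w, x_prob, w_prob)
print(pseudo_w, gap_unweighted, gap_weighted)
```

Because each survey weight is split into shares that sum to one, the pseudo-weights add up to the total of the probability-sample weights. A balance measure such as `balance_gap` could then be compared across candidate propensity models or tuning settings, in the spirit of the covariate-balance criterion proposed in the abstract.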