BigSurv18 program







New Approaches to Augment Sampling Frames I: Does Bigger Data mean Better Sampling Frames?

Chair: Dr Trent Buskirk (Center for Survey Research, UMass Boston)
Time: Saturday 27th October, 14:00 - 15:30
Room: 40.008

Investigating the Value of Appending New Types of Big Data to Address-Based Survey Frames and Samples

Dr Paul Lavrakas (Independent Consultant) - Presenting Author
Ms Ashley Hyon (Marketing Systems Group)
Mr David Malarek (Marketing Systems Group)


Survey science has advanced considerably over the past two decades through the use of data that can be appended to survey sampling frames. These advances continue as survey researchers find creative ways for appended frame data to aid survey recruitment, weighting, and/or nonresponse bias investigations. Traditionally in the United States, these data have been census block group variables such as the percentage of dwelling units in the local area of the sampled address that are single-family houses, the proportion of adult females in the labor force, the median household income, and the proportion of those 25 years of age or older with a bachelor’s degree. In addition to these government statistics, survey sample vendors have access to other non-public “big data” frames that can be matched at the level of an individual address. These variables come primarily from commercial sources and are not without error. They include the age, educational attainment, race, and Hispanicity of the head of household. Despite the measurement errors inherent in these household-level variables, some have been found to be statistically reliable (and meaningful in a multivariate context) predictors of survey response (Jackson, McPhee, and Lavrakas, 2015). Furthermore, although such household-specific data are not available for all addresses, the “missingness” of such variables is sometimes itself a reliable and important predictor of survey response (Jackson et al., 2015). Following this line of research, our presentation will use a national survey dataset of nearly 100K cases sampled in 2017. The response rate to this mail survey was approximately 22%. To each sampled address we originally appended traditional block-level census data and some household-level characteristics.
In our first set of analyses, we will report on the nature of the nonresponse bias in this survey using these appended characteristics. Then we will append to the 100K survey addresses a much larger set of nontraditional characteristics, ones that survey researchers have so far rarely used. These demographic and psychographic characteristics, drawn from varied commercial “big data” sources, include a great deal of additional information about the household and its members, such as credit card and other financial expenditure history, ownership status, years in neighborhood, mortgage debt, automobile ownership, SES categorizations, ownership of pets, and other lifestyle indicators. We will then conduct a second nonresponse bias investigation and compare its results with those of the first. We expect our research to show that as more and more big data can be appended to addresses, survey researchers will be able to conduct more informative (valuable) investigations into nonresponse bias. In addition, we will discuss how the richness of information from new big data sources that can be appended to individual addresses can and should be used by survey researchers to predict survey response before the field period begins, and how to leverage this information in gaining survey cooperation via response propensity modeling.
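
As a small illustration of the point that the missingness of an appended variable can itself be related to response, the sketch below tabulates response rates by category of one appended frame variable, treating a missing value as its own category. The records, the field names (`responded`, `hh_age_band`), and all values are invented for illustration; none of them come from the study described in this abstract.

```python
from collections import defaultdict

# Hypothetical sampled addresses: 'hh_age_band' stands in for one appended
# commercial variable; None means the vendor had no value for that address.
sample = [
    {"responded": True,  "hh_age_band": "65+"},
    {"responded": True,  "hh_age_band": "65+"},
    {"responded": False, "hh_age_band": "65+"},
    {"responded": False, "hh_age_band": "18-34"},
    {"responded": True,  "hh_age_band": "18-34"},
    {"responded": False, "hh_age_band": "18-34"},
    {"responded": False, "hh_age_band": "18-34"},
    {"responded": False, "hh_age_band": None},
    {"responded": False, "hh_age_band": None},
    {"responded": True,  "hh_age_band": None},
]


def response_rate_by_category(records, variable):
    """Return {category: response rate}, grouping missing values as 'MISSING'."""
    counts = defaultdict(lambda: [0, 0])  # category -> [respondents, sampled]
    for rec in records:
        value = rec.get(variable)
        category = "MISSING" if value is None else value
        counts[category][0] += int(rec["responded"])
        counts[category][1] += 1
    return {cat: resp / total for cat, (resp, total) in counts.items()}


rates = response_rate_by_category(sample, "hh_age_band")
overall = sum(rec["responded"] for rec in sample) / len(sample)
```

Comparing each category's rate in `rates` with `overall` gives a first descriptive look at where respondents and nonrespondents differ on a frame variable; a full nonresponse bias analysis would go well beyond this.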


Is More Data Better Data? Assessing the Quality of Commercial Data Appended to an Address-Based Sampling Survey Frame

Dr Rebecca Medway (American Institutes for Research) - Presenting Author
Mrs Nicole Guarino (American Institutes for Research)
Ms Carol Wan (American Institutes for Research)
Mrs Danielle Battle (American Institutes for Research)


Given declining survey response rates, there has been increasing interest in identifying ways to make data collections more efficient, such as using targeted or responsive designs. Such designs rely on the availability of high-quality data that is predictive of outcomes of interest, such as eligibility or response (Buskirk, Malarek, & Bareham 2014; West et al. 2015). When conducting surveys that use address-based sampling (ABS) frames, such as the National Household Education Surveys (NHES), it is possible to append auxiliary data to the frame that can then be used to facilitate targeted or responsive designs. This is especially critical for mailed surveys such as the NHES, which tend to lack rich paradata. However, studies evaluating the quality of auxiliary data find that it can suffer from high rates of missingness and that the available information may be of varied quality (DiSogra, Dennis, & Fahmi 2010; Pasek et al. 2014).

Building on the efforts of Buskirk et al. (2014), West et al. (2015), and others, this presentation will report on the quality and utility of a newly acquired commercial data source that was appended to the NHES’s ABS frame. Prior frames only included about 15 address characteristic variables provided by the frame vendor, and those variables appear to be of limited utility for predicting NHES survey outcomes (Jackson, Steinley, & McPhee 2017). The new commercial data source is promising because it includes about 300 additional variables on topics ranging from voting history to commercial behavior. However, its potential use involves some challenges; for example, while the NHES is a survey of addresses, the commercial data consists of person-level records that the vendor matches to addresses, and which researchers then must translate to address-level characteristics prior to analysis.

This presentation will report on the steps taken to prepare this commercial data for use in developing a responsive design for NHES:2019 and to assess the quality of the data itself. First, we will report on the rate at which the commercial data vendor was able to match records to the NHES:2016 and NHES:2017 frames. We also will assess the quality of the person-level matches and discuss the procedures used to generate address-level characteristics from person-level records. Next, we will report on the extent of missing data among matched cases, both at the address level (what percentage of variables are missing for this address?) and at the variable level (what percentage of addresses are missing data on this variable?). We also will report on the agreement rate between the commercial data and NHES survey responses (for example, when the commercial data suggests there is a child present, did respondent households also report having at least one child on the survey?). Finally, we will touch on whether incorporating this data increases the predictive power of models predicting key survey outcomes, such as response or eligibility.
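
The three quality measures named above (address-level missingness, variable-level missingness, and agreement with survey reports) can be sketched in a few lines. This is only a toy illustration: the addresses, the variable names (`child_present`, `voted_2016`, `income_band`), and every value are hypothetical, and the actual NHES procedures are not described in enough detail here to reproduce.

```python
# Hypothetical address-level records after collapsing person-level matches.
# None marks a variable the vendor could not supply for that address.
frame = {
    "A1": {"child_present": True,  "voted_2016": True,  "income_band": None},
    "A2": {"child_present": False, "voted_2016": None,  "income_band": None},
    "A3": {"child_present": None,  "voted_2016": False, "income_band": "mid"},
    "A4": {"child_present": True,  "voted_2016": True,  "income_band": "high"},
}
# Hypothetical survey reports of child presence, for responding addresses only.
survey_child = {"A1": True, "A2": True, "A4": True}

variables = ["child_present", "voted_2016", "income_band"]

# Address level: share of variables missing for each address.
address_missing = {
    addr: sum(rec[v] is None for v in variables) / len(variables)
    for addr, rec in frame.items()
}

# Variable level: share of addresses missing each variable.
variable_missing = {
    v: sum(rec[v] is None for rec in frame.values()) / len(frame)
    for v in variables
}

# Agreement: among respondents with a non-missing commercial value,
# the share whose commercial value matches the survey report.
comparable = [a for a in survey_child if frame[a]["child_present"] is not None]
agreement = sum(
    frame[a]["child_present"] == survey_child[a] for a in comparable
) / len(comparable)
```

Restricting the agreement rate to addresses with a non-missing commercial value is one possible design choice; counting missing values as disagreements would give a lower figure and answers a slightly different question.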

The results of these analyses will help researchers gain insight into the feasibility of appending this type of commercial data to ABS survey data.


Feedback Loop: Using Surveys to Build and Assess RBS Religious Flags

Final candidate for the monograph

Dr David Dutwin (SSRS) - Presenting Author

Commercial data sources such as voter registration databases (registration-based sample or RBS) are used for a wide variety of purposes in the U.S., from targeting voters and election polling to consumer targeting and traditional survey research. Such databases include not only the voting histories of hundreds of millions of Americans but also a range of appended consumer variables, from magazine subscriptions to demographics, many based on actual behaviors and others modeled. The SSRS omnibus is a large-scale survey that has run for over 30 years, with somewhere between 50,000 and 100,000 interviews conducted annually with a simple random sample dual-frame design. Merging these two sets of data offers two opportunities. First, models based on self-reported data from the omnibus can be built and applied to the RBS on a range of reported metrics. Second, omnibus data can be used to assess the accuracy of pre-existing variables in the RBS. One example is religion: since the omnibus has asked about religion for a decade, such data can be merged into the RBS both to assess the effectiveness of pre-existing model-based religion metrics in the RBS and to build new religion models. This paper reports on the merging of these two data sources and on the building of two different tree-based models (binary classification tree and CHAID), plus an ensemble of classification trees built by random forest, to score the full RBS database on the probability that each case is of a certain religion (Christian, Catholic, Jewish, Mormon, Muslim, Agnostic, etc.) based on self-reported religion from the SSRS omnibus. It also reports on an assessment of the accuracy of a pre-existing religion variable in the RBS. I then assess the effectiveness of both the pre-existing RBS religion model and the omnibus-based model for being Jewish using a survey of the Jewish population in the Detroit metropolitan area.
This real-world application allows us, at least on a local scale, to assess which of the many models built (varying in tree type [Random Forest, CART, or CHAID], extent of pruning, degree of downsampling, and validation type) is most accurate at modelling the Jewish population. Additionally, I report on the overall effectiveness of modelling religious groups with a combined omnibus-RBS dataset and the potential utility of such models for consumer research and demography.


Research on Combination of Probability and Nonprobability Samples

Dr Nadarajasundaram Ganesh (NORC at the University of Chicago) - Presenting Author
Dr Edward Mulrow (NORC at the University of Chicago)
Dr Michael Yang (NORC at the University of Chicago)
Ms Vicki Pineau (NORC at the University of Chicago)
Dr Adrijo Chakraborty (NORC at the University of Chicago)

Probability sampling has been the standard basis for design-based inference from a sample to a target population. In the era of big data and increasing data collection costs, however, there has been growing demand for methods that combine data from probability and nonprobability samples in order to improve the cost efficiency of survey estimation without loss of statistical accuracy (or perhaps even with improvements in statistical accuracy). A task force of NORC statisticians has been charged with developing a toolkit that would represent standard approaches to combining probability and nonprobability samples at NORC. This toolkit will have the following components: (1) statistical procedures; (2) computer programs for implementing such procedures; and (3) communication materials to summarize the procedures. This paper reports on one of the methods we are developing. Given the bias and coverage error inherent in nonprobability samples, traditional weighted survey estimators applied to data from such samples may not be statistically valid. We discuss the use of small area estimation models to estimate the bias associated with nonprobability samples, assuming the smaller probability sample yields unbiased estimates. We will discuss both frequentist and hierarchical Bayesian approaches to combining data. Because clients are often interested in simpler solutions, we propose incorporating the efficiency gains associated with the model-based small area estimates into the survey weights using calibration weighting.
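
To make the idea of trading off bias against variance concrete, here is a minimal sketch of a simple composite estimator. It is not the small area estimation model or the calibration approach the authors describe; it is a swapped-in, much simpler technique that assumes the two estimates are independent and that the probability-sample estimate is unbiased, and it estimates the nonprobability bias crudely as the raw difference between the two estimates. The function name `combine` and all inputs are hypothetical.

```python
def combine(p_est, p_var, np_est, np_var):
    """Blend a probability-sample estimate with a nonprobability-sample estimate.

    Assumptions (for illustration only): the two estimates are independent,
    the probability-sample estimate is unbiased, and the nonprobability bias
    is approximated by the raw difference between the two estimates.
    """
    bias_hat = np_est - p_est
    # Estimated mean squared error of the nonprobability estimate.
    np_mse = np_var + bias_hat ** 2
    # Weight on the probability estimate that minimizes the combined MSE.
    w = np_mse / (p_var + np_mse)
    combined = w * p_est + (1 - w) * np_est
    combined_mse = w ** 2 * p_var + (1 - w) ** 2 * np_mse
    return combined, w, combined_mse


# Hypothetical inputs: a small probability sample and a large, biased
# nonprobability sample estimating the same proportion.
combined, w, combined_mse = combine(p_est=0.50, p_var=0.0004,
                                    np_est=0.56, np_var=0.0001)
```

The larger the estimated bias, the closer `w` moves to 1 and the more the result leans on the probability sample. Because the raw difference is itself a noisy bias estimate, this sketch will tend to misjudge the weight for any single estimate, which is one plausible motivation for model-based approaches that borrow strength across domains.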