BigSurv18 program







Enhancing Survey Quality With Big Data

Chair: Dr Daniel Oberski (Utrecht University)
Time: Saturday 27th October, 11:00 - 12:30
Room: 40.010

Applying the Multi-Level/Multi-Source (MLMS) Approach to the 2016 General Social Survey

Dr Tom W. Smith (NORC) - Presenting Author
Dr Jaesok Son (NORC)
Mr Benjamin Schapiro (NORC)

The MLMS approach augments data from interviews with data from the sample frame, paradata, and auxiliary data sources. For the 2016 General Social Survey the sample frame was constructed from data from the US Census and American Community Survey (ACS) and the address listings of the post office. Data from the Census/ACS were used at the block/block-group, tract, and community levels, while the postal information was at the address level. Paradata include observational data at the household and neighborhood level (e.g. physical condition of the housing unit and of the neighborhood) and process data at the household level (e.g. records of attempts). The auxiliary data consist of household-level linkage to several major commercial databases and aggregate-level information on a host of geographically linked measures (e.g. crime rates, pollution levels, commercial development, health-care resources, economic conditions, voting).
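As a concrete illustration of this kind of record linkage, a minimal sketch in Python is given below. The file names, column names, and geographic keys are hypothetical placeholders, not those of the GSS production system.

    import pandas as pd

    # Hypothetical inputs: one row per sampled address, plus household-level
    # paradata and tract-level auxiliary data keyed on a tract identifier.
    frame = pd.read_csv("sample_frame.csv")          # case_id, address, tract_id, ...
    paradata = pd.read_csv("paradata.csv")           # case_id, n_attempts, hu_condition, ...
    auxiliary = pd.read_csv("tract_auxiliary.csv")   # tract_id, crime_rate, pollution, ...

    # Household-level linkage on the case identifier, then aggregate-level linkage
    # on geography; every sampled case keeps a row even if never interviewed.
    mlms = (frame
            .merge(paradata, on="case_id", how="left")
            .merge(auxiliary, on="tract_id", how="left"))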

Since these variables exist for all sampled cases, they can be used to measure and model non-response bias, to compare interview and database information, and for other methodological purposes. For the completed cases, the added variables can be used to study aggregate-level, contextual effects. For example, this includes looking at the impact of local crime rates on attitudes towards law enforcement, the level of pollution on support for government spending for the environment, and the ethnic/racial diversity of neighborhoods on intergroup relations. Altogether, across the three sources of added data, hundreds of variables are appended to the variables from the interviews themselves.
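Because the appended variables are observed for respondents and nonrespondents alike, one simple way to examine non-response bias is a response-propensity model. The sketch below assumes the merged case file from the previous example and a hypothetical 0/1 "responded" flag; it is illustrative, not the GSS analysis itself.

    import statsmodels.formula.api as smf

    # Logistic regression of response on a few appended frame/paradata/auxiliary
    # variables; coefficients far from zero flag variables associated with response.
    propensity = smf.logit(
        "responded ~ hu_condition + n_attempts + crime_rate + pollution",
        data=mlms,
    ).fit()
    print(propensity.summary())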

Besides supporting the methodological and substantive analyses described above, the project also expands the total survey error paradigm to incorporate data from these supplementary sources. Essentially, Total Survey Error + Auxiliary Data Error + Linkage Error equals Total Data Error.



Using Multiple Imputation of Latent Classes (MILC) to Construct Consistent Population Census Tables Using Data From Multiple Sources

Ms Laura Boeschoten (Tilburg University) - Presenting Author
Mr Jacco Daalmans (Centraal Bureau voor de Statistiek)
Professor Ton De Waal (Centraal Bureau voor de Statistiek)
Professor Jeroen Vermunt (Tilburg University)


National Statistical Institutes (NSIs) often use large datasets to estimate population tables on many different aspects of society. A way to create these rich datasets as efficiently and cost-effectively as possible is by utilizing already available administrative data. When more information is required than is already available, administrative data can be supplemented with survey data. A major problem, however, is that both surveys and administrative data can contain misclassification.

To overcome the issue of misclassification in both sources, a method has been developed which combines multiple imputation (MI) and latent class (LC) analysis (MILC). This method estimates the misclassification and simultaneously imputes a new variable that is corrected for that misclassification. Furthermore, uncertainty due to misclassification is incorporated by using multiple imputations. Edit rules can be incorporated in the MILC method, which prevent impossible combinations of scores from occurring in the multiply imputed datasets.
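A minimal sketch of the core idea is given below: a latent class model is fitted to two misclassification-prone indicators of the same binary attribute, and a corrected variable is multiply imputed from the posterior class memberships. The data and misclassification rates are invented, and the full MILC method additionally uses a bootstrap step to propagate parameter uncertainty and can enforce edit rules, which this sketch omits.

    import numpy as np

    rng = np.random.default_rng(0)

    def fit_latent_class(Y, n_classes=2, n_iter=200):
        """EM for a latent class model with binary indicators, assuming
        conditional independence of the indicators given the latent class."""
        n, p = Y.shape
        pi = np.full(n_classes, 1.0 / n_classes)             # class proportions
        theta = rng.uniform(0.2, 0.8, size=(n_classes, p))   # P(indicator = 1 | class)
        for _ in range(n_iter):
            # E-step: posterior class membership probabilities per record
            log_post = Y @ np.log(theta).T + (1 - Y) @ np.log(1 - theta).T + np.log(pi)
            log_post -= log_post.max(axis=1, keepdims=True)
            post = np.exp(log_post)
            post /= post.sum(axis=1, keepdims=True)
            # M-step: update class proportions and conditional probabilities
            pi = post.mean(axis=0)
            theta = np.clip((post.T @ Y) / post.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
        return pi, theta, post

    # Invented example: a register and a survey measure the same binary attribute,
    # each with some misclassification relative to the unobserved true value.
    n = 5000
    true = rng.binomial(1, 0.3, n)
    register = np.where(rng.random(n) < 0.05, 1 - true, true)   # 5% error
    survey = np.where(rng.random(n) < 0.10, 1 - true, true)     # 10% error
    Y = np.column_stack([register, survey])

    pi, theta, post = fit_latent_class(Y)

    # MI step: draw the corrected variable several times from the posterior so that
    # misclassification uncertainty carries over into the imputed datasets
    # (latent class labels may need aligning with the substantive categories).
    m = 5
    imputations = [(rng.random(n)[:, None] < post.cumsum(axis=1)).argmax(axis=1)
                   for _ in range(m)]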

A previous simulation study investigated the performance of the MILC method. Here, it was concluded that its performance was strongly related to the entropy R² value of the LC model. The entropy R² indicates how well the LC model can predict class membership based on the observed variables. It was also concluded that the population registries and sample surveys used at Statistics Netherlands are generally of sufficient quality for the MILC method to obtain reliable estimates.
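One widely used formulation of this measure - the normalized-entropy criterion computed from posterior class membership probabilities - is sketched below; the exact baseline used in the MILC work may differ slightly, so this is indicative only.

    import numpy as np

    def entropy_r2(post):
        """Scaled entropy of posterior class probabilities: values near 1 indicate
        well-separated classes, values near 0 indicate poor class separation."""
        n, k = post.shape
        p = np.clip(post, 1e-12, 1.0)
        observed_entropy = -(p * np.log(p)).sum()
        return 1.0 - observed_entropy / (n * np.log(k))

    print(entropy_r2(post))   # 'post' as returned by the latent class fit sketched above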

Although the previously mentioned conclusions already give some guidance on when data are suitable for applying the MILC method, a number of issues have yet to be investigated before we can apply the MILC method to construct population census tables. More specifically, investigation is needed on how the MILC method handles (1) multiple latent variables, (2) variables with large numbers of categories, (3) the application of edit restrictions on multiple cells and (4) the simultaneous imputation of missing values.

To investigate how the MILC method performs when handling these issues, an existing population census cross-table from the Netherlands is used as the starting point for a simulation study. Here, indicators are generated containing misclassification rates that can be considered high for Dutch population registry and sample survey data. Furthermore, missingness is induced using a MAR mechanism.
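The sketch below illustrates this kind of data generation with invented rates and variables: an indicator is copied from a "true" categorical variable with some misclassification, and values are then set to missing with a probability that depends only on an observed covariate (a MAR mechanism).

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10000

    true = rng.integers(0, 4, n)          # "true" category with four levels
    age_group = rng.integers(0, 3, n)     # observed covariate driving missingness

    # Indicator with an (illustrative) 8% misclassification rate: errors are moved
    # to a uniformly chosen other category.
    indicator = true.copy()
    err = rng.random(n) < 0.08
    indicator[err] = (true[err] + rng.integers(1, 4, err.sum())) % 4

    # MAR missingness: the probability of being missing depends only on the
    # observed covariate, not on the (possibly unobserved) true value.
    p_missing = np.array([0.05, 0.15, 0.30])[age_group]
    indicator = indicator.astype(float)
    indicator[rng.random(n) < p_missing] = np.nan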

Results of this simulation study will give us insight into whether the MILC method can be used to obtain consistent population tables, and possibly provide us with information about the limitations of the MILC method.


Using Machine Learning Models to Predict Follow-Up Survey Participation in a Panel Study

Final candidate for the monograph

Dr Mingnan Liu (Facebook) - Presenting Author
Mrs Yichen Wang (Uber)


Panel attrition is one of the main challenges survey researchers face when conducting longitudinal or panel studies. Panel attrition means that participants from a previous wave of a survey stop participating in follow-up surveys. This phenomenon poses challenges to the data quality of panel studies. Therefore, researchers and practitioners are constantly exploring ways to motivate panel survey participants to remain in the panel and take part in follow-up studies. The likelihood of attrition varies from one respondent to another, so the most effective approach to preventing panel attrition is to identify the respondents who are most likely to stop participating in follow-up surveys and focus on getting those respondents to participate.

In this study, we will use machine learning models to predict the likelihood of participating in the follow-up survey in a national telephone panel study. Given that respondents have already participated in one wave of data collection, we now have relatively rich data, including responses to survey questions, interviewer observations, and paradata. We will use supervised learning techniques to predict the likelihood of participating in the follow-up survey. Specifically, we will present results from several classification models, such as logistic regression, k-nearest neighbors, and random forest. We will apply these methods to the Surveys of Consumers national telephone survey. This survey has two data collection components, a first interview and a re-interview. We will predict the likelihood of participating in the re-interview with data obtained from the first interview.
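A sketch of this kind of modelling on hypothetical first-interview data with a 0/1 re-interview indicator might look as follows; the file and column names are placeholders, not the Surveys of Consumers data.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical case file: numeric features from the first interview plus a
    # 0/1 indicator of re-interview participation (categorical variables would
    # need encoding before being passed to these estimators).
    df = pd.read_csv("first_interview.csv")
    X = df.drop(columns=["reinterviewed"])
    y = df["reinterviewed"]

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "k-nearest neighbors": KNeighborsClassifier(n_neighbors=15),
        "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
    }
    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: cross-validated AUC = {auc:.3f}")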

Machine learning is not limited to predicting panel attrition. In fact, researchers can use a similar method to predict survey participation propensity in general. In that case, researchers will have to rely on data available from the sampling frame and other paradata (such as interviewer observation data). This technique can also be used on survey panels where respondents voluntarily sign up to take surveys. Identifying panelists who are likely to churn from the panel and applying targeted and timely interventions will have important implications for survey panels. This is a typical example of using machine learning methods to model human behavior.


Designing Surveys to Account for Endogenous Nonresponse

Professor Michael Bailey (Georgetown University) - Presenting Author

Non-response is a large and growing problem in survey research. Weighting can address non-response associated with observable variables, but cannot solve - and may exacerbate - non-response bias associated with unmeasured factors. Selection models can correct for non-response related to both measured and unmeasured factors, but prove either unwieldy or impossible for most conventional survey data. This paper argues that surveys should be designed to provide the information needed to make selection models function properly. In particular, this paper focuses on two tools that enable survey data to be used to assess selection on unmeasured factors. First, surveys can include questions that elicit willingness to respond independent of the content of responses. Second, by randomly treating some potential respondents with opt-in questions, we produce a variable that explains response but does not affect outcome variables directly. Taken together, these tools allow us to easily assess weighting models' assumption that willingness to respond is unrelated to opinions. Two empirical applications demonstrate the potential for non-response bias to exaggerate polarization and turnout.

Many suspect that non-response bias is an important factor in polling mishaps. Over the last ten years response rates in the U.S. have plummeted and are now under ten percent for landlines and under eight percent for cell phones (Dutwin and Lavrakas 2016; Pew Research Center 2012). Academic surveys are not immune from declining response rates; some important academic polls have even abandoned random sampling, at least as conventionally understood. Potential biases that emerge in such contexts may be less public than for election polls, but are highly troubling nonetheless.

The conventional way to address non-response is via weighting, which produces an effective sample that reflects the target population with respect to selected measurable attributes. Weighting comes with a substantial drawback, however: it fails to correct for non-response associated with unmeasured attributes. That is, weighting fails if the propensity to respond to a survey is endogenous (or non-ignorable), meaning that non-response is related to the content of opinions after controlling for measured variables. Survey researchers using weights rarely diagnose whether the conditions necessary for weighting to be useful are satisfied. One reason why pollsters seldom test for endogenous selection is that selection models are so demanding of data that they are often unusably low-powered and unreliable for survey data.

This paper presents a two-fold strategy for designing surveys so that they produce the kind of data needed to identify endogenous selection. First, surveys can include questions that elicit respondents' propensity to discuss politics independent of their opinions about politics. This information can be used to directly test weighting models' assumption that non-response is ignorable conditional on covariates. Second, pollsters can randomly assign respondents to conditions that affect the probability of response, but do not affect the content of opinions. This can be done in many ways, but it is particularly easy to implement with randomized treatments that inhibit response.
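As a rough illustration of the kind of selection model this design enables, the sketch below simulates endogenous non-response with a randomized response-inducement treatment serving as an exclusion restriction, and then applies a classical two-step (Heckman-style) correction with statsmodels. The variable names and effect sizes are invented, and this is not the estimator used in the paper.

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n = 20000

    # An unmeasured factor drives both the opinion and the decision to respond,
    # so weighting on observables alone cannot remove the resulting bias.
    u = rng.normal(size=n)
    opinion = 1.0 + 0.8 * u + rng.normal(size=n)

    # Randomized treatment (e.g., an easier opt-in) shifts response propensity
    # but, by design, has no direct effect on the opinion itself.
    treat = rng.binomial(1, 0.5, n)
    respond = (0.5 * treat + 0.7 * u + rng.normal(size=n) > 0).astype(int)

    print("naive respondent mean:", opinion[respond == 1].mean())
    print("true population mean:", opinion.mean())

    # Step 1: probit model of response on the randomized treatment,
    # then the inverse Mills ratio from the fitted linear index.
    X_sel = sm.add_constant(treat)
    probit = sm.Probit(respond, X_sel).fit(disp=0)
    xb = X_sel @ probit.params
    mills = norm.pdf(xb) / norm.cdf(xb)

    # Step 2: outcome regression among respondents including the Mills ratio;
    # the intercept estimates the selection-corrected population mean.
    resp = respond == 1
    ols = sm.OLS(opinion[resp], sm.add_constant(mills[resp])).fit()
    print("selection-corrected mean:", ols.params[0])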


Sunday Assemblies: From "Believing Without Belonging" to "Belonging Without Believing"? When Survey and Big Data Combine to Study an Under-Theorized Phenomenon

Dr Francesco Molteni (Università degli Studi di Milano, Dipartimento di Scienze Sociali e Politiche) - Presenting Author
Dr Massimo Airoldi (Lifestyle Research Center, Emlyon Business School)

Drawing from the mixed methods literature, the authors have identified three different ways in which Big Data and Survey Research can be fruitfully combined. By exploratory design we mean the situation in which Big Data are used to explore a certain phenomenon in order to calibrate survey questions, to find the most appropriate administration contexts, or to reach specific and hard-to-reach populations. By complementary design we mean instead a situation in which survey data and Big Data triangulate or converge within a single data structure. Lastly, by interpretative design we mean a situation in which Big Data are used to interpret or validate survey results.

The exploratory design is well suited to studying phenomena that are under-theorized, because it provides the knowledge needed to develop a complete survey instrument while at the same time lessening the researcher’s biases. In this paper we provide an example of good practice for this kind of design.

Sunday Assemblies are a kind of atheist church in which members meet “to celebrate life”. There is no doctrine or deity, but a strong community mission and a strong focus on individual fulfilment. If we refer to the classic literature on religiosity, Sunday Assemblies can hardly be placed within the classical categories of religious behaviour. In fact, they seem to share with the alternatives some kind of “holism” and “broad spirituality”, as well as a focus on individual well-being. Individuals attending Sunday Assemblies practice a great deal, but this practice is in a sense opposed to institutional religious practice and, like those belonging to the secular category, they reject the main institutional belief.

The way in which these communities are present online makes this a perfect case for an exploratory design. Using data from 96 Sunday Assembly Facebook pages (every Sunday Assembly has its own Facebook page) we built a dataset containing 5590 comments on 2432 posts authored by 1861 different users. Starting from this dataset we conducted a data-driven analysis using a text mining technique known as topic modelling to identify six core topics. These topics represent the main nodes around which users discuss their membership in Sunday Assemblies, as well as how they define and perceive themselves as members.
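A minimal sketch of this kind of topic-modelling step, using scikit-learn's LDA implementation on a hypothetical file of comment texts rather than the actual Facebook corpus, is shown below.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical input: one comment per line, already scraped and cleaned.
    comments = open("sunday_assembly_comments.txt", encoding="utf-8").read().splitlines()

    # Bag-of-words representation, dropping very rare and very common terms.
    vectorizer = CountVectorizer(stop_words="english", min_df=5, max_df=0.5)
    dtm = vectorizer.fit_transform(comments)

    # Fit a six-topic LDA model, mirroring the six core topics in the study.
    lda = LatentDirichletAllocation(n_components=6, random_state=0)
    lda.fit(dtm)

    # Print the top ten words per topic to support interpretation.
    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[::-1][:10]]
        print(f"topic {k}: {', '.join(top)}")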

Using the results of this exploratory Big Data analysis, we calibrate a web-based survey that will be administered to the members of the Facebook pages using a snowball approach, starting from the administrators of the main Sunday Assembly page (TheSundayAssembly).

This two-step approach allows us to dig deeper into the most relevant features of these atheist churches - such as their focus on social interactions and community, their continuous boundary-work along the line between religion and atheism, and their emphasis on individual fulfillment - without relying a priori on assumptions that risk biasing the survey design.