BigSurv18 program







Big Data Methods, Small Survey Nonresponse?

Chair: Mr Roeland Beerten (Statistics Flanders)
Time: Friday 26th October, 14:15 - 15:45
Room: 40.213

Advances in Modelling Attrition: The Added Value of Paradata and Machine Learning Algorithms

Dr Peter Lugtig (Utrecht University) - Presenting Author
Professor Annelies Blom (University of Mannheim)


Panel surveys are notoriously expensive to maintain, mainly because attrition decreases the available sample size over time. However, models that aim to predict attrition often have low predictive power, thus challenging efforts to counter attrition during data collection. With the advent of online panels, data about the survey process (paradata) have become more easily and more abundantly available. We argue that such paradata should be used in statistical models that aim to predict attrition. We use data from the German Internet Panel (GIP) to test whether a wide range of paradata improve prediction and find that they do. We use data from the 2012 recruitment of the GIP to train models and then use the 2014 recruitment data to test whether the strongest predictors of attrition at every wave generalize across recruitment rounds. The machine learning models allow us to identify very specific subgroups that are at a high risk of attrition and that may be targeted during fieldwork to prevent attrition before it occurs.
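To make the cohort-based validation design concrete, here is a minimal sketch in Python of training an attrition classifier on one recruitment cohort and testing it on another. Everything below is an illustrative assumption, not the authors' actual specification: the data are synthetic stand-ins for the GIP paradata, and the model choice is arbitrary.

```python
# Minimal sketch: fit an attrition model on one recruitment round and test
# whether it generalizes to another. Synthetic data stands in for the GIP
# paradata; nothing here reflects the actual GIP variables or results.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Stand-ins for the 2012 (training) and 2014 (test) recruitment cohorts.
X_2012, y_2012 = make_classification(n_samples=2000, n_features=20, random_state=12)
X_2014, y_2014 = make_classification(n_samples=2000, n_features=20, random_state=14)

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X_2012, y_2012)

# Do the predictors learned on the 2012 cohort generalize to the 2014 cohort?
# (On these independent synthetic cohorts the AUC will hover around 0.5;
# real recruitment rounds share structure, which is what the study tests.)
pred = model.predict_proba(X_2014)[:, 1]
print("AUC on the 2014 cohort:", round(roc_auc_score(y_2014, pred), 3))
```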


Preserving our Precious Respondents: Predicting and Preventing Non-Response and Panel Attrition by Analyzing and Modeling Longitudinal Survey and Paradata Using Data Science Techniques

Mr Joris Mulder (CentERdata /Tilburg University) - Presenting Author
Dr Natalia Kieruj (CentERdata / Tilburg University)


Non-response and attrition of panel members are two well-known issues in online survey research, and both have a negative effect on panel and data quality. Non-response is especially problematic because it is a selective (non-random) process that causes response bias: respondents who do respond tend to differ systematically from those who do not. A high attrition rate is also costly, since it necessitates more frequent recruitment of new respondents. Furthermore, given the longitudinal character of online panels, it is highly desirable to retain as many respondents in the panel as possible.

There are generally two approaches to addressing non-response and attrition: (1) adjusting and correcting afterwards (e.g. weighting the data, recruiting new panel members) and (2) minimizing non-response before and during the data-collection stage and preventing panel members from dropping out of the panel. In this research we focus on the latter by applying data science techniques.

The LISS panel is a probability-based online panel in the Netherlands, representative of the Dutch population. It is based on a true probability sample of households drawn from the population register by Statistics Netherlands. The panel consists of 4,500 households, comprising 7,000 individuals, and has been in full operation since October 2007. Even though survey response rates in the LISS panel are generally quite high and attrition rates are kept relatively low (e.g. through monetary incentives, reminder e-mails, and a cost-free helpdesk), the issues of non-response and attrition still persist.

Over the last 11 years we have collected not only a vast amount of survey data, but also a very rich and diverse collection of metadata and paradata. In this research we introduce new prediction models based on these self-reported survey data (e.g. health, personality, religion, norms, and political values), socio-demographic data (e.g. profession, education, age, sex, household composition, income, rural or urban living area), and metadata and paradata (e.g. number of surveys per respondent per month, sample designs, survey non-response, time taken to complete surveys, time of year, survey topic, received incentives, device and browser information, time stamps). The prediction models are developed using data science and machine learning techniques for prediction and classification, including pattern recognition, causal inference, and model selection (e.g. k-nearest neighbours, linear and quadratic discriminant analysis, cross-validation, and tree-based methods such as random forests, bagging, and boosting).
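As a rough illustration of this model-comparison step, the following sketch cross-validates several of the named classifiers on a synthetic stand-in for the LISS feature matrix. The feature set, labels, and hyperparameters are assumptions for demonstration only.

```python
# Sketch: cross-validated comparison of candidate dropout classifiers.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a feature matrix (survey, socio-demographic,
# metadata, and paradata features) and a binary dropout label.
X, y = make_classification(n_samples=3000, n_features=25, weights=[0.85],
                           random_state=0)

models = {
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=15),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}

# Compare the classifiers on the same folds using a common metric.
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```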

With these models we gain insight into the underlying reasons for non-response and attrition, and by applying them we can predict and anticipate these issues more accurately. Interventions can then be made to minimize non-response and attrition, so that (sub)samples yield a more representative response. These interventions will ultimately benefit overall panel and data quality.


Using Predictive Modeling to Identify Panel Dropouts

Dr Jan-Philipp Kolb (GESIS) - Presenting Author
Dr Christoph Kern (University of Mannheim)
Dr Bernd Weiß (GESIS)

Panel surveys provide a valuable data source for investigating a wide range of substantive research questions and are used extensively in the social sciences and related disciplines. However, panel data quality can be challenged substantially by panel attrition, which is often considered one of the most severe problems of longitudinal data collection. In its most critical form, panel attrition is driven by selective nonresponse patterns, eventually leading to decreasing sample sizes and biased estimates. Consequently, it is of utmost importance to identify panelists who are at risk of dropping out of a panel, as this identification can be used to pre-correct for systematic nonresponse over time.

Once it is possible to identify at-risk panelists, interventions can be implemented that aim to motivate these panelists to further participate in the panel study (responsive survey design). However, identifying potential drop-outs is a challenging task given the vast amount of information that is typically available in panel studies. This is particularly the case for online panels, which allow collecting extensive process-based paradata.

In this study, we aim to utilize machine learning methods with a diverse set of predictor variables to tackle panel attrition from a prediction perspective. We study attrition in the GESIS Panel, a bi-monthly probability-based mixed-mode access panel of the German population (n = 4,700). With a large share of panelists participating online (about 60%), different types of datasets can be used to derive a considerable number of potential features. In addition to socio-demographic and substantive variables, these include process-based paradata as well as extensive data from the panel management. Feeding this information to supervised machine learning methods offers a promising avenue for building an effective nonresponse prediction model, as these methods can model complex relationships across many features without requiring the models' functional form to be specified in advance.

In this setting, the present study investigates the potential of using features from diverse sources to predict drop-outs in the GESIS Panel, studying prediction performance across models and exploring variable importance measures. Different statistical learning techniques will be employed and compared with respect to prediction accuracy. The results can serve as a guideline for developing an effective model for predicting panel drop-outs in advance.

Preliminary findings suggest that random forests yield the best results in terms of precision and recall, whereas variable importances vary across the statistical learning techniques. While the current approach uses extensive data from one pair of panel waves, further research will aim to utilize multiple panel waves for nonresponse prediction.
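As an illustration of this evaluation step, here is a minimal sketch, on synthetic stand-in data, of how precision, recall, and variable importances might be computed for a random forest dropout classifier. Nothing below uses the GESIS Panel data or reproduces its results.

```python
# Sketch: precision/recall evaluation and variable-importance ranking for a
# random forest dropout classifier, on synthetic stand-in data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: most panelists stay, a minority drops out.
X, y = make_classification(n_samples=4700, n_features=30, weights=[0.8],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
pred = rf.predict(X_test)
print("precision:", round(precision_score(y_test, pred), 3))
print("recall:   ", round(recall_score(y_test, pred), 3))

# Variable importance: which features drive the dropout prediction?
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top feature indices:", top)
```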


Operational Challenges in Gaining and Maintaining Survey Respondents' Cooperation to Supplement Survey Data with "Big Data" Collected Through a Custom Smartphone Application

Dr Kristine Wiant (RTI International) - Presenting Author
Ms Diana Greene (RTI International)
Ms Joli Brown (RTI International)
Ms Ellen Causey (RTI International)

The use of custom smartphone applications to gather data on research participants' behavior offers survey researchers opportunities to supplement survey data collection with "big data". With these opportunities come challenges in gaining and maintaining respondents' cooperation for a more invasive and ongoing collection of information. In this presentation, we provide a case study of a custom smartphone application designed to record the amount of time spent in convenience stores. The smartphone data are analyzed as a proxy for exposure to public-education advertisements present in these stores.

The case study comes from a four-wave longitudinal study that evaluates the effectiveness of a public-education media campaign designed to motivate current cigarette smokers aged 25-54 to quit smoking. Print versions of the public-education advertisements appear primarily in convenience stores where tobacco products are sold. Baseline survey data are collected as part of in-person interviews in 15 U.S. counties in which the campaign is active and in 15 U.S. control counties in which it is not. At the end of the survey, participants who have a smartphone are asked to install the custom smartphone application.

When participants consent to audio-recording, portions of the interview interaction, including the process of gaining consent for and downloading the smartphone application, are recorded. Based on our review of these recordings, we will summarize qualitative data on respondents' questions, concerns, and challenges in downloading and installing the smartphone application, and will discuss the interventions we implemented in response. Finally, we will present initial cooperation rates, rates of participation drop-off within the first three months of installing the smartphone application, and the strategies we used to encourage ongoing participation in the smartphone data collection.
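As a purely hypothetical illustration of the kind of measurement such an app performs, the sketch below computes dwell time from geofence enter/exit events. The event format and values are invented and do not describe RTI's actual application.

```python
# Hypothetical sketch: total time spent inside a store geofence, computed
# from (timestamp, event) pairs that an app might log. Illustration only.
from datetime import datetime

events = [  # invented example log
    ("2018-06-01T10:02:00", "enter"),
    ("2018-06-01T10:09:30", "exit"),
    ("2018-06-03T17:40:00", "enter"),
    ("2018-06-03T17:44:00", "exit"),
]

total_seconds = 0.0
entered_at = None
for ts, kind in events:
    t = datetime.fromisoformat(ts)
    if kind == "enter":
        entered_at = t
    elif kind == "exit" and entered_at is not None:
        total_seconds += (t - entered_at).total_seconds()
        entered_at = None  # close out the visit

print(f"minutes in store (proxy for ad exposure): {total_seconds / 60:.1f}")
```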