BigSurv18 program







Exploring How Responsive Designs Respond to Machine Learning Methods

Chair: Dr Gonzalo Rivero (Westat)
Time: Friday 26th October, 16:00 - 17:30
Room: 40.213

Responsive Designs in Practice

Dr Roger Tourangeau (Westat)
Dr Gonzalo Rivero (Westat) - Presenting Author
Mr Brad Edwards (Westat)
Mrs Tammy Cook (Westat)

Responsive designs for data collection are among the most promising approaches for countering falling survey response rates. Adapting the data collection strategy during the field period, using available information about each sample case’s expected probability of cooperation, should increase efficiency and reduce the cost of data collection. Responsive design is especially promising for panel surveys because, after the first wave, far more information is available about the sampled units and their response patterns than in cross-sectional surveys, and this information can be used to mitigate attrition.

However, while appealing in theory, responsive designs pose a number of implementation challenges in practice for face-to-face surveys. First, the design must consider both each case’s predicted propensity to respond and the overall sample composition in order to minimize biases from differential cooperation rates among subgroups. Second, the suggested workplan for the cases each interviewer is directed to work has to account for respondents’ geographical locations, so that the recommendations are feasible and the interviewer’s travel schedule is efficient. Finally, instructions must be communicated in a timely way, while accounting for interviewers’ potential resistance to following instructions based on predictions made by the home office.

In this paper, we discuss Westat’s experience in the design, development, implementation, and testing of a full responsive design model for the PATH Reliability study, a two-wave study on tobacco use. We focus on the analytical components of the design, including the predictive models for response propensity and sample balance, and the optimal routing model that helps interviewers plan their day. We also discuss a set of experiments we deployed to measure compliance under alternative presentations of the instructions, and the total effect of alternative sample collection designs on operational costs.
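As a rough illustration of the analytical machinery such a design relies on (not Westat’s actual implementation), the sketch below fits a simple response propensity model and computes an R-indicator-style balance measure; the file name and column names are hypothetical.

```python
# Illustrative sketch, not the PATH production model: a prior-wave response
# propensity model plus a simple R-indicator-style balance check.
# File and column names below are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

cases = pd.read_csv("wave1_cases.csv")          # hypothetical case file
predictors = ["age_group", "prior_refusal", "contact_attempts", "urbanicity"]
X = pd.get_dummies(cases[predictors], drop_first=True)
y = cases["responded_wave1"]                    # 1 = completed interview

model = LogisticRegression(max_iter=1000).fit(X, y)
cases["p_hat"] = model.predict_proba(X)[:, 1]   # predicted cooperation propensity

# R-indicator: 1 - 2 * standard deviation of estimated propensities;
# values near 1 indicate a well-balanced respondent pool across subgroups.
r_indicator = 1 - 2 * cases["p_hat"].std()
print(f"Estimated R-indicator: {r_indicator:.3f}")

# Flag low-propensity cases as candidates for extra interviewer effort.
followup = cases.sort_values("p_hat").head(200)
```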


Machine Learning in Adaptive Survey Designs: A Bandit Approach

Mr Rob Chew (RTI International) - Presenting Author
Dr Paul Biemer (RTI International)


Surveys are frequently designed with a great deal of uncertainty about key outcomes (e.g., response rates) and the efficacy of planned interventions (e.g., incentive offers). Adaptive survey design (ASD) is a strategy for mitigating this uncertainty by dynamically adjusting designs during data collection. By modifying the design strategy based on information learned during collection, ASD can help reactively control cost and reduce measurement error or nonresponse bias.

The general approach in adaptive design is to identify potential risks related to costs or survey estimates, to develop appropriate indicators for tracking these risks, and to monitor these indicators during collection. In theory, this provides survey researchers the knowledge to make flexible design changes during the data collection process, helping to control costs or errors. However, in practice, effectively assessing multiple design options can quickly become complicated, as there is a trade-off between the cost and time required for experimenting and the benefit of finding a better design. Given the uncertainty around key design features, how can researchers discover the most effective survey design while minimizing the cost of discovery?

Multi-armed Bandits (MABs) are a suite of algorithms, studied extensively in reinforcement learning (a subdomain of Artificial Intelligence), that balance the trade-off between exploitation and exploration. At each step of the process, MABs address the question of whether to continue with the current best-performing option (exploitation) or to experiment with something else that might prove even better (exploration). This “earning while learning” approach highlights the opportunity cost of examining different treatments, in contrast to traditional randomized experiments in which treatment effectiveness is only assessed after data collection is complete. In particular, this approach offers potential solutions to problems that many survey methodologists face during the experimental phase of a responsive design, where decisions about the main data collection need to be made before the experimental data are completely observed and analyzed.
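To make the exploitation-versus-exploration idea concrete, here is a minimal Thompson-sampling sketch, assuming the arms are hypothetical incentive levels and the reward is whether a released case responds; it illustrates the general technique, not the authors’ implementation.

```python
# Minimal Thompson-sampling bandit sketch (illustrative only).
# Arms are hypothetical design options; reward = case responded (True/False).
import random

arms = ["$0 incentive", "$5 incentive", "$10 incentive"]
successes = {a: 1 for a in arms}   # Beta(1, 1) priors on each arm's response rate
failures = {a: 1 for a in arms}

def choose_arm():
    # Draw a plausible response rate for each arm and play the best draw.
    draws = {a: random.betavariate(successes[a], failures[a]) for a in arms}
    return max(draws, key=draws.get)

def update(arm, responded):
    # Feed the observed outcome back before the next assignment.
    if responded:
        successes[arm] += 1
    else:
        failures[arm] += 1
```

During data collection, each released case would be assigned by choose_arm() and its outcome fed back through update(), so poorly performing designs are tried less often as evidence accumulates.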

This paper details the progress made in using MAB algorithms to reduce unit nonresponse. Field decisions will be compared to simulated recommendations derived from a MAB mechanism to assess the impact on important operational metrics such as cost and response rates.


Mining Interviewer Observations in a Panel Survey

Mr Daniel Guzman (SRO/University of Michigan) - Presenting Author


Gaining the cooperation of survey respondents is becoming increasingly difficult. A large body of literature documents the decline in survey participation over the last decade, and, as a result, survey data collection is becoming more expensive. Finding strategies to make data collection more cost-effective is a priority for large studies.

The use of ancillary data can assist with tailored approaches to gaining participant cooperation based on the main factors preventing survey participation. In particular, paradata is a useful source of information for all cases (respondents and non-respondents). Sample management systems often keep track of call records and observations about each interviewer-householder interaction. For example, interviewers are asked to describe in their own words the nature of each attempt, even if there is no contact with a respondent or informant.

Records of interviewer observations are a rich source of information, but they can be hard to analyze. In some cases, interviewer observations are grouped into pre-specified categories. However, in an ever-changing society, it can be difficult to anticipate the main factors affecting survey participation. Also, the richness of the full description of interviewer observations is lost in the closed categorization.

In this paper, I plan to use paradata from the Health and Retirement Study (HRS), a longitudinal survey of a representative sample of people in the United States over age 50 that interviews approximately 20,000 respondents every two years. The objective is to use text mining methods, such as topic modeling and sentiment analysis, to analyze household-level interviewer observations from previous waves and predict respondent participation in the next wave. The estimated propensities can then be used as an input for survey managers to plan differential strategies for low-propensity cases to improve survey participation.
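A minimal sketch of such a pipeline, assuming hypothetical file and column names rather than actual HRS paradata, might look like the following: free-text interviewer observations are converted into topic shares, which then feed a simple propensity model for next-wave participation (sentiment scores could be added as further predictors).

```python
# Illustrative pipeline, not HRS production code. File and column names
# are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

notes = pd.read_csv("call_record_notes.csv")    # one row per household
vec = CountVectorizer(max_features=5000, stop_words="english")
dtm = vec.fit_transform(notes["observation_text"].fillna(""))

lda = LatentDirichletAllocation(n_components=10, random_state=0)
topics = lda.fit_transform(dtm)                 # per-household topic shares

X = pd.DataFrame(topics, columns=[f"topic_{i}" for i in range(10)])
y = notes["participated_next_wave"]             # observed in the later wave

clf = LogisticRegression(max_iter=1000).fit(X, y)
notes["propensity"] = clf.predict_proba(X)[:, 1]
low_propensity = notes[notes["propensity"] < 0.3]   # flag for tailored follow-up
```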


Predicting Response Mode Preferences of Survey Respondents: A Comparison Between Traditional Regression and Data Mining Methods

Ms Mahi Megra (American Institutes for Research) - Presenting Author
Dr Rebecca Medway (American Institutes for Research)
Mr Michael Jackson (American Institutes for Research)
Ms Cameron McPhee (American Institutes for Research)

An effective tailored survey data collection protocol can increase response rates and efficiency (Dillman, 2014; Stern, 2014). In this vein, the National Household Education Surveys (NHES) program has been experimenting with model-driven responsive designs that target subgroups with varied data collection protocols based on what is predicted to be the most effective approach for those groups. Recently, researchers have begun to assess the utility of data mining techniques for predicting survey outcomes (Buskirk, 2015; McCarthy, 2009; Phipps, 2012). These methods are of particular interest for the NHES given the recent addition of nearly 300 commercial data variables to the NHES address-based sampling frame. In this presentation, we will compare traditional modeling approaches with ensemble methods to assess whether nonparametric methods offer an improvement in predicting response mode preference for the NHES.

The data come from the 2016 NHES data collection. While the majority of NHES:2016 cases were assigned to a paper-only protocol, a random subset was assigned to an experimental web-push condition. Offering the web option was successful: it increased the overall response rate for this two-phase survey and improved data collection and data processing efficiency. However, it had the negative effect of decreasing the screener response rate, especially among particular subgroups, suggesting that some sample members still prefer the paper-only protocol. Hence, the next NHES administration will include an experiment in which cases that are predicted to prefer paper response (based on response patterns in NHES:2016) will be assigned to a paper-only protocol.

We will compare three approaches for developing a model that identifies cases that are likely to prefer to respond by paper. These approaches vary in whether they use a traditional or a nonparametric method for selecting predictor variables and for modeling the response outcome. For the first approach, we will use stepwise selection to identify the variables to include in the model, and a binary logistic regression model with survey response status as the outcome variable. The second approach will also use a binary logistic regression model; however, we will use a conditional inference tree to select predictor variables. The logistic regression models will include a mode condition indicator (i.e., paper vs. web) for each case. After these models are developed and actual response propensities obtained, we will also calculate counterfactual response propensities by assigning each case the opposite mode condition indicator value. The difference between the actual and counterfactual response propensities will allow us to identify mode preference for each case.

Finally, for the third approach, conditional forests will be used both for variable selection and for modeling the response outcome. Here, we will grow two conditional forests, one for the paper cases and another for the web cases. After growing the forests based on each case’s assigned mode, we will use the other forest to obtain the counterfactual response propensity. The results of this analysis will help researchers assess whether there is a benefit (or cost) to using nonparametric data mining methods rather than traditional regression for predicting response mode preference.
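A minimal sketch of the counterfactual-propensity idea behind the forest-based approach is shown below, using scikit-learn random forests as a stand-in for conditional forests (which are typically fit with R’s partykit) and hypothetical file and column names.

```python
# Illustrative sketch of the two-forest counterfactual propensity comparison.
# Random forests stand in for conditional forests; names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

frame = pd.read_csv("nhes_frame.csv")
features = [c for c in frame.columns if c.startswith("commercial_")]

paper = frame[frame["assigned_mode"] == "paper"]
web = frame[frame["assigned_mode"] == "web"]

rf_paper = RandomForestClassifier(n_estimators=500, random_state=0)
rf_paper.fit(paper[features], paper["responded"])
rf_web = RandomForestClassifier(n_estimators=500, random_state=0)
rf_web.fit(web[features], web["responded"])

# Propensity under each mode for every case: the assigned-mode forest gives the
# "actual" propensity, the other forest the counterfactual one. A positive
# difference suggests the case would respond better under the paper protocol.
frame["p_paper"] = rf_paper.predict_proba(frame[features])[:, 1]
frame["p_web"] = rf_web.predict_proba(frame[features])[:, 1]
frame["paper_preference"] = frame["p_paper"] - frame["p_web"]
prefers_paper = frame[frame["paper_preference"] > 0]
```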