BigSurv18 program







Smart TVs, Smartphones, and now Smart Surveys: Building Smarter Surveys Using Big Data Tools

Chair: Dr Stas Kolenikov (Abt Associates)
Time: Friday 26th October, 14:15 - 15:45
Room: 40.109

Augmenting Surveys: An Efficient Framework for the Storing, Querying, and Processing of Big Survey Data

Dr Jack Kuang Tsung Chen (SurveyMonkey) - Presenting Author
Mr Jon Cohen (SurveyMonkey)
Mr Reuben McCreanor (SurveyMonkey)

In the age of big data, we have moved from traditional (telephone or paper-and-pencil) surveys with hundreds of respondents to web surveys taken by hundreds of thousands, or even millions, of respondents at a time. As the cost of digital storage continues to drop exponentially, big survey data are more available than the survey field could have imagined just a decade ago. With mountains of digital information, survey researchers spend a tremendous amount of time on the unglamorous process of querying, cleaning, and aggregating data. In practice, the majority of the time in most survey-based research goes to finding where the data are stored, merging across time points, surveys, or questions, and transforming the data into a form that can be modeled. This paper aims to improve this inefficient process by proposing a distributed storage and data query system specifically designed for augmenting survey data. The proposed system enables survey researchers to streamline analyses across multiple surveys that would have been extremely labor-intensive, if not impossible, with traditional tools.

To begin designing this system, we first ask the fundamental question: how can we best store big data from multiple surveys to optimize information retrieval? Survey data is inherently both relational (multiple respondents answer a common set of questions) and temporal (responses to the same question can vary over time). Thus, we build upon the advantages of NoSQL for bulk relational storage and InfluxDB for time series storage to design a distributed system comprising four main data models: survey data, question data, response data, and respondent paradata. By designing a data storage solution from the ground up, we are able to optimize the query process across multiple surveys.
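As a rough illustration only, the following Python sketch shows how the four data models could be represented as simple record types; the field names are assumptions for exposition and not the authors' actual schema.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Survey:
    survey_id: str
    title: str
    question_ids: List[str] = field(default_factory=list)

@dataclass
class Question:
    question_id: str
    survey_id: str
    text: str                     # e.g. "Do you approve of ...?"

@dataclass
class Response:
    respondent_id: str
    question_id: str
    answered_at: datetime         # temporal key, suited to time-series storage
    value: str

@dataclass
class RespondentParadata:
    respondent_id: str
    survey_id: str
    device: str                   # e.g. "smartphone"
    time_on_page_seconds: float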

Given a simple search term such as “Trump approval”, the system uses Elasticsearch to parse the survey questions, collecting matching questions across multiple surveys until a single merged survey, or murvey, is created. Each search is conducted in parallel across the storage nodes to optimize retrieval time. Basic data cleaning, filtering, and post-survey adjustment can be performed before the aggregated response data are returned to the user.
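A minimal sketch of this query flow, assuming the elasticsearch-py client and a hypothetical "questions" index with survey_id, question_id, and text fields; it illustrates the idea rather than the authors' implementation.

from typing import Dict, List
from elasticsearch import Elasticsearch

def build_murvey(es: Elasticsearch, search_term: str) -> List[Dict[str, str]]:
    """Collect matching questions across surveys into one merged survey ('murvey')."""
    result = es.search(
        index="questions",                        # hypothetical index of question documents
        query={"match": {"text": search_term}},   # e.g. "Trump approval"
        size=1000,
    )
    # Each hit names a question and its parent survey; the corresponding response
    # data would then be fetched in parallel from the storage nodes and aggregated.
    return [
        {"survey_id": hit["_source"]["survey_id"],
         "question_id": hit["_source"]["question_id"]}
        for hit in result["hits"]["hits"]
    ]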

Whether the goal is tracking consumer confidence, presidential approval ratings, or customer engagement scores, the scope of online surveys continues to grow rapidly. By creating a new distributed system for storing and querying big survey data across multiple surveys, we not only offer a solution that scales as the data grow, but also the potential to dramatically reduce the time needed to turn raw survey data into actionable insights.


Comparing Coding of Interviewer Question-Asking Behaviors Using Recurrent Neural Networks to Human Coders

Mr Jerry Timbrook (University of Nebraska-Lincoln) - Presenting Author
Dr Adam Eck (Oberlin College)


Surveyors often employ standardized interviewing procedures to reduce interviewer effects on data quality (Fowler and Mangione 1990). Thus, assessing whether or not interviewers read survey questions exactly as worded (a tenet of standardized interviewing) is an essential component of evaluating interviewers' performance. Behavior coding the interviewer-respondent interaction is a common method for identifying these deviations from standardized interviewing (Fowler and Cannell 1996; Schaeffer and Maynard 1996). However, manual behavior coding of an entire survey can be expensive and time-consuming, limiting its use. The rise of machine learning techniques such as Recurrent Neural Networks (RNNs) may offer surveyors a way to partially automate the behavior coding process, saving both time and money. RNNs learn to categorize sequential data, such as speech within a conversation, into categories based on patterns learned from previously categorized examples. Accounting for the sequential nature of speech data allows RNNs to categorize based on semantic context. This is important when coding interviewer question-asking behaviors, as differences between exact question readings, questions asked with minor changes (i.e., changes not affecting question meaning), and questions asked with major changes (i.e., changes affecting question meaning) are often nuanced. Yet the feasibility of an automated RNN-based behavior coding approach and how the accuracy of this approach might compare to manual human-based behavior coding are unknown.

In this research, we compare coding of interviewer question-asking behaviors by manual human coders to the coding performed by RNNs. We use data from Work and Leisure Today II, a dual-frame CATI survey (899 completes; AAPOR RR3=7.8%). Each interview was transcribed and manually behavior coded by humans at the conversational turn level (180,165 total turns, with n=47,900 question-asking turns) to identify when interviewers asked questions exactly as worded, asked questions with minor changes, or asked questions with major changes. With the same interview transcripts, we train RNNs to classify interviewer question-asking behaviors into these same categories using a random subset of the transcripts as learning examples. We then evaluate the quality of the behavior coding of the RNNs on the remaining transcripts.
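A minimal sketch of such a classifier, assuming a Keras-style LSTM over tokenized question-asking turns and a random train/test split; the authors' actual architecture, features, and hyperparameters are not specified in the abstract.

import tensorflow as tf
from sklearn.model_selection import train_test_split

NUM_CLASSES = 3      # exact reading / minor change / major change
VOCAB_SIZE = 10_000  # assumed vocabulary size after tokenization
EMBED_DIM = 64

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),          # token ids -> dense vectors
    tf.keras.layers.LSTM(64),                                   # recurrent layer captures word order
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X: padded token-id sequences for question-asking turns; y: human behavior codes.
# A random subset of transcripts serves as the learning examples, as in the abstract.
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5)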

Inter-coder reliability (kappa) of the manually coded dataset was assessed by graduate research assistants who master-coded a 10% random subset of the manually coded transcripts. Preliminary results from the manually coded dataset indicate that interviewers asked questions exactly as worded 72% of the time, asked questions with minor changes 15% of the time, and asked questions with major changes 13% of the time (kappa=.61). We calculate the classification accuracy rate of each trained RNN (a common machine learning evaluation metric) using the master-coded transcripts, and compute new kappa reliability scores between the master coders and the RNN coders. Then, we compare the kappa values of the manually coded dataset to those of the RNN-coded dataset. We conclude with implications for behavior coding telephone interview surveys using machine learning in general, and RNNs in particular. We also consider future applications of machine learning to survey research.
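For the evaluation step, a short sketch assuming scikit-learn's implementations of classification accuracy and Cohen's kappa; the labels shown are purely illustrative.

from sklearn.metrics import accuracy_score, cohen_kappa_score

# 0 = exact reading, 1 = minor change, 2 = major change (hypothetical codes)
master_codes = [0, 0, 1, 2, 0, 1]   # master coders' labels on the evaluation subset
rnn_codes    = [0, 0, 1, 2, 1, 1]   # RNN predictions for the same turns

print("accuracy:", accuracy_score(master_codes, rnn_codes))
print("kappa:   ", cohen_kappa_score(master_codes, rnn_codes))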


The ODISSEI Data Platform

Final candidate for the monograph

Dr Thomas Emery (NIDI) - Presenting Author

The Open Data Infrastructure for Social Science and Economic Innovations (ODISSEI, http://www.odissei-data.nl/) is the Dutch national infrastructure for the social sciences. The aim of the infrastructure is to coordinate social science data collection efforts within the Netherlands and to ensure their integration with e-infrastructures and their alignment with the research aims of the scientific and policymaking communities.

At the heart of ODISSEI is the development of the ODISSEI data platform, which brings together three crucial components required for cutting-edge research in the social sciences. Firstly, the platform provides researchers with secure remote access to depersonalized administrative data on every individual in the country, held by Statistics Netherlands (CBS, https://www.cbs.nl/en-gb/our-services/customised-services-microdata/microdata-conducting-your-own-research/microdata-catalogue). Secondly, social scientists can import their own data from surveys or other sources and link these to the administrative data, enriching it with the diverse and complex data forms that scientific research requires. The Netherlands has high-quality persistent identifiers for persons and entities that make this process very straightforward and reliable. Finally, the platform is situated within a high-performance computing environment provided by SURFsara (https://www.surf.nl/en/about-surf/subsidiaries/surfsara/), the Dutch high-performance computing facility for science and industry.
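As an illustration of the linkage step, a small pandas sketch joining researcher-supplied survey data to depersonalized administrative records on a hypothetical pseudonymized person identifier; the platform's actual tooling and variable names may differ.

import pandas as pd

survey = pd.read_csv("my_survey.csv")              # researcher-supplied survey data
admin = pd.read_parquet("admin_income.parquet")    # depersonalized administrative records

# A one-to-one join on the persistent person identifier enriches the survey
# records with administrative variables for analysis in the secure environment.
linked = survey.merge(admin, on="person_id", how="left", validate="one_to_one")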

In this presentation we will provide a technical overview of the ODISSEI data platform and its development through its initial prototype and piloting stages. This will cover the security elements of the platform's design and an illustration of the user access process. We will also introduce several use cases drawn from a broad range of disciplines. Each use case has been developed to showcase the diverse functionality of the platform and its unique ability to bring together sensitive administrative data, high-quality scientific data, and high-performance computing in a single user environment. For example, the first project to conduct analysis using the ODISSEI data platform linked administrative records with detailed genetic variant data to examine the contextual determinants of depression and to run genome-wide association studies (GWAS), a record linkage that is only possible within a secure, high-performance data platform.