BigSurv20 program


Big data enhancements to surveys: Methods and tools

Moderator: Stas Kolenikov
Friday 27th November, 11:45 - 13:15 (ET, GMT-5)
8:45 - 10:15 (PT, GMT-8)
17:45 - 19:15 (CET, GMT+1)

Linking Twitter and survey data: Quantity and its impact

Mr Tarek Al Baghal (University of Essex)
Dr Alex Wenz (University of Mannheim) - Presenting Author
Dr Luke Sloan (Cardiff University)
Dr Curtis Jessop (NatCen)

Linking survey and social media data has the potential to be a unique source of information for social research. While the potential usefulness of this methodology is widely acknowledged, very few studies have explored methodological aspects of this data linkage, although initial research has focused on ethical concerns and possible consent. This study explores the outcomes of actual data linkage, in particular the amount of data available to link to surveys, especially in a longitudinal setting. The amount of data collected from social media can vary across individuals; systematic differences across respondent characteristics may suggest possible information bias in the data. To study this, we link the Twitter data of respondents to two longitudinal surveys representative of Great Britain, the Innovation Panel (IP) and the NatCen Panel. We address the issues in collecting these data from Twitter, and then the amount of data collected from the site and how it may impact analyses. The amount of data available varies greatly across people, with regard to the number of tweets posted, but also to the number of followers and friends respondents have. Multivariate analyses of both data sources show that only a few of the characteristics are statistically significant, with number of followers being the strongest predictor of posting in both panels, women posting less than men, and some evidence that more educated respondents post less, but only in the IP. We conclude by analyzing the sentiment of tweets, showing that respondents who tweet more have in some instances stronger correlations between their tweets and related survey indicators, although not in all instances. Results suggest potential for research, with further methodological study warranted.

Using big data for sampling design and weighting applications: Quantitative user research at Twitter

Ms Xian Tao (Twitter) - Presenting Author

Various surveys are conducted by user researchers to provide reliable measures across consumer initiatives and other perceptual metrics tied to user experience and product design. The sample design for a survey needs to be tailored to the objective(s) of the study, such as the target population, key study variables and population parameters to be estimated. This paper demonstrates sampling and weighting methods used by quantitative user research at Twitter, including construction of a sampling frame by selecting proper frame variables using big data, determining target allocation, and weighting adjustments to minimize biases caused by sampling variation, non-response, undercoverage and other issues. Two case studies are shown to illustrate the implementation of the sampling and weighting methods and to show how big data are used to customize the frame variables for specific studies.

Evaluating data quality in the UK probability-based online panel

Dr Olga Maslovskaya (University of Southampton) - Presenting Author
Professor Gabriele Durrant (University of Southampton)
Mr Curtis Jessop (NatCen)

Relevance and Research Question: We live in a digital age with a high level of technology use. Surveys have also started adopting technologies for data collection. There is a move towards online data collection across the world due to falling response rates and pressure to reduce survey costs. Evidence is needed to demonstrate that the online data collection strategy will work and produce reliable data which can be confidently used for policy decisions. No research has been conducted so far to assess data quality in the UK NatCen probability-based online panel. This paper is timely and fills this gap in knowledge. It aims to compare data quality in the NatCen probability-based online panel and non-probability panels (YouGov, Populus and Panelbase). It also compares the NatCen online panel to the British Social Attitudes (BSA) probability-based survey, on the back of which the NatCen panel was created and which collects data using face-to-face interviews.
Methods and Data: The following surveys will be used for the analysis: the NatCen online panel, BSA Wave 18 data, and data from the YouGov, Populus and Panelbase non-probability-based online panels.
Various absolute and relative measures of differences will be used for the analysis, such as the mean average difference and the Duncan dissimilarity index, among others. This analysis will help us to investigate how sample quality might impact on differences in point estimates between probability and non-probability samples.
Results: The preliminary results suggest that there are differences in point estimates between probability- and non-probability-based samples. The detailed results of this analysis will be available in January 2020.
Added value: This paper compares data quality between a “gold standard” probability-based survey which collects data using face-to-face interviewing, a probability-based online panel and non-probability-based online panels. Recommendations will be provided for future waves of data collection and for new probability-based as well as non-probability-based online panels.
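
For readers unfamiliar with the measures named in the methods above, the following is a minimal illustrative sketch, not the authors' code. It assumes the Duncan dissimilarity index takes its usual form, half the sum of absolute differences between two category distributions, and it reads "mean average difference" as the mean absolute difference across categories, which is an assumption. The function names and the example proportions are hypothetical.

```python
def duncan_dissimilarity(p, q):
    """Half the sum of absolute differences between two distributions.

    p and q are sequences of category proportions over the same
    categories, each assumed to sum to 1. The result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same categories")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mean_absolute_difference(p, q):
    """Average absolute difference per category (assumed reading of
    'mean average difference')."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same categories")
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


# Hypothetical proportions for one three-category survey item.
benchmark = [0.20, 0.50, 0.30]  # e.g. a face-to-face reference survey
panel = [0.25, 0.45, 0.30]      # e.g. an online panel

print(duncan_dissimilarity(benchmark, panel))
print(mean_absolute_difference(benchmark, panel))
```

A Duncan index of 0 means the two distributions are identical; the value can be read as the share of one sample that would have to change category to match the other.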

Photos instead of text answers: An experiment within a housing survey

Mr Goran Ilic (Utrecht University)
Dr Peter Lugtig (Utrecht University)
Dr Joris Mulder (CentERdata, Tilburg University)
Professor Barry Schouten (Statistics Netherlands and Utrecht University) - Presenting Author
Dr Maarten Streefkerk (CentERdata, Utrecht University)

In general population housing surveys, respondents may be requested to give descriptions of their indoor and outdoor housing conditions. Such conditions may concern the general state of the dwelling, insulation measures the household has implemented to reduce energy use, the setup of their garden, the use of solar panels and the floor area. Part of the desired information may be burdensome to provide or may be non-central to the average respondent. Consequently, data quality may be low or sampled households/persons may decide not to participate at all. In some housing surveys, households are asked to give permission to a housing expert to make a brief inspection and evaluation. Response rates to such face-to-face inspections, typically, are low.
An alternative to answering questions may be to ask respondents to take pictures of parts of their dwelling and/or outdoor area. This option may reduce some burden and may improve data richness, but, obviously, may also be considered intrusive.
In this paper, we present the results of an experiment in which a sample of households from the Dutch LISS panel was allocated to one of three conditions: only text answers, only photos, or a choice between text answers and photos. Respondents were asked to provide information on three parts of their house: their heating system, their garden, and their favorite spot in the house. We present the feasibility of this alternative form of data collection and the measures implemented to ensure the respondents' privacy. Moreover, we present response rates and discuss the data quality of the text answers and photos.