BigSurv20 program



Properties of organic data and their integration with traditional surveys

Moderator: Peter Lugtig
Friday 6th November, 11:45 - 13:15 (ET, GMT-5)
8:45 - 10:15 (PT, GMT-8)
17:45 - 19:15 (CET, GMT+1)

Can social media data complement traditional survey data? A reflexion matrix to evaluate their relevance for the study of public opinion.

Ms Maud Reveilhac (Lausanne University, Switzerland, Faculty of Political and Social Sciences, Institute of Social Sciences, Life Course and Social Inequality Research Centre) - Presenting Author
Ms Stephanie Steinmetz (Lausanne University, Switzerland, Faculty of Political and Social Sciences, Institute of Social Sciences, Life Course and Social Inequality Research Centre)
Mr Davide Morselli (Lausanne University, Switzerland, Faculty of Political and Social Sciences, Institute of Social Sciences, Life Course and Social Inequality Research Centre)

Relevance & research question:
Traditionally, public opinion (PO) has been measured through probability-based surveys, which are considered the gold standard for generalising findings due to their population representativeness. In recent years, however, the turn to social media data (SMD) to gauge PO has led to a discussion about the potential for augmenting or even replacing survey data. Survey and social media researchers have, therefore, increasingly explored ways in which social media and survey data are likely to yield similar conclusions. In this context, two core challenges can be identified: i) researchers have mainly emphasised the replacement of survey data by SMD, rather than the complementarity between both data sources; ii) there are currently two understandings of PO, which makes complementarity of both data sources quite difficult. As a result, researchers still need more guidance on how best to complement SMD with survey data.
Methods & Data:
Whereas the recent extension of the Total Survey Error framework to SMD is an important step to account for the quality of both data sources, we would like to propose an additional step that should ideally come before the discussion and evaluation of the quality of the collected data. Building on four key challenges, we develop a reflexion matrix to provide practical guidelines on the complementarity of both data sources for the study of PO.
Results:
Our results convey two main take-home messages: i) we demonstrate that the main approach, which uses SMD to validate what has been found via surveys, is problematic, as survey measures convey an idea of simplicity and aggregation, whereas SMD are complex and multi-dimensional; ii) we provide researchers with an orientation on how SMD can be a potential complementary source to survey data.
Added Value:
We argue for the necessity of developing different and complementary views of PO when conducting research with a mixed-method approach, where complementarity of the data sources is one essential criterion. In addition, we point to possible solutions from other disciplines which have so far received little attention in studies of PO.

Integrating organic data and designed data for higher quality measurement: Overcoming coverage limitations of big data

Dr Leah Christian (Nielsen) - Presenting Author
Mrs Kay Ricci (Nielsen)

Many big data sources cover only a portion of the population researchers are interested in measuring. For example, social media data include only users of that specific social media site, and data from devices such as Smart TVs or Android mobile phones cover only users of those specific devices. One of the first key steps in using big data is to estimate what portion of the population of interest the big data source covers, generally using a designed sample that provides full coverage of the population. For example, to understand Twitter or Samsung Smart TV coverage, one must first understand what portion of the population uses that service/device. In addition, it is helpful to dimension the characteristics or behaviors of those people and how they differ from others in the population of interest on key measures. The best source of coverage information about big data is often a high quality designed data source.

This paper will focus on sharing approaches Nielsen has used to integrate big data sources with existing data from designed panels to produce higher quality measurement of people’s media consumption and behaviors. The volume of the big data sources helps provide stability and reduce variance, and makes it possible to measure lower incidence behaviors that may not be detected in data sources with smaller sample sizes. Since each source is limited in its coverage, our approach integrates big data with Nielsen designed panel data, rather than using big data alone, to provide a more robust overall measurement. We will share specific examples of how designed panel data can be used to create estimates and corrections for coverage limitations in big data sources. In addition, we will share approaches for integrating the data from multiple big data and designed data sources to produce estimates of media behavior.
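As a rough illustration of the coverage step described in this abstract, the sketch below estimates the coverage of a big data source from a weighted designed panel and then blends a big data estimate for the covered group with a panel estimate for the uncovered group. It is not Nielsen's method: the `PanelMember` fields, the `blended_mean` rule and all numbers are hypothetical, chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass


@dataclass
class PanelMember:
    weight: float       # design weight from the probability panel (assumed given)
    uses_device: bool   # True if this person is covered by the big data source
    minutes: float      # hypothetical measure: daily viewing minutes


def coverage_rate(panel):
    """Weighted share of the population covered by the big data source."""
    total = sum(p.weight for p in panel)
    covered = sum(p.weight for p in panel if p.uses_device)
    return covered / total


def weighted_mean(members):
    """Design-weighted mean of the hypothetical 'minutes' measure."""
    total = sum(p.weight for p in members)
    return sum(p.weight * p.minutes for p in members) / total


def blended_mean(panel, big_data_mean):
    """Use the big data mean for the covered group, the panel mean for the rest."""
    c = coverage_rate(panel)
    uncovered = [p for p in panel if not p.uses_device]
    if not uncovered:
        return big_data_mean
    return c * big_data_mean + (1.0 - c) * weighted_mean(uncovered)


# Illustrative, made-up panel records.
panel = [
    PanelMember(weight=100.0, uses_device=True, minutes=120.0),
    PanelMember(weight=100.0, uses_device=True, minutes=180.0),
    PanelMember(weight=200.0, uses_device=False, minutes=60.0),
]

rate = coverage_rate(panel)                          # 200 / 400
estimate = blended_mean(panel, big_data_mean=150.0)  # 0.5 * 150 + 0.5 * 60
```

The blend assumes the big data mean is unbiased for the covered group; in practice that assumption would itself need checking against the panel.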

Towards a total error framework for sensor and survey data

Mr Lukas Beinhauer (Student at UU & intern at CBS) - Presenting Author
Dr Ger Snijkers (Senior Methodologist at CBS)
Mr Jeldrik Bakker (Methodologist at CBS)

With the onset of Big Data, researchers have been looking into the opportunities offered by various new data sources. One of these research fields focusses on new data sources as input for official statistics. These data include sensor data. At Statistics Netherlands, a project was defined on the usage of water data collected by a number of companies that play a role in the water cycle, including the production of drinking water, the use of water by households and companies, and the cleaning of sewage water. These data include survey data, register data and sensor data. As with survey and register data, it is crucial that the quality of sensor data is assessed in order to get an idea of bias and accuracy. In this presentation we will discuss a total error framework developed for use on sensor (and similar) data. This framework is based on the Total Survey Error framework as proposed in Groves (2009), and extended by Zhang (2011) for secondary data sources like registers.

The life cycle of integrated survey data, in Zhang (2011), consists of two phases. The first phase, phase 1, refers to errors stemming from the data collection process itself. These are the rather traditional measurement and representation shortcomings discussed in Groves (2009). Phase 2 refers to errors stemming from the combination of survey and secondary data sources. The newly proposed total error framework extends this life cycle by adding a phase before the first one: phase 0. The errors discussed in phase 0 relate purely to issues and barriers that limit quality within the sensor data collection process. As in Zhang (2011) and Groves (2009), a distinction is made between the measurement and representation lines of the process.

The proposed total error framework was developed in a theory-driven manner, combining ideas from various schools of science dealing with survey, register and sensor data. Strong emphasis is put on the availability and extensiveness of metadata. Secondary use of data is commonplace nowadays; metadata are therefore crucial for the analyst to handle sensor (or any kind of) data appropriately. In addition, a number of quality metrics for phase 0 are suggested to quantify the quality of sensor data. These metrics include, for example, stability of measurements, occurrences of blocks of missing data, and similarities in measurement patterns between sensors. The framework and metrics were tested on real-life data that became available in the Statistics Netherlands water data project, and subsequently adapted in a data-driven approach.
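The three example metrics named above could be operationalised in many ways; the abstract does not give definitions. The sketch below shows one possible reading, with hypothetical function names and made-up readings: missing-data blocks as run lengths of absent values, stability as the standard deviation of changes between adjacent readings, and pattern similarity as a Pearson correlation over time points where both sensors reported.

```python
import math


def missing_blocks(series):
    """Lengths of consecutive runs of missing (None) readings."""
    blocks, run = [], 0
    for value in series:
        if value is None:
            run += 1
        elif run:
            blocks.append(run)
            run = 0
    if run:
        blocks.append(run)
    return blocks


def stability(series):
    """Standard deviation of changes between adjacent observed readings."""
    diffs = [b - a for a, b in zip(series, series[1:])
             if a is not None and b is not None]
    if len(diffs) < 2:
        return float("nan")
    mean = sum(diffs) / len(diffs)
    return math.sqrt(sum((d - mean) ** 2 for d in diffs) / (len(diffs) - 1))


def pattern_similarity(s1, s2):
    """Pearson correlation over time points where both sensors reported."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


# Made-up readings from two hypothetical sensors; None marks a missing value.
sensor_a = [10.0, 11.0, None, None, 13.0, 14.0, None, 15.0]
sensor_b = [20.0, 22.0, 23.0, None, 26.0, 28.0, 29.0, 30.0]

blocks_a = missing_blocks(sensor_a)
stability_a = stability(sensor_a)
similarity = pattern_similarity(sensor_a, sensor_b)
```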

In the presentation we will discuss the total error framework we have developed for these data, as well as the developed metrics to assess the quality of sensor data. Application of the framework may help in ensuring unbiased and reliable statistics.

Is bigger always better? Evaluating measurement error in organic TV tuning data

Ms Kay Ricci (Nielsen) - Presenting Author
Dr Leah Christian (Nielsen)

Although it often has high coverage and data quality, panel measurement is typically limited in size due to the complexity and cost of recruiting and maintaining high quality panels. These smaller sample sizes can introduce variability that makes it challenging to estimate lower incidence behaviors and to trend data over time. Survey or designed data can be supplemented with big data, which can provide volume and stability to estimates, but these organic data sources are not without measurement challenges. Big data sources may be subject to specific types of measurement error, such as issues with the devices/instruments, invalid or fraudulent data, and missing data. Thus, it is imperative to compare big data sources to other data sources in order to understand and correct for these types of measurement error.

Our presentation will explore measurement error in big data sources that measure media consumption, including tuning data from Smart TVs and set-top boxes from specific TV brands and cable/satellite providers. In an effort to dimension the potential bias in these data, Nielsen uses its nationally representative, probability-based TV panel as a calibration source. Each TV set within these panel homes is metered, enabling continuous measurement of tuning behavior throughout the day. Nielsen uses a double-blind approach to identify panel homes also present in a Smart TV or set-top box provider’s data set, and these common homes are used to compare minute-level tuning from each data source. These side-by-side comparisons of the tuning measured for each TV set have uncovered systematic measurement errors in the provider tuning data related to missing tuning minutes, incorrectly identified TV stations, inaccurate viewing times, and false tuning events. In addition to identifying data challenges, common homes are also critical in the development and testing of provider-specific models that improve the quality of the data and filter out invalid tuning prior to integration with our panel-based measurement. This presentation will cover the findings from our analyses as well as Nielsen's overall approach to big data calibration. Most importantly, our discussion will emphasize the need to evaluate measurement error within organic data and the essential role that designed data can play in making big data usable for quality measurement.
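A minute-level comparison for one TV set in a common home might look like the sketch below. It treats the metered panel log as the reference and classifies each minute of the provider log against it. The log format (a mapping from minute index to station, with absent minutes meaning the set was off), the category names and the data are all assumptions made for illustration, not Nielsen's actual procedure.

```python
def compare_tuning(panel_log, provider_log):
    """Classify each minute of provider tuning against the panel reference.

    Both logs map minute index -> station name; a minute that is absent
    from a log means that source recorded no tuning for it.
    """
    counts = {"agree": 0, "missing": 0, "false_tuning": 0, "wrong_station": 0}
    for minute in sorted(set(panel_log) | set(provider_log)):
        ref = panel_log.get(minute)
        obs = provider_log.get(minute)
        if obs is None:
            counts["missing"] += 1        # panel saw tuning, provider did not
        elif ref is None:
            counts["false_tuning"] += 1   # provider saw tuning, panel did not
        elif ref != obs:
            counts["wrong_station"] += 1  # both tuned, stations disagree
        else:
            counts["agree"] += 1
    return counts


# Made-up five-minute window for a single TV set.
panel_log = {0: "A", 1: "A", 2: "A", 3: "B"}
provider_log = {0: "A", 1: "C", 3: "B", 4: "B"}

result = compare_tuning(panel_log, provider_log)
```

Aggregating such counts over all common homes would give provider-level error rates; timing errors (tuning shifted by a few minutes) would need an alignment step that this sketch omits.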

Multivariate density estimation by neural networks

Ms Dewi Peerlings (Maastricht University and Statistics Netherlands) - Presenting Author
Professor Jan van den Brakel (Maastricht University and Statistics Netherlands)
Dr Nalan Bastürk (Maastricht University)
Dr Marco Puts (Statistics Netherlands)

The availability of 'organically' measured data, such as social media data and road sensor data, has been increasing in recent years. Such data, in contrast to 'designed' data collected e.g. from surveys, are often generated and collected at a relatively high frequency, resulting in a large amount of accessible data. However, the amount of information in such organically measured data is often smaller than in designed data: the datasets are typically characterized by high volatility, missing observations and sample selectivity. The signal, which carries the information content, needs to be separated from the noise. A typical way to do so is filtering, in which the signal with its lower volatility is extracted, the effect of outliers is reduced and missing observations are imputed using the filter. Following this filtering process, which requires an assumption about the probability density function (PDF) of the underlying data generating process (DGP), the data can be used for real time predictions. These predictions address the high volatility and missing observations that characterize the dataset, which is generated as a by-product of processes not related to statistical production purposes. In this way, these cleaned datasets can potentially be used for official statistics.
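The abstract does not say which filter is meant. As one common example, the sketch below implements a local level (random walk plus noise) Kalman filter, which assumes Gaussian noise and therefore illustrates the kind of PDF assumption mentioned above. The function name, the variance values and the data are hypothetical.

```python
def local_level_filter(observations, obs_var, state_var,
                       init_mean=0.0, init_var=1e6):
    """Kalman filter for a random-walk level observed with Gaussian noise.

    Missing observations (None) skip the update step, so the filtered
    level is carried forward and serves as the imputed value.
    """
    mean, var = init_mean, init_var
    filtered = []
    for y in observations:
        var = var + state_var            # predict: the level follows a random walk
        if y is not None:                # update only when a value was observed
            gain = var / (var + obs_var)
            mean = mean + gain * (y - mean)
            var = (1.0 - gain) * var
        filtered.append(mean)
    return filtered


# Made-up noisy series with a gap and one outlying value.
series = [10.0, None, 10.5, 30.0, 10.2]
signal = local_level_filter(series, obs_var=1.0, state_var=0.1)
```

With a small `state_var` relative to `obs_var`, the filter moves only part of the way towards the outlying value, which is the outlier-damping behaviour the paragraph describes.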

In this paper, we propose a new method to obtain the probability density function (PDF) to assess the properties of the underlying DGP without imposing any assumptions, using artificial neural networks. The proposed artificial neural networks have additional advantages compared to well-known parametric and non-parametric density estimators. Our approach builds on the literature on cumulative distribution function (CDF) estimation using neural networks. We extend this literature by providing the analytical derivatives of the CDF obtained from the artificial neural network. Our approach hence removes the approximation error in the second step of obtaining the PDF from the CDF output, leading to more accurate PDF estimates. We show that the proposed solution to obtain the PDF from the CDF output of an artificial neural network with several hidden layers holds in a multivariate setting for correlated variables. We illustrate the accuracy gains from our proposed method using several simulation examples in which the real DGP is known, so that the improvements in accuracy compared to existing methods can be assessed. We follow the continuous-data illustrations of the related literature and also consider a discrete data application and a multivariate data application. More specifically, the following simulation cases are considered: a univariate mixed normal distribution, a univariate mixed generalized extreme value distribution, a univariate mixed and a bivariate mixed correlated Poisson distribution, and a bivariate and a trivariate correlated standard normal distribution.
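The idea of differentiating a network's CDF output analytically, rather than numerically, can be sketched for the univariate case as follows. This is not the authors' implementation: the architecture (two tanh hidden layers, a sigmoid output), the function names and the fixed, untrained weights are assumptions, and the output is not a calibrated CDF. The sketch only shows that the chain-rule derivative matches a finite-difference approximation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cdf_and_pdf(x, params):
    """Return (F(x), dF/dx) for a two-hidden-layer network via the chain rule."""
    w1, b1, w2, b2, w3, b3 = params
    # Layer 1: h1_j = tanh(w1_j * x + b1_j), with derivative (1 - h1_j^2) * w1_j
    h1 = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    dh1 = [(1.0 - h * h) * w for h, w in zip(h1, w1)]
    # Layer 2: h2_k = tanh(sum_j w2_kj * h1_j + b2_k)
    h2, dh2 = [], []
    for row, b in zip(w2, b2):
        h = math.tanh(sum(wkj * hj for wkj, hj in zip(row, h1)) + b)
        h2.append(h)
        dh2.append((1.0 - h * h) * sum(wkj * dj for wkj, dj in zip(row, dh1)))
    # Output: F = sigmoid(z) with z = sum_k w3_k * h2_k + b3, so dF/dx = F(1-F) dz/dx
    F = sigmoid(sum(w * h for w, h in zip(w3, h2)) + b3)
    dz = sum(w * d for w, d in zip(w3, dh2))
    return F, F * (1.0 - F) * dz


# Hypothetical fixed weights; positive weights keep F non-decreasing in x.
params = (
    [0.8, 1.2, 0.5], [0.1, -0.3, 0.0],          # layer 1 weights, biases
    [[0.6, 0.4, 0.9], [0.7, 0.2, 0.3]], [0.0, 0.1],  # layer 2 weights, biases
    [1.5, 1.1], -0.2,                            # output weights, bias
)

x, eps = 0.4, 1e-5
F, pdf_analytic = cdf_and_pdf(x, params)
# Central finite difference of the CDF output, for comparison only.
pdf_numeric = (cdf_and_pdf(x + eps, params)[0]
               - cdf_and_pdf(x - eps, params)[0]) / (2 * eps)
```

In the multivariate setting the density would be the mixed partial derivative of the CDF with respect to all inputs, which requires higher-order terms not shown here.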