BigSurv20 program



How official statistics can benefit from crowdsourced and administrative data

Moderator: Stas Kolenikov
Friday 27th November, 10:00 - 11:30 (ET, GMT-5)
7:00 - 8:30 (PT, GMT-8)
16:00 - 17:30 (CET, GMT+1)

Citizen science for official statistics: dream or reality?

Dr Olav ten Bosch (Statistics Netherlands) - Presenting Author
Dr Sofie de Broe (Statistics Netherlands)
Mr Kris Vanherle (TML Leuven)
Dr Ben Laevens (Ministry of Economic Affairs and Climate Policy, the Netherlands)


In today’s datafied society, people increasingly measure phenomena themselves, especially those of particular interest to them, such as the living conditions in their neighbourhood, the energy performance of their house, and their own sporting performance. In many cases the measuring devices used for such activities automatically upload the data to one or more citizen science portals, where it is published, usually anonymously, as open data. This raises the question of whether such data could be of use for official statistics and, if so, whether the official statistics community should be more active in citizen science communities.

A definite frontrunner in the citizen science area is the air quality community. Over the past ten years, numerous projects have been carried out in which citizens installed air quality sensors that feed data to a central node, which collects and displays the data streams as open data. One particularly successful example is the Luftdaten project [1]. Starting as a relatively small project in Stuttgart, it now comprises measurement stations spread across Europe, all delivering their data to the central hub. Bodies responsible for official air quality monitoring have started to take the citizen science data as an additional input for their models, which demonstrates the value of both the initiative and the data.

Another example in which citizens play a dominant role in data collection is the use of smart cameras for traffic measurement in the Telraam project [2] in Leuven, Belgium. Citizens attach a low-resolution camera to the front window of their homes, and embedded open source software, running on a Raspberry Pi, calculates hourly counts and speeds of the bicycles, cars, trucks and pedestrians passing by. The software processes the camera images immediately, so that they are never stored, which prevents privacy problems. The continuous data streams make it possible to show live figures, typical averages per time unit (e.g. hour, weekday, working day/weekend) and changes in those averages over time. These data have proven useful for many studies of regional mobility patterns.
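As an illustration of the kind of aggregation such a data stream enables, the sketch below (not the actual Telraam software; names and data are invented) turns timestamped hourly counts into typical averages per hour of day, split by weekday and weekend:

```python
from collections import defaultdict
from datetime import datetime

def typical_hourly_averages(counts):
    """Aggregate (timestamp, count) pairs into the typical average
    count per hour of day, split into weekday vs weekend."""
    buckets = defaultdict(list)
    for ts, n in counts:
        day_type = "weekend" if ts.weekday() >= 5 else "weekday"
        buckets[(day_type, ts.hour)].append(n)
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# Two weekday mornings and one Saturday morning, all at 08:00:
data = [
    (datetime(2020, 11, 23, 8), 120),  # Monday
    (datetime(2020, 11, 24, 8), 140),  # Tuesday
    (datetime(2020, 11, 28, 8), 60),   # Saturday
]
averages = typical_hourly_averages(data)
```

Because only the counts leave the device, an aggregation of this kind is compatible with the project’s never-store-images design.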

Yet another example of citizen-generated data can be found in the Australian PVOutput portal [3]. Citizens and organisations from all over the world, including the Netherlands, connect their solar panel systems to this portal so that they can monitor and benchmark the power production of their equipment. Research has shown that these high-frequency measurement data can be used to build an advanced model estimating the solar energy generated per region and per day, a major improvement over what could be published without this citizen science data.
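The core scaling idea can be sketched with a simple ratio estimator: compute the yield per installed kWp in the metered sample and multiply by the region’s registered capacity. The actual model referred to above is considerably more advanced; the numbers below are made up for illustration:

```python
def estimate_regional_output(observed, region_capacity_kwp):
    """Ratio estimator: scale the metered sample up to the region.
    observed: list of (system_capacity_kwp, daily_output_kwh) pairs;
    region_capacity_kwp: total registered capacity in the region."""
    sample_capacity = sum(c for c, _ in observed)
    sample_output = sum(o for _, o in observed)
    kwh_per_kwp = sample_output / sample_capacity  # sample yield
    return kwh_per_kwp * region_capacity_kwp

# Three metered systems (12 kWp in total) produced 42 kWh today;
# the region has 1000 kWp of registered capacity.
est = estimate_regional_output([(3, 10.5), (4, 14.0), (5, 17.5)], 1000)
```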

In this paper we examine various citizen science initiatives from the viewpoint of official statistics. We reflect on the value of the data and the representativity issues they raise, and we consider the role a statistical office could play in these communities.


Using geo-located tweets to calibrate citizen science sampling effort and identify inequalities

Mr Jonathan Kent (Universitat Pompeu Fabra) - Presenting Author

Big Data generated by citizen scientists has already transformed ecological research but comes with some inherent disadvantages. For projects which seek to estimate animal or insect population trends and spatial distributions, the scale of citizen science data is invaluable, but it is collected by volunteers while they go about their daily lives. As such, the distribution of observation sites is neither random nor systematic, and observations only occur within the activity spaces of citizen scientists, who are more likely to be highly educated and male (Ganzevoort et al., 2017). Concerningly, low-SES geographic areas are underrepresented in such studies (Hobbs and White, 2012).

These inequalities are especially worrying when citizen science is employed to track disease threats. Sightings of tiger mosquitoes and yellow fever mosquitoes reported to the Mosquito Alert app are used to model the probability of future sightings, especially along the Mediterranean coast of Spain, where the app’s user base is largest. The model controls for sampling effort (Palmer et al., 2017), but biases in the mobility patterns of app users are difficult to assess without reliable data on the daily mobility of the population of Spain at large. A second source of Big Data—Twitter—may help identify such biases and better calibrate the predictive model.

Twitter gives users the option to tag each of their tweets with their precise longitude and latitude coordinates at the time of sending. While a small fraction of users have opted in to this feature, the scale of Twitter allows for a very large dataset nevertheless. Wang et al. (2018), for example, gathered over 650 million tweets geotagged in 50 US cities over 18 months. Though the usership of Twitter has its own inherent biases, geo-located tweets are a global source of human mobility data with which citizen science data can potentially be calibrated. This paper compares the range and spread of users of both platforms along with the land use and population characteristics of users’ activity spaces. To assess how well geo-tagged tweets approximate population mobility, we will make an additional comparison in one city, Barcelona, where another project aims to collect a representative sample of smartphone-collected human mobility data.
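One common way to quantify the spread of a user’s activity space from point data such as geo-tagged tweets is the radius of gyration. The sketch below (a simplification, not necessarily the measure used in this paper) computes it with an equirectangular distance approximation:

```python
import math

def radius_of_gyration(points):
    """Spread of a user's activity space: root-mean-square distance
    (km, equirectangular approximation) of the user's points
    (lat, lon in degrees) from their centroid."""
    lat_c = sum(p[0] for p in points) / len(points)
    lon_c = sum(p[1] for p in points) / len(points)

    def dist_km(lat, lon):
        dy = (lat - lat_c) * 111.32  # km per degree of latitude
        dx = (lon - lon_c) * 111.32 * math.cos(math.radians(lat_c))
        return math.hypot(dx, dy)

    return math.sqrt(sum(dist_km(*p) ** 2 for p in points) / len(points))

# A user whose geo-tagged tweets all come from one spot in Barcelona:
r_fixed = radius_of_gyration([(41.39, 2.17)] * 3)
# A user moving between two points a few kilometres apart:
r_mobile = radius_of_gyration([(41.38, 2.15), (41.40, 2.19)])
```

Comparing the distribution of such a statistic across Twitter users and citizen scientists is one way to make the “range and spread” comparison concrete.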

This paper contributes in two ways. First, it will assess the feasibility of using geo-located tweets to validate similar citizen science data or other non-representative human mobility data. Second, it will identify inequalities that may exist in Mosquito Alert’s coverage area, such as in lower-income neighborhoods or industrial zones. This will allow for more accurate predictions of mosquito risk for residents and workers.

Using administrative data and machine learning to address nonresponse bias in establishment surveys

Mr Benjamin Küfner (Institute for Employment Research (IAB)) - Presenting Author
Mr Joseph Sakshaug (Institute for Employment Research (IAB))
Mr Stefan Zins (Institute for Employment Research (IAB))

In recent years, participation rates have been declining for the IAB Job Vacancy Survey (JVS), one of the largest establishment surveys in Germany, whose main aim is to quantify the size of unfilled labor demand. Declining response rates pose a risk of increasing nonresponse bias. In establishment surveys, nonresponse analyses and correction methods are usually limited to a few auxiliary variables used in simple models. We overcome these limitations by using the Establishment History Panel (BHP), a rich administrative data set covering the population of all employing establishments in Germany with a variety of employer and employee profile characteristics. This enables us not only to test more theory-driven hypotheses, but also to build data-driven models based on machine learning that capture deeper interactions and more complex nonresponse patterns. In the final step of our analysis, we use the estimated response models to construct establishment-level weights and evaluate model performance with respect to nonresponse bias reduction. For the evaluation, we present aggregate bias measures for the long-run trend in addition to individual estimates of bias. These analyses give insight into the extent to which nonresponse bias in establishment surveys can be measured, and potentially reduced, using a rich administrative data set with theory-based and data-driven machine learning methods. In addition, our paper sheds light on whether these additional tools are worthwhile for reducing nonresponse bias, and provides a blueprint for how other establishment surveys might use big data methods to improve their nonresponse adjustment procedures.
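The weighting step can be sketched as follows: fit a response-propensity model on administrative covariates, then weight each respondent by the inverse of its predicted response probability. The toy example below uses a hand-rolled logistic regression on a single invented covariate; the paper’s actual models are far richer machine learning models fit on the BHP:

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit a logistic response-propensity model by stochastic gradient
    descent. X: feature vectors (e.g. administrative covariates);
    y: 1 = establishment responded, 0 = nonresponse."""
    w = [0.0] * (len(X[0]) + 1)  # intercept + coefficients
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log-loss
            w[0] -= lr * g
            for j, xj in enumerate(xi):
                w[j + 1] -= lr * g * xj
    return w

def propensity(w, xi):
    """Predicted probability that an establishment responds."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: one covariate (say, log employment); larger
# establishments respond less often in this invented sample.
X = [[1.0], [1.5], [2.0], [4.0], [4.5], [5.0]]
y = [1, 1, 1, 0, 1, 0]
w = fit_logistic(X, y)
# Inverse-propensity weight for each responding establishment:
weights = [1.0 / propensity(w, xi) for xi, yi in zip(X, y) if yi == 1]
```

Respondents that resemble typical nonrespondents receive larger weights, which is what drives the bias reduction in weighted estimates.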

Crowdsourcing: a viable alternative to traditional price data collection methods?

Dr Halfdan Lynge (Wits School of Governance) - Presenting Author

In Mozambique, the consumer price index (CPI) is based on price data collected by the national statistical service, Instituto Nacional de Estatística (INE). Every month, INE sends out enumerators who collect price data on key food items. Mozambique is a large country with poor infrastructure, and collecting data is expensive, particularly in rural areas. Because of that, INE collects price data only in the four largest cities: Maputo, Matola, Beira, and Nampula. The rest of the country is excluded. We know that food prices and inflation are highly localised, yet the government has no information about food prices and inflation in the rural areas, where two-thirds of the population live. This significantly reduces its planning capabilities and makes it difficult to develop evidence-based rural development policies.
The Ministry of Economics and Finance is trying to address this. With the support of the United Nations University World Institute for Development Economics Research (UNU-WIDER), it has contracted local startup Sauti to develop a crowdsourced price data platform. The platform, which draws on GSM technology to minimise costs, will be piloted in two provinces over a six-month period. Every two weeks, Sauti will send out bulk SMSs inviting people to provide price data on key food items. In exchange, respondents will receive a small monetary reward, uploaded directly to their mobile phones.
The proposed paper will summarise the experiences, focusing on (a) the platform itself, (b) the effectiveness of the reward scheme, and (c) the quality of the data. In areas where official price data exist, the two collection methods will be compared, with a view to assessing whether crowdsourcing offers a viable, low-cost, and more fine-grained alternative to traditional data collection methods.
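As a hypothetical illustration of how such crowdsourced reports might be processed (names and numbers are invented, and this is not necessarily the Sauti pipeline), the sketch below computes item-level price relatives, using medians to dampen noisy or erroneous SMS reports:

```python
import statistics

def price_relatives(base_prices, current_prices):
    """Item-level price relatives from crowdsourced reports.
    Each value is a list of prices reported for that item; medians
    dampen outliers and typos in the SMS reports."""
    rel = {}
    for item, reports in current_prices.items():
        base = statistics.median(base_prices[item])
        rel[item] = statistics.median(reports) / base
    return rel

# Invented reports (prices per kg) from two collection rounds:
base = {"maize": [30, 32, 31], "rice": [60, 58]}
current = {"maize": [33, 34, 35], "rice": [59, 61]}
rel = price_relatives(base, current)
```

Aggregating such relatives with expenditure weights would yield a district-level index comparable to the official CPI where both exist.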

Sensor measurements for living and working conditions

Professor Barry Schouten (Statistics Netherlands and Utrecht University)
Professor John Bolte (The Hague University of Applied Sciences) - Presenting Author

On April 16 and 17, 2020, the third edition of the Sensor Data Challenge was held by The Hague University of Applied Sciences, Statistics Netherlands, Utrecht University and the National Institute for Public Health and the Environment. The challenge provides hardware (various sensors, a Raspberry Pi) and software to teams with a mix of expertise in electrical engineering, mechatronics, data science, user experience and industrial design. Teams must design a tool and demonstrate its feasibility and relevance for one of the presented challenges.
The third edition had sensor measurements for living and working conditions as its central theme. Winning solutions from the two previous editions have been the starting point for large-scale ongoing research projects. We would like to present a brief summary of the solutions presented by the participating teams at the third challenge and offer the winning team the opportunity to share and discuss their ideas with a larger audience at BigSurv20.