BigSurv18 program







Missing Data? No Big Deal Using New Big Data Methods

Chair: Dr Paul Biemer (RTI International)
Time: Friday 26th October, 14:15 - 15:45
Room: 40.035 S. Graus

Motivated Misreporting in Crowdsourcing Tasks of Content Coding, Image Classification, and Surveys

Dr Yuli Hsieh (RTI International)
Ms Herschel Sanders (RTI International)
Ms Amanda Smith (RTI International)
Dr Stephanie Eckman (RTI International) - Presenting Author

Crowdsourcing has become a popular means to solicit assistance for scientific research. From classifying images or texts to responding to surveys, tapping into the knowledge of crowds to complete complex tasks has become a common strategy in social and information sciences. Although the timeliness and cost-effectiveness of crowdsourcing may provide desirable advantages to researchers, the data it generates may be of lower quality for some scientific purposes. The quality control mechanisms, if any, offered by common crowdsourcing platforms may not provide robust measures of data quality. This study explores whether research task participants may engage in motivated misreporting whereby participants tend to cut corners to reduce their workload while performing various scientific tasks online.

We conducted an experiment with three common crowdsourcing tasks: answering surveys, coding images, and classifying online social media content. The experiment recruited workers from three sources: a crowdsourcing platform for crowd workers, a commercial survey panel provider for online panelists, and a research volunteering website for citizen scientists. The analysis addresses two questions: (1) whether online panelists, crowd workers, and volunteers engage in motivated misreporting differently and (2) whether the patterns of misreporting vary by task type. We further examine the potential correlation between patterns of motivated misreporting and the data quality of complex scientific research tasks. The study closes with suggested quality assurance practices for incorporating collective intelligence into large-scale online information analysis in social science research.


New Data to Correct for Nonresponse Bias: The Case of Administrative Data in Spain

Mr Pablo Cabrera-Álvarez (University of Salamanca) - Presenting Author

In recent years, an increasing amount of data has been generated, stored and managed. These new data sources are also increasingly available for research purposes, and the field of survey methodology is no exception. This paper, which focuses on the Spanish case, addresses the usability of administrative data for analysing and correcting non-response patterns. To do this, I propose to explore different matching strategies, such as deterministic and probabilistic matching techniques.

On the one hand, non-response bias, partly due to the steady decline of response rates in most countries, has been a matter of concern in recent decades. The techniques normally employed to tackle non-response and non-coverage biases once the data are gathered are weighting (for unit non-response, i.e. the absolute lack of response) and multiple imputation (for partial, or item, non-response). For these techniques to succeed, the auxiliary variables included in the models, which must be available for both respondents and non-respondents, need to be correlated with both the response propensity and the target outcome. The key, then, is to find the variables that can rebalance the sample for a specific estimate.
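
As an illustration of the weighting idea described above, the following is a minimal sketch of a weighting-class adjustment for unit non-response. It is not the author's procedure: the function name, the record layout (`cell`, `base_weight`, `responded`) and the choice of adjustment cells are assumptions made only for this example. Within each cell defined by an auxiliary variable known for respondents and non-respondents alike, respondent weights are inflated so that they also represent the non-respondents of that cell.

```python
from collections import defaultdict


def weighting_class_adjustment(sample):
    """Return respondents with weights adjusted for unit non-response.

    `sample` is a list of dicts with the (hypothetical) keys:
      'cell'        - adjustment cell built from auxiliary variables
      'base_weight' - design weight of the sampled unit
      'responded'   - True if the unit took part in the survey
    """
    total = defaultdict(float)  # sum of base weights, all sampled units
    resp = defaultdict(float)   # sum of base weights, respondents only
    for unit in sample:
        total[unit["cell"]] += unit["base_weight"]
        if unit["responded"]:
            resp[unit["cell"]] += unit["base_weight"]

    adjusted = []
    for unit in sample:
        if unit["responded"]:
            # Inverse of the weighted response rate observed in the cell.
            factor = total[unit["cell"]] / resp[unit["cell"]]
            adjusted.append({**unit, "weight": unit["base_weight"] * factor})
    return adjusted
```

The adjustment only reduces bias to the extent that the cell variable is related to both the response propensity and the survey outcome, which is the condition stated in the paragraph above.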

On the other hand, beyond the debate over whether big data will one day be capable of replacing survey research, scholars have highlighted the opportunities of combining big data with surveys. For instance, Smith (2011) proposes four different uses of auxiliary data in the framework of the Multi-Level Integrated Database Approach (MIDA): 1) non-response and non-coverage analysis and adjustments; 2) supporting data collection; 3) substantive analysis; and 4) interview validation. One of the most valued sources is administrative data, because of its completeness and wide scope. The drawbacks are the difficulty of access for users outside the national institutes and the new sources of error introduced by the data matching process.

This research uses open administrative data in Spain, both aggregated and at the (anonymised) micro level, to explore whether they can be used to analyse and correct for non-response outside the field of official statistics. Survey data from an internet probability panel will be used to carry out the feasibility analysis using different matching techniques (e.g. deterministic and probabilistic). The feasibility analysis includes systematic comparisons of the survey and administrative data sources using benchmark indicators. This study will also present information about the matching process in order to assess the difficulties of carrying it out in a country such as Spain.
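
To make the contrast between the two matching techniques concrete, here is a minimal sketch under stated assumptions. The field names and the m/u probabilities are invented for illustration and are not taken from the study. Deterministic matching requires exact agreement on every key, while the probabilistic variant shown here sums Fellegi-Sunter-style log-likelihood-ratio weights, so a pair can still score as a likely match when one field disagrees.

```python
import math

# Hypothetical parameters per field: (m, u), where
#   m = P(fields agree | records belong to the same unit)
#   u = P(fields agree | records belong to different units)
FIELD_PARAMS = {
    "birth_year": (0.95, 0.02),
    "sex": (0.98, 0.50),
    "postcode": (0.90, 0.01),
}


def deterministic_match(a, b, keys=("birth_year", "sex", "postcode")):
    """Exact matching: the pair is linked only if every key agrees."""
    return all(a[k] == b[k] for k in keys)


def match_weight(a, b, params=FIELD_PARAMS):
    """Sum of log2 agreement/disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in params.items():
        if a[field] == b[field]:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total
```

In practice a pair would be classified by comparing `match_weight` with upper and lower thresholds, with pairs in between sent for clerical review; choosing those thresholds and estimating m and u are outside the scope of this sketch.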


Health Survey Non-Representativeness Bias Methodology and Validation

Dr Linsay Gray (MRC/CSO Social & Public Health Sciences Unit, University of Glasgow) - Presenting Author
Ms Megan Yates (MRC/CSO Social & Public Health Sciences Unit)
Dr Tommi Härkänen (National Institute for Health and Welfare (THL), Helsinki)
Dr Oarabile Molaodi (MRC/CSO Social & Public Health Sciences Unit)
Dr Hanna Tolonen (National Institute for Health and Welfare (THL), Helsinki)
Professor Alastair Leyland (MRC/CSO Social & Public Health Sciences Unit)
Professor Pekka Martikainen (Department of Sociology, University of Helsinki)

The reliability of estimates from surveys is dependent on how representative their data are of the general population. Representativeness can be threatened by compromised survey participation levels, which have been declining in recent decades. This decline is widely recognised as a considerable and growing problem since the standard means of survey weighting of respondent data, usually based on sociodemographic characteristics, does not generally adequately correct for differences between participants and non-participants within sub-groups.

We have recently developed an advanced methodology to address survey non-participation, with an exemplar application to correct alcohol consumption estimates in Scotland. This utilises record linkage of health survey participants to hospitalisation and mortality files, comparing alcohol-related hospitalisations/deaths ("harms") as well as sociodemographics to those in the general population to allow inference on non-participants. We then use this inference as the basis for generating partial synthetic observations for non-participants with the corresponding alcohol-related harm rates in demographic sub-groups, and use multiple imputation to reliably fill in their “missing” alcohol measurements. We take this approach rather than simply incorporating harms into survey weights, since it has the flexibility to accommodate differential non-participation patterns within subgroups.
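
The indirect inference on non-participants can be illustrated with simple accounting. The sketch below is a simplification rather than the authors' full method: it assumes that, within one demographic sub-group, the population harm rate is the participation-weighted average of the rates among participants and non-participants, and solves for the non-participant rate. The function and parameter names are hypothetical.

```python
def implied_nonparticipant_rate(pop_rate, participant_rate, participation):
    """Harm rate among non-participants implied by the identity

        pop_rate = participation * participant_rate
                   + (1 - participation) * nonparticipant_rate

    All arguments are proportions for a single demographic sub-group;
    `participation` is the share of the sub-group that took part.
    """
    if not 0 < participation < 1:
        raise ValueError("participation must lie strictly between 0 and 1")
    return (pop_rate - participation * participant_rate) / (1 - participation)
```

For example, with a population harm rate of 5%, a participant harm rate of 3% and 60% participation, the identity implies a harm rate of about 8% among non-participants. Synthetic non-participant records would then be generated to reproduce such rates before the alcohol measurements are imputed; that step is not shown here.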

Such indirect inference on the non-participants is necessary in the absence of direct information. However, with the register-based systems in Finland (as in other Nordic countries), non-participation bias can be assessed by direct comparison of participants and non-participants, since the sociodemographic characteristics of the non-participants are already known at the individual level and their health outcomes are identifiable via record linkage. Of course, the ability to also make the indirect comparisons with the general population, as we have done thus far, holds in Finland too. We can therefore assess the validity of our existing approach by comparing estimates derived from both approaches.

We aim to validate our methodology for dealing with survey non-participation by means of the Finnish Health 2000 survey, which has record linkage to hospitalisation and mortality files for non-participants as well as participants, with follow-up to 2015. For both participants and non-participants we have: age-group, sex, socioeconomic measures, and subsequent alcohol-related harms. Reference is made to a separate aggregated 11% sample of the contemporaneous Finnish general population with counts of alcohol-related harms through to the end of 2015.

We can: 1) Quantify differences between survey participants and non-participants in terms of alcohol-related harms; 2a) Use multiple imputation to obtain values for alcohol consumption measures in non-participants; 2b) Apply the developed methodology based on comparisons of participants and the contemporaneous general population; and 3) Compare the overall alcohol consumption estimates in 2a) vs 2b).

We will gain insight into the extent to which our methodology, with its absence of detailed information on non-participants, is robust. The focus is on refining measurement of alcohol consumption, but the methodology has wider applicability to, for instance, cigarette smoking and obesity whenever survey data have been record-linked to administrative data.


Experiences in FBI's NCS-X NIBRS Estimation Project

Final candidate for the monograph

Dr Dan Liao (RTI International) - Presenting Author
Dr Marcus Berzofsky (RTI International)
Mr Ian Thomas (RTI International)
Mr Lance Couzens (RTI International)
Dr Alexia Cooper (Bureau of Justice Statistics)

One of the big challenges with big data is ensuring the data are of proper quality, representativeness, and completeness for analysis. In this presentation, we describe the methods developed to assess and address these issues for a large administrative source of crime data. The FBI’s National Incident Based Reporting System (NIBRS) is an incident-based reporting system used by law enforcement agencies (LEAs) in the United States for collecting and reporting a variety of information on each crime incident. However, only around 6,600 of the nearly 18,000 LEAs in the United States report their crime and arrest data using NIBRS. To generate national estimates based on NIBRS data, the National Crime Statistics Exchange (NCS-X) Initiative is in the process of transitioning 400 selected non-reporting LEAs to NIBRS. This LEA sample was selected from across the country and is designed to allow for nationally representative estimation. The NCS-X NIBRS estimation project aims to develop a two-phase process that combines data collected from the 400 LEAs with the NIBRS data reported by the current 6,600 LEAs to produce timely, accurate, and detailed national measures of crime and arrest. The first phase will include developing and testing statistical procedures for data quality and completeness assessment, unit and item nonresponse adjustments, and final statistical estimation to generate national estimates. The second phase will produce a prototype of an automated system that can generate a data resource or set of resources to support producing timely national estimates.

In this presentation, we will share our experiences in dealing with challenges in the NCS-X NIBRS estimation project. The data elements in NIBRS are organized in a complex way to reflect many different aspects of a crime incident, such as the offence types, characteristics of the victim(s) and offender(s), types and value of property stolen and recovered, and characteristics of arrestee(s). We will present our strategies for organizing and manipulating these large-scale, complexly structured data in conjunction with other auxiliary data sources. We will also present methods developed to compensate for noncoverage and nonresponse in NIBRS while accounting for the complexity of the data structure and the feasibility of producing the automated prototype in the second phase. These experiences can be extended to the use of other administrative data sources for statistical purposes.