BigSurv20 program


Modes of data collection and their effect on data linkage results

Moderator: Ana Lucia Cordovar
Friday 27th November, 11:45 - 13:15 (ET, GMT-5)
8:45 - 10:15 (PT, GMT-8)
17:45 - 19:15 (CET, GMT+1)

How and why does the mode of data collection affect consent to data linkage?

Professor Annette Jackle (University of Essex) - Presenting Author
Dr Jonathan Burton (University of Essex)
Professor Mick Couper (University of Michigan)
Professor Thomas Crossley (European University Institute)
Ms Sandra Walzenbach (University of Essex)

There is increasing demand for survey data linked to administrative data. At the same time, surveys are increasingly using mixed-mode data collection. These trends present challenges: rates of consent to data linkage are typically 20 to 40 percentage points lower in self-completion than in interviewer-administered surveys. To date, little is known about why survey mode affects consent and what can be done to encourage informed consent without an interviewer.

There are several potential explanations for how survey mode might affect consent to data linkage: (1) interviewers can provide additional explanation and offer reassurance, (2) the presence of an interviewer can lead to social pressure to comply with the request, (3) interviewers control the pace of the interview, such that respondents process the consent request more thoughtfully, and (4) the technology used to administer the questionnaire provides context that influences respondents’ concerns about the security of their data.

In this paper we use experimental data from 2,608 respondents in the Understanding Society Innovation Panel (IP) to examine differences between web and face-to-face surveys in consent outcomes, concerns about data security, and how respondents process the consent request. The IP is a probability sample of households in Great Britain that interviews all adult household members annually. At wave 11, a random two-thirds of households were invited to complete the survey online, with non-respondents followed up by face-to-face interviewers. The other third were first called on by interviewers and then given the option to complete the survey online. We use the randomized allocation to test whether mode differences are due to selection of different types of respondents into modes.
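
The logic of using the randomized allocation can be illustrated with a small sketch. Because households were assigned to the web-first or interviewer-first arm at random, comparing consent rates by assigned arm (rather than by the mode a respondent actually used) is not distorted by respondents choosing their own mode. The Python sketch below computes such a difference in proportions with a normal-approximation standard error. The function name and all counts are hypothetical and chosen only for illustration; they are not results from the Innovation Panel, and the authors' actual analysis may differ.

```python
import math


def difference_in_proportions(consents_a, n_a, consents_b, n_b):
    """Compare consent rates between two randomly allocated arms.

    Returns (difference, standard error, z statistic), where the
    difference is rate_a - rate_b and the standard error uses the
    usual unpooled normal approximation.
    """
    rate_a = consents_a / n_a
    rate_b = consents_b / n_b
    diff = rate_a - rate_b
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    return diff, se, diff / se


# Hypothetical counts, NOT taken from the study:
# arm A = allocated to web-first, arm B = allocated to interviewer-first.
diff, se, z = difference_in_proportions(450, 1000, 300, 500)
print(f"difference = {diff:.3f}, se = {se:.4f}, z = {z:.2f}")
```

A comparison by allocated arm answers a slightly different question from a comparison by actual mode, since some web-first households end up being interviewed face to face; it is shown here only because it is the simplest comparison that randomization protects.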

The results suggest that there is a real effect of the mode on how respondents make the consent decision, which is not explained by selection into mode. Similar to previous studies, the online consent rate is 30 percentage points lower than the face-to-face rate. Although web respondents are just as confident in their consent decision, they are less likely to have understood the consent request.

Initial analyses of behaviour-coded audio recordings from the face-to-face interviews suggest that interviewers only rarely offer additional explanations or reassurance. Interviewer involvement does, however, affect how respondents process the consent request. Web respondents are less likely to use reflective strategies in their decision-making process and more likely to base their decision on gut feeling or habit. They take less time to answer the consent question and are less likely to read additional information about the linkage request. The provision of additional information is therefore unlikely to shift web respondents into a more reflective decision-making process or to support their understanding of the request. Web respondents also report higher levels of concern about data security than face-to-face respondents. In a follow-up experiment using an online access panel, priming respondents to think about trust in the organisations involved in the linkage increased the consent rate by 5 percentage points.

Comparing web and telephone surveys with register data: The case of the Global Entrepreneurship Monitor in Luxembourg

Mr Cesare Riillo (STATEC research) - Presenting Author

Relevance & Research Question: The failure to predict Brexit and the US election outcome has called poll and survey methodology into question. This study contributes to the debate by assessing the Total Survey Error of two surveys on entrepreneurship. One is a probability-based survey conducted by fixed-line telephone; the other is a nonprobability survey based on an opt-in web panel. The same questions are administered in both surveys. The study assesses which survey more closely resembles official register data, both in the distribution of respondents' socio-demographic characteristics and in the variable of interest (entrepreneurial activity).
Methods & Data: This research is based on Global Entrepreneurship Monitor (GEM) survey data for Luxembourg. GEM interviews individuals to collect internationally comparable information on entrepreneurship. There are two designs for the GEM survey in Luxembourg: telephone interviews, based on random dialling of fixed lines, and web interviews, based on stratified sampling from an opt-in web panel. The research is conducted in three steps. First, I compare the distribution of respondents' socio-demographic characteristics in both survey designs (web and telephone) with official data (census and business demography). Second, an econometric analysis (multivariate regression, Oaxaca decomposition and Coarsened Exact Matching) separates the difference in entrepreneurial activity between the two survey designs into parts due to observable and unobservable characteristics. Finally, I test how effectively weighting adjustments correct for the bias in entrepreneurial activity.
Results: Neither survey reproduces official data perfectly. Both the telephone and the web survey underestimate the proportion of low-educated adults recorded in census data. Additionally, respondents to the fixed-line survey are considerably older than the census population. In terms of entrepreneurship, the main variable of interest, the fixed-line survey underestimates the proportion of adults owning or managing a firm, whereas the web survey overestimates it. Current weighting procedures fail to account for the different survey designs (probability and nonprobability-based) and do not correct for the bias.
Conclusions and next steps: The study highlights the challenges of survey data collection. Results of both the web and the telephone survey differ from official figures. Weighting procedures that do not account for survey design are not appropriate.
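
As an illustration of the kind of weighting adjustment the abstract evaluates, the sketch below applies simple post-stratification: each respondent is weighted by the ratio of their cell's population share to its sample share, and the estimate of interest is recomputed with those weights. This is a generic textbook adjustment, not necessarily the procedure used for GEM; the cell labels, population shares, and responses are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_cells, population_shares):
    """Return one weight per respondent: population share / sample share of the cell."""
    n = len(sample_cells)
    counts = Counter(sample_cells)
    return [population_shares[cell] / (counts[cell] / n) for cell in sample_cells]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy sample: education cell and whether the respondent
# owns or manages a firm (1) or not (0). Not real GEM data.
education = ["low", "high", "high", "high"]
owns_firm = [0, 1, 1, 0]

# Hypothetical population shares, standing in for census figures.
census_shares = {"low": 0.5, "high": 0.5}

weights = poststratification_weights(education, census_shares)
print("unweighted:", sum(owns_firm) / len(owns_firm))
print("weighted:  ", weighted_mean(owns_firm, weights))
```

An adjustment of this kind only corrects for the variables used to define the cells. If entrepreneurs are over-represented in an opt-in panel within every education cell, the weighted estimate remains biased, which is consistent with the abstract's conclusion.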

PC versus mobile survey: are people’s subjective evaluations comparable?

Dr Francesco Sarracino (STATEC research)
Dr Cesare Riillo (STATEC research) - Presenting Author
Dr Malgorzata Mikucka (MZES, Mannheim University (Germany), and Universite Catholique de Louvain)

Relevance & Research Question:
The literature on mixed-mode surveys has investigated whether face-to-face, telephone, and online survey modes allow reliable data to be collected. Recent research has focused on the effect of using different devices to answer online surveys. Available findings indicate that surveys completed on smartphones suffer more than those completed on PCs from longer completion times, straightlining, acquiescence, primacy effects, break-off, and item non-response. In this paper we assess the comparability of subjective answers to an online survey administered via PC and smartphone.

Methods & Data:
We use a unique, nationally representative dataset from Luxembourg collected in 2017, which contains a battery of 20 questions about people's opinions. Respondents were free to choose the device they preferred to answer the online survey. The econometric analysis is done in two steps. First, we apply coarsened exact matching so that differences in answers are not driven by sample composition. Second, we estimate an ordered logit on the matched data, because responses are given on a scale from 1 to 5.
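
The matching step can be sketched as follows. Coarsened exact matching groups covariates into coarse bins, forms strata of respondents who share the same binned values, and keeps only strata that contain both device groups. The sketch below is a minimal illustration under stated assumptions: the covariates (age, gender), the 10-year age bands, the field names, and the toy respondents are all hypothetical, and the stratum weights that a full implementation would compute are omitted.

```python
from collections import defaultdict


def coarsen(respondent):
    """Hypothetical coarsening: 10-year age bands, gender kept as is."""
    return (respondent["age"] // 10, respondent["gender"])


def coarsened_exact_match(respondents, treatment_key="smartphone"):
    """Keep respondents whose stratum contains both treated and untreated units."""
    strata = defaultdict(list)
    for r in respondents:
        strata[coarsen(r)].append(r)

    matched = []
    for members in strata.values():
        groups_present = {bool(m[treatment_key]) for m in members}
        if groups_present == {True, False}:
            matched.extend(members)
    return matched


# Hypothetical toy respondents, not data from the study.
sample = [
    {"id": 1, "age": 23, "gender": "f", "smartphone": True},
    {"id": 2, "age": 27, "gender": "f", "smartphone": False},
    {"id": 3, "age": 45, "gender": "m", "smartphone": True},
    {"id": 4, "age": 62, "gender": "m", "smartphone": False},
]

matched = coarsened_exact_match(sample)
print([r["id"] for r in matched])
```

In this toy example, only the two respondents in their twenties share a stratum with both devices represented, so the other two are dropped. The ordered logit would then be estimated on the matched respondents only.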

Results:
We find that the choice of device does not systematically affect the answers to subjective questions. We test whether the device effect is heterogeneous by investigating its effect on sub-groups of the population defined by age, gender, education, occupation, language, immigration, and income. We do not find any evidence supporting the hypothesis that the device used to complete the survey affects respondents’ answers to subjective questions.

Added Value:
This evidence suggests that the device is unimportant for collecting people’s opinions, and lends support to those who argue in favour of improving respondents' experience by promoting the use of smartphones.
Keywords: device mode; mobile survey; statistical matching

Fieldwork conditions, sample quality, and the reliability of dried blood spot samples

Professor Axel Börsch-Supan (Max Planck Institute for Social Law and Social Policy)
Mrs Luzia Weiss (Max Planck Institute for Social Law and Social Policy) - Presenting Author
Dr Martina Börsch-Supan (SHARE-ERIC)
Ms Rebecca Groh (Max Planck Institute for Social Law and Social Policy)

Many socio-economic studies collect information on health to investigate the determinants of health in the respective study population. For this purpose, objective measurements are widely used alongside respondents’ self-reports of their health status. One way to obtain objective health information is to analyze blood samples for certain biomarkers. For a very large, cross-national survey such as the Survey of Health, Ageing and Retirement in Europe (SHARE), with about 85,000 respondents in 28 countries, taking venous blood is prohibitively expensive. SHARE therefore implemented the collection of blood in the form of dried blood spot samples (DBSS).
Laboratory results from dried blood spot assays cannot be directly compared to the results one would obtain from standard assays of “gold standard” venous blood samples (VBS), even if the spots are created under standardized laboratory conditions. Moreover, a variety of fieldwork conditions and sample quality measures, and possibly interactions among them, add further variance to the biomarker levels measured in field-collected DBSS. We used data science methods to compute a conversion formula that translates the DBSS values obtained under fieldwork conditions into gold standard values.
One factor playing a special role in this formula is blood spot size. SHARE therefore documented spot size as photographs for each dried blood spot to be analysed. We also developed a pattern-recognition approach to assess DBSS quality from these photographs, which then enters the conversion formula.
This presentation reports on results, demonstrates that we were able to develop models with a high prediction accuracy for most markers, and shows that we can reliably measure certain biomarkers in field-collected DBSS if fieldwork conditions and sample characteristics are properly taken into account.

What can be predicted from a national health survey? Is cancer one of them?

Dr Yigit Aydede (Saint Mary's University) - Presenting Author

Even after a quarter century of extensive research, researchers are still trying to determine whether cancer is preventable. Cancer is caused by both internal factors (such as inherited mutations, hormones, and immune conditions) and environmental factors (such as tobacco, diet, radiation, and infectious organisms). A major study in 2008 indicates that only 5–10% of all cancer cases are due to genetic defects and that the remaining 90–95% are due to factors related to environment and lifestyle. The objective of this work is to see whether these factors can be identified as predictors of cancer using high-dimensional data.

Predicting cancer, or any chronic disease, using publicly available data is a daunting thought experiment and, in fact, any attempt may seem suspiciously overpromising. However, we obtained special permission to access the confidential CCHS (Canadian Community Health Survey) files and the Discharge Abstract Database (DAD) and linked them to build a panel dataset, for the first time in Canada, that traces the respondents of the 2001 CCHS over the following 10 years, from 2001 to 2011. In developing the panel data, we were able to identify more than 60 thousand people who had no cancer in 2001 and had never had cancer before 2001, of whom around 6 thousand developed cancer in the following 10 years. Thus, the data allowed us to use every feature in the survey without concern about possible reverse causality or leakage issues. The level of detail in the data is unprecedented, ranging from weekly carrot consumption to the weekly time in minutes a person spends in Olympic weightlifting workouts. Moreover, with first-level interactions and 3rd-degree polynomials, its dimension can be extended to more than 130 thousand features. Learning to detect meaningful patterns in large and complex data sets such as ours, with the help of advanced machine learning methods and supercomputers, opens up new horizons.
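
To give a sense of how quickly interactions and polynomial terms inflate the feature space, the sketch below counts the columns produced from p base features. It assumes that "first-level interactions" means all pairwise products of distinct features and that "3rd-degree polynomials" adds a squared and a cubed term per feature. The abstract states neither the exact construction nor the number of base features, so both the counting rule and the value of p used here are hypothetical.

```python
from math import comb


def expanded_feature_count(p):
    """Count columns after adding pairwise interactions and powers up to degree 3.

    Assumed construction (not confirmed by the abstract):
      - p original features
      - comb(p, 2) pairwise interactions between distinct features
      - 2 * p extra columns for the squared and cubed terms
    """
    linear = p
    pairwise = comb(p, 2)
    powers = 2 * p
    return linear + pairwise + powers


# Hypothetical number of base features, for illustration only.
print(expanded_feature_count(500))
```

Under this counting rule the pairwise interactions dominate, growing roughly with the square of p, so a few hundred survey variables are enough to produce a six-figure number of columns. For binary indicators the squared and cubed terms duplicate the original column, so the effective count in real survey data would be somewhat lower.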

With the binary outcome, we first tried to see whether cancer, as a common disease, can be predicted with non-medical data and, if so, which predictors can be identified (for those aged 55-75). With our multi-stage algorithms, we could not exceed 62% prediction accuracy. This shows that, when age is controlled for, there are very few predictors common to all cancer types. In the second step, we used two different outcomes: all cancer types with and without lung cancer. The results show that first-hand and second-hand smoking are not identified as predictors when we exclude respiratory cancer patients from the sample. These initial results imply that smoking is not a common predictor of cancer once respiratory cancer types are removed from the data. We also used multinomial LASSO-family models for ten different cancer types. Although each type is associated with a different and unique set of predictors, none exceeds an AUC of 70%.
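
For readers less familiar with the AUC figure quoted above, the sketch below computes it from first principles: the AUC is the share of (case, non-case) pairs in which the case receives the higher predicted score, with ties counted as half. An AUC of 50% corresponds to random ranking and 100% to perfect ranking. The labels and scores are hypothetical and are not outputs of the authors' models.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a case, 0 for a non-case.
    scores: predicted risk, higher meaning more likely to be a case.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case and one non-case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted risks, for illustration only.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(example_labels, example_scores))
```

The pairwise loop is quadratic in the sample size and is meant only to make the definition concrete; on a dataset of tens of thousands of respondents a rank-based implementation from a statistics library would be the practical choice.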