BigSurv18 program







Ethical Considerations for Using Big Data I: Exploring the Ethics of Data Linkage

Chair: Dr Henning Silber (GESIS - Leibniz Institute for the Social Sciences)
Time: Friday 26th October, 11:30 - 13:00
Room: 40.150

Attitudes Towards Data Linkage, Privacy, Ethics, and the Potential for Harm

Final candidate for the monograph

Dr Aleia Fobia (U.S. Census Bureau) - Presenting Author
Ms Jennifer Childs (U.S. Census Bureau)
Dr Casey Eggleston (U.S. Census Bureau)


In the United States, the Commission on Evidence-Based Policymaking advocated for expanded use of data from federal statistical and regulatory agencies to help guide decision-making and ultimately improve programs and policies. Part of this expanded use includes sharing and linking data from various agencies and leveraging other big data sources. Federal systems will likely face challenges stemming from public perceptions of data linkage as well as ethical concerns about harm and transparency. In this study, we examine public opinion on data sharing, ethical concerns surrounding data linkage, and concepts of harm. In general, there is evidence that the public is not convinced that government promises of confidentiality will be honored (Anderson & Seltzer, 2007). Often these concerns involve the fear that information is shared between statistical agencies and enforcement agencies, such as the Census Bureau providing information for terrorist profiling (Hudson, 2004) or to agencies such as Homeland Security (Clemetson, 2004). These concerns could become even more pronounced as the capacity for large-scale data collection, analysis, and linkage becomes more widespread and the public learns about these capabilities.

Both quantitative and qualitative data are analyzed in this study. Quantitative data were collected through questions about public opinion and data linkage on the Gallup Daily Tracking Survey, which consists of computer-assisted telephone interviews with randomly sampled respondents from all 50 U.S. states and the District of Columbia. Qualitative data were collected from a series of focus groups, cognitive interviews, and web probing studies conducted between 2015 and 2018 that focused on respondents' views of data security, privacy, and confidentiality.

Starting with the qualitative data, we developed measures of the ethical concerns and concepts of harm that respondents may have as the federal government moves toward larger-scale data linkage. Initial qualitative findings suggest that respondent concerns focus on data security and the risk of identity theft, potential harm to children, financial loss, and the use of statistical data for regulatory enforcement (e.g., deportation or tax assessment). We also examine respondents' desires for privacy and confidentiality in the broader context of government data sharing. Findings point toward potential communication strategies that could increase public support for data linkage and sharing.

Anderson, Margo, and William Seltzer. (2007). Challenges to the Confidentiality of U.S. Federal Statistics, 1910-1965. Journal of Official Statistics, 23, pp. 1-34.

Hudson, Audrey. (2004). Study Used Census Information for Terror Profile. The Washington Times, January 19, 2004. http://www.washingtontimes.com/national/20040118-114335-2930r.htm (accessed January 24, 2004).

Clemetson, Lynette. (2004). Homeland Security Given Data on Arab-Americans. The New York Times, July 30, 2004. http://www.nytimes.com/2004/07/30/politics/30census.html.


Public Confidentiality Expectations Regarding Data Linkage

Ms Jennifer Childs (U.S. Census Bureau) - Presenting Author
Dr Casey Eggleston (U.S. Census Bureau)
Dr Aleia Fobia (U.S. Census Bureau)


Declining survey participation rates, increased demand for data, and declining federal budgets require consideration of supplementing or replacing survey reports with records and other data sources. Doing this requires a better understanding of public trust in statistics and statistical agencies, as well as how potential changes, such as incorporating big data, might affect those attitudes. Through a series of studies, this paper explores confidentiality expectations that could influence how the public views the statistical system in the United States, and how that could change if the statistical system moved in the direction of using big data.

Questions about public opinion of the federal statistical system and on data linkage were collected as part of the Gallup Daily Tracking Survey, which conducts computer-assisted telephone interviews with randomly sampled respondents from all 50 U.S. states and the District of Columbia.

Using these data, we examine favorability toward data linkage and sharing for various purposes, whether the type of oversight of data sharing (federal versus private) affects these perceptions, and under what circumstances respondents are amenable to replacing survey reports with records. We also examine respondents' beliefs as to whether a federal statistical agency, like the U.S. Census Bureau, has the technical capability to keep data confidential, and whether those beliefs would change if data were combined with those from other government agencies. We show that variables such as belief in transparency and experience using data are correlated with respondents' beliefs. Our findings suggest that in order to maintain public trust, statistical agencies need to communicate about major methodological changes in a transparent manner.


Evaluating Survey Consent to Social Media Linkage

Dr Zeina Mneimneh (University of Michigan) - Presenting Author
Miss Colleen McClain (University of Michigan)
Dr Lisa Singh (Georgetown University)
Dr Trivellore Raghunathan (University of Michigan)

Researchers across the social and computational sciences are increasingly interested in understanding the utility of “big” textual data for drawing inferences about human attitudes and behaviors. In particular, discussions about whether survey responses and social media data can complement or supplement each other are frequently raised, given the differences between the two sources in their representation and measurement properties. To answer some of these questions, a growing body of work has attempted to assess the conditions under which survey data and social media data yield similar or divergent conclusions. Most of this work, however, analyzes the corpora of survey and social media data separately and compares them at a macro level. While such research has value and gives suggestive evidence about the utility of comparing or combining such data sources, it is limited in its empirical investigation of the mechanisms behind divergences between the two data sources, which might be due to some combination of coverage, measurement, and the data generation process (designed vs. organic).

To expand on this research into the comparability of social media posts and survey responses, a few studies have conducted within-person analyses (e.g., Murphy, Landwehr, & Richards, 2013; Wagner, Pasek, & Stevenson, 2015) and have tested methods of requesting consent to link a respondent’s survey responses to his or her social media data, generally by asking for one’s username and for permission to access and link certain social media data. Most of this work, however, is limited either by the small yield of respondents who consent to link and provide usable information or by its reliance on non-probability samples, which hampers systematic investigation of questions about representativeness. Most importantly, issues related to consent language and the ethics of collecting and disseminating such data, given their public availability (in the case of Twitter), have not been given the needed attention.

In this presentation, we build on these few studies and examine the incidence and predictors of consent to Twitter linkage requests from different types of probability-based surveys: a national survey, a college survey, and a web-based panel survey. We discuss the yield of such requests not only in terms of the rate of consent, but also in terms of the richness of the Twitter data collected, as measured by the frequency of tweeting. Tweeting frequency is essential given that the usefulness of Twitter data depends on the amount of information shared by the respondent. We conclude by discussing the ethical challenges involved. These include the essential components of consent statements for collecting and linking social media data in a transparent way, and issues of data dissemination given their “public” nature and the network of information available on other, non-consenting users.


Privacy-Preserving Methods for Linking Big Data and Survey Data Sets

Professor Rainer Schnell (University of Duisburg-Essen) - Presenting Author
Mr Christian Borgs (University of Duisburg-Essen)


For the prediction of social phenomena, a scientific explanation requires information on covariates considered to influence individual behavior. Until recently, social scientists mainly used experimental or survey data to obtain this kind of information. The increasing production of 'big data' sets such as transactional data, sensor data, social media data, and administrative data seems to open new pathways to answer research questions.

However, most 'big data' sets contain only very few covariates. Therefore, most applications of big data for a predictive social science will need linkage of datasets. In practice, this requires the one-to-one identification of individuals in different datasets.

Linking large-scale administrative databases either with data of the same type or with survey data is technically simple compared to other linkage problems. While some 'big data' sources, such as transactional data, are already specific to individuals, others, such as sensor data or social media data, are not directly related to specific individuals. For most explanatory applications, this kind of non-identified data has to be linked to data specific to individuals. Therefore, identifiers have to be present in both data sets. In the absence of identifying information in the data, linking 'big data' to survey data on individual respondents is impossible.

In general, there are three ways to link data on persons: using a unique identifier (a Personal Identification Number), using unencrypted pseudo-identifiers (e.g., names and date of birth), or using encrypted pseudo-identifiers (e.g., hashed names and date of birth). If a unique ID is available, linking is a trivial merge operation. If no ID is available, either unencrypted or encrypted identifiers have to be used. More often than not, legal constraints require encryption. For example, pseudonymization is strongly recommended for record linkage under EU law (Regulation (EU) 2016/679, the GDPR).
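
As a minimal illustration of these three routes (a sketch with toy data and hypothetical column names, not tied to any particular study), the following Python snippet performs a trivial merge on a unique ID, a merge on clear-text name and date of birth, and a merge on keyed hashes of those pseudo-identifiers. It also shows why exact hashing alone is not error tolerant, which motivates the PPRL methods described next.

```python
# Sketch of the three linkage routes with toy data; column names are hypothetical.
import hashlib

import pandas as pd

survey = pd.DataFrame({"pid": [1, 2],
                       "name": ["anna meier", "jan novak"],
                       "dob": ["1980-01-02", "1975-07-30"],
                       "y": [3.1, 4.2]})
admin = pd.DataFrame({"pid": [1, 2],
                      "name": ["anna meier", "jan nowak"],   # note the typo
                      "dob": ["1980-01-02", "1975-07-30"],
                      "x": [10, 20]})

# 1) Unique identifier: linkage is a trivial merge.
linked_id = survey.merge(admin[["pid", "x"]], on="pid", how="inner")

# 2) Unencrypted pseudo-identifiers: merge on clear-text name and date of birth.
linked_plain = survey.merge(admin[["name", "dob", "x"]], on=["name", "dob"], how="inner")

# 3) Encrypted pseudo-identifiers: exchange only keyed hashes (pseudonyms) of name + dob.
def pseudonym(name: str, dob: str, key: str = "shared-secret") -> str:
    return hashlib.sha256(f"{key}|{name}|{dob}".encode()).hexdigest()

for df in (survey, admin):
    df["token"] = [pseudonym(n, d) for n, d in zip(df["name"], df["dob"])]
linked_hashed = survey.merge(admin[["token", "x"]], on="token", how="inner")

# Exact matching (routes 2 and 3) is not error tolerant: the typo "nowak" loses one link.
print(len(linked_id), len(linked_plain), len(linked_hashed))  # 2 1 1
```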

This legal requirement gave rise to the very active research field of Privacy-Preserving Record Linkage (PPRL), which is devoted to methods for error-tolerant linking using encrypted identifiers only. This contribution will give an overview of big data PPRL implementations of techniques for linking natural persons across two or more databases without revealing their identities. We will present solutions for the private linking of big data sets using state-of-the-art encryption coupled with efficient computational techniques. We will discuss the advantages and disadvantages of current methods, their safety against attacks, and their computational efficiency. After giving best-practice recommendations, we will demonstrate their implementation in our publicly available R library.
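
One well-known building block in this field, sketched below purely for illustration (it is not necessarily the specific method presented in this talk, and all parameters are arbitrary toy values), encodes q-grams of the identifiers into keyed Bloom-filter-style bit patterns, so that only encodings are exchanged and similarity can still be computed despite typos.

```python
# Illustrative Bloom-filter-style PPRL comparison; all parameters are toy values.
import hashlib

L, K, Q = 1000, 20, 2  # filter length, hash functions per q-gram, q-gram size

def qgrams(s: str, q: int = Q) -> set:
    s = f" {s.lower().strip()} "  # pad so string boundaries also contribute q-grams
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def bloom_bits(s: str, key: str = "shared-secret") -> set:
    """Bit positions set in the keyed Bloom-filter encoding of string s."""
    bits = set()
    for gram in qgrams(s):
        for k in range(K):
            digest = hashlib.sha256(f"{key}|{k}|{gram}".encode()).hexdigest()
            bits.add(int(digest, 16) % L)
    return bits

def dice(a: set, b: set) -> float:
    """Dice similarity of two encodings; values near 1 indicate a likely match."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

# A typo changes only a few q-grams, so the similarity stays high, while a
# different person yields a low score; only the encodings are ever exchanged.
print(dice(bloom_bits("jan novak 1975-07-30"), bloom_bits("jan nowak 1975-07-30")))
print(dice(bloom_bits("jan novak 1975-07-30"), bloom_bits("anna meier 1980-01-02")))
```

For databases of realistic size, such pairwise comparisons additionally have to be combined with blocking or indexing so that not every pair of records is compared, which is part of the computational efficiency discussed above.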


Determinants of Consent to Administrative Records Linkage in Next Steps - A Large-Scale Longitudinal Cohort Study in the U.K.

Dr Darina Peycheva (Centre for Longitudinal Studies – UCL Institute of Education) - Presenting Author
Professor George Ploubidis (Centre for Longitudinal Studies – UCL Institute of Education)
Dr Lisa Calderwood (Centre for Longitudinal Studies – UCL Institute of Education)

The value of enhancing survey data through linkage to administrative records is increasingly recognized among survey researchers. Linking survey and administrative data enables rich information from administrative records on a broad range of substantive areas to be combined with survey responses.

However, records linkage requires participants' informed consent, and participants' refusal to give permission for administrative records linkage (i.e., nonconsent) leads to a reduction in the size of the sample for which administrative data can be linked to the survey and, more importantly, to potential bias resulting from differential patterns of consent, which can produce patterns of missing data that are not completely random.

The existing research on patterns of consent to records linkage shows that there are biases in respondents' consent, that the samples of consenting respondents may not be representative of the populations studied, and that these biases differ according to the type of data requested. Furthermore, it suggests that the predictors of consent align with factors predictive of (non)response in general, such that respondent demographic and socio-economic characteristics and survey design features have a strong impact on consent (Mostafa, 2016; Jenkins et al., 2004; Dunn et al., 2004; Watson & Wooden, 2009).

This paper contributes to the existing knowledge about consent to administrative records linkage by exploring consent patterns in a large-scale longitudinal mixed-mode cohort study in the UK. We use data from Next Steps (previously known as the Longitudinal Study of Young People in England) to examine determinants of consent in the health, education, economic, and crime domains, and to assess the consistency of our findings with the existing literature on consent patterns.

We employ a data-driven analytical approach, as opposed to a theory-driven one, making use of all available data collected during the study lifecycle rather than an arbitrary selection of variables guided by the literature on consent bias, and exploiting the possibility of unknown predictors of consent from various domains. Capitalising on the richness of the Next Steps information available from prior sweeps, we will enable researchers to consider a wider range of predictors of consent as auxiliary variables in principled approaches to missing data handling. It is well known that approaches such as multiple imputation, full information maximum likelihood, linear increments, and inverse probability weighting operate under the Missing At Random (MAR) assumption, which implies that most predictors of missingness should be included in the model, or that selection into consent is due to observables (Little and Rubin, 2002; Carpenter and Kenward, 2013). Therefore, identifying all predictors of consent will help maximise the plausibility of the MAR assumption in any analysis that utilises the linked data of consenting respondents.
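
To make the link to analysis concrete, the sketch below uses simulated data and hypothetical variable names (it is not the authors' model) to show one principled use of identified consent predictors: fitting a consent-propensity model and deriving inverse-probability-of-consent weights for the linked subsample. The same predictors could equally enter a multiple imputation model as auxiliary variables.

```python
# Illustrative consent-propensity model and inverse probability weights;
# data are simulated and variable names hypothetical (not the authors' model).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
cohort = pd.DataFrame({
    "female": rng.integers(0, 2, n),             # toy covariates standing in for
    "prior_nonresponse": rng.integers(0, 2, n),  # rich data from prior sweeps
    "parental_education": rng.integers(0, 4, n),
})

# Simulate a consent indicator that depends only on observables (MAR-style).
logit = 0.8 - 1.2 * cohort["prior_nonresponse"] + 0.3 * cohort["female"]
cohort["consent"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# 1) Model consent as a function of the observed predictors.
X = cohort[["female", "prior_nonresponse", "parental_education"]]
propensity = LogisticRegression().fit(X, cohort["consent"])
cohort["p_consent"] = propensity.predict_proba(X)[:, 1]

# 2) Inverse-probability-of-consent weights for the linked (consenting) subsample.
linked = cohort[cohort["consent"] == 1].copy()
linked["ipw"] = 1.0 / linked["p_consent"]

# Analyses of the linked data can weight by `ipw`; the MAR assumption becomes
# more plausible the more predictors of consent enter the model above.
print(linked["ipw"].describe())
```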