BigSurv20 program



Link and learn: Improving estimation through data linkage and machine learning

Moderator: Antje Kirchner
Friday 20th November, 11:45 - 13:15 (ET, GMT-5)
8:45 - 10:15 (PT, GMT-8)
17:45 - 19:15 (CET, GMT+1)

Data integration of national surveys and nonprobability samples to enhance analytic capacity

Dr Steven Cohen (RTI International) - Presenting Author
Ms Jennifer Unangst (RTI International)

The quality and content of national population-based surveys are often enhanced through integrated designs that link additional medical, behavioral, environmental, socio-economic and financial content from multiple sectors. This would include connectivity to existing secondary data sources at higher levels of aggregation and via direct matches to additional health and socioeconomic measures at the individual level acquired from other sources of survey, health system, economic or administrative data. Advances in data science are also serving to facilitate the effective and efficient utilization of statistical methods in concert with big data applications to develop these enhanced analytical platforms and infrastructure. Recent efforts by the Committee on National Statistics of the National Academy of Sciences are serving as a catalyst to advance future national data integration efforts. These integrated data platforms would include content drawn from nonprobability-based samples to enhance analytic capacity.
The integration of national survey data with content from nonprobability based samples has the capacity to provide greater insights than possible from any of the component sources. Often, these data integration efforts attempt to step beyond the representational limitations of the content obtained from the nonprobability-based samples when attempting to augment the nationally representative survey data. In this presentation, we focus on an alternative framework to account for the limitations of the estimates derived from nonprobability samples. Attention is given to the identification of the segment of the target population represented by the nonprobability-based samples and restricting inferences based on the integrated data to these subdomains.
In this study, the content in selected Project Data Sphere (PDS) cancer patient-level phase III clinical datasets has been augmented by linking the social, economic, and health-related characteristics of like cancer survivors from nationally representative health and health care-related data from the Medical Expenditure Panel Survey (MEPS). The MEPS is characterized by an integrated design, with analytic enhancements achieved through data linkages to other national surveys including the National Health Interview Survey, surveys of medical providers, medical organizations and employers. The MEPS is also linked to administrative data from Medicare and Medicaid, the National Death Index and other secondary data sources. The PDS research platform was established to provide broad access to de-identified patient-level data from clinical trials that enable big data-driven research.
Using statistical linkage and model-based techniques, patient-level records in selected PDS datasets have been linked to those of comparable cancer survivors, and are thereby augmented with survey content on social, economic, and health-related characteristics. This study provides an overview of the methodologies used to connect patient-level clinical trial data with nationally representative health-related data on cancer survivors from the MEPS. Study findings include probabilistic assessments of the representation of the patients in the respective clinical trials relative to the characteristics of cancer survivors in the general population and an evaluation of the reproducibility of analytic findings. The study illustrates the enhancements achieved to the analytic capacity and utility of the PDS cancer clinical trial data through

Combining probability and non-probability sample data from administrative data to produce official statistics

Dr Taylor Lewis (RTI International) - Presenting Author
Dr Dan Liao (RTI International)
Dr Marcus Berzofsky (RTI International)
Dr Alexia Cooper (RTI International)

Recently, increased attention has been given to exploiting administrative data for purposes of estimating official statistics. However, the utility of administrative data is often hampered by coverage limitations caused by some administrative sources within the population of interest not participating in the collection of data in the proper form. Strategies frequently used in applied survey research, such as weighting, can be employed to compensate for the undercoverage, but it is generally difficult to know whether the bias has been eliminated. In this paper, we present research on methods for selecting a representative (i.e., probability-based) sample of non-participating administrative sources and blending it with the existing non-representative set through a series of weight calibration steps. Through access to an alternative “gold standard” source for several key aggregate measures, we indirectly assess whether the resultant weights can be used to generate unbiased official statistics.

The specific administrative data source used to illustrate these strategies is the National Incident-Based Reporting System (NIBRS). NIBRS was implemented in the early 1990s to address certain critical shortcomings of the Uniform Crime Reporting program, a key source of official crime statistics in the United States since the 1930s. NIBRS collects much more detailed information on criminal events, but not all law enforcement agencies (LEAs) have modernized their IT systems to transmit data in the requisite format. At present, approximately 40% of the roughly 18,000 LEAs in the United States report to NIBRS. To address the coverage gap, the Bureau of Justice Statistics and the Criminal Justice Information Services of the Federal Bureau of Investigation established the National Crime Statistics Exchange (NCS-X) Initiative. The purpose of NCS-X was to designate a stratified random sample of 400 LEAs from the pool of non-NIBRS reporters as of 2011 to receive additional resources and support to facilitate their transition into NIBRS reporters. Unfortunately, as of 2020, not all 400 LEAs have transitioned to report NIBRS data. Complicating matters further, hundreds of LEAs outside of the 2011 NCS-X sample have begun reporting to NIBRS. Thus, it is not immediately obvious how best to combine and weight data from these two subsets of LEAs with LEAs already reporting to NIBRS in 2011 to represent the full population. Attendees can expect to gain a deeper understanding of the underlying assumptions, challenges, and validation strategies employed to develop LEA-level calibrated weights for inference. These topics should generalize to any other situation – using administrative data or otherwise – where data from a probability and non-probability sample need to be combined to make proper population inferences.
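As a rough illustration of what a single weight calibration step can look like, the sketch below scales base weights within strata so that the weighted total of an auxiliary measure matches a known benchmark total. It assumes a simple ratio adjustment; the abstract does not specify the calibration procedure actually used, and the function name `calibrate_weights`, the stratum labels, and all numbers are hypothetical.

```python
from collections import defaultdict

def calibrate_weights(units, benchmarks):
    """Ratio-adjust base weights within each stratum.

    units: list of dicts with keys 'stratum', 'weight' (base weight)
           and 'x' (auxiliary measure with a known population total).
    benchmarks: dict mapping stratum -> known population total of x.
    Returns calibrated weights in the same order as `units`.
    """
    # Weighted total of x per stratum under the base weights.
    weighted_totals = defaultdict(float)
    for u in units:
        weighted_totals[u["stratum"]] += u["weight"] * u["x"]

    # Scale every weight in a stratum by benchmark / weighted total.
    calibrated = []
    for u in units:
        factor = benchmarks[u["stratum"]] / weighted_totals[u["stratum"]]
        calibrated.append(u["weight"] * factor)
    return calibrated

# Hypothetical agencies: self-selected reporters carry weight 1,
# while a sampled agency carries its design weight (here 4).
units = [
    {"stratum": "large", "weight": 1.0, "x": 500.0},
    {"stratum": "large", "weight": 1.0, "x": 300.0},
    {"stratum": "small", "weight": 4.0, "x": 20.0},
    {"stratum": "small", "weight": 1.0, "x": 30.0},
]
benchmarks = {"large": 1000.0, "small": 220.0}  # hypothetical "gold standard" totals
weights = calibrate_weights(units, benchmarks)
```

After calibration, the weighted total of `x` in each stratum reproduces the benchmark exactly. Whether estimates of other variables are unbiased still depends on how well the auxiliary measure explains participation.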

The bias of crime statistics: Assessing the impact of data bias on police analysis and crime mapping

Dr David Buil-Gil (University of Manchester) - Presenting Author
Dr Angelo Moretti (Manchester Metropolitan University)
Mr Samuel Langton (Manchester Metropolitan University)

Police-recorded crimes are the main source of information used by police forces to analyse crime patterns, investigate the spatial concentration of crime, and design spatially targeted strategies. Police statistics are used to design and evaluate crime prevention policies and to develop theories of crime. Nevertheless, crimes known to police are affected by biases and unreliability driven by unequal crime reporting rates across social groups and geographical areas. The measurement error affecting the reliability of crime statistics merits deeper scrutiny, since it affects everyday police practices, criminal policies and citizens’ everyday lives. Yet it is an understudied issue, and the implications of data biases for crime mapping are unknown. Moreover, police analyses are moving towards the study of smaller levels of geography than ever before, such as street segments with highly homogeneous communities. Maps produced from police records are used to foreground the micro places where rates of recorded crimes are larger. This paper presents a simulation study and an application to analyse the impact of data biases on crime maps produced from police records at different spatial scales. It assesses whether micro-level maps are affected by a larger risk of bias than maps produced at larger scales.
Based on parameters obtained from the UK Census 2011, we simulated a synthetic population consistent with the socio-demographic and spatial characteristics of Manchester, England. Then, based on model parameters derived from the Crime Survey for England and Wales 2011/12, we simulated the number and type of crimes suffered by individuals across social groups and areas, and predicted the likelihood of these crimes being known to police. This allowed us to compare the relative difference between all crimes and police-recorded incidents at different scales: (a) 1,530 Output Areas with an average of 328.8 residents, (b) 282 LSOAs, (c) 57 MSOAs, and (d) 32 wards. While the average relative difference between all crimes and those known to police is close to 62% at all geographical scales, the dispersion of this relative difference is much larger when crime incidents are aggregated at the level of small geographies. In other words, when producing maps at the scales of medium-level geographies, the percentage of crimes known to police is similar in all areas, and thus the risk that police statistics underestimate or overestimate crime rates in some areas more than others is small; whereas the percentage of unknown crimes varies widely across micro places. This has important implications for policing, policy making and research: police strategies, criminal policies and crime theories drawn from police records aggregated at the scales of small communities are likely to be affected by large biases that underestimate the prevalence of crime in certain places while overestimating its prevalence in others. We also provide an application that shows the large impact that data biases have on micro-level maps produced from crimes registered by Greater Manchester Police in 2011.
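The scale effect described above can be reproduced with a toy simulation. This is a sketch under strong assumptions, not the authors' simulation: crime counts and reporting probabilities are drawn independently for each small area from arbitrary ranges, and large areas are formed by grouping consecutive small areas. All parameter values and the function name `simulate` are hypothetical.

```python
import random
import statistics

def simulate(n_small=1000, group_size=50, seed=0):
    """Return relative differences (%) between all crimes and recorded
    crimes for small areas and for large areas built from them."""
    rng = random.Random(seed)
    true_counts, recorded_counts = [], []
    for _ in range(n_small):
        n_crimes = rng.randint(20, 60)    # crimes occurring in the area
        p_report = rng.uniform(0.2, 0.6)  # area-specific reporting probability
        recorded = sum(rng.random() < p_report for _ in range(n_crimes))
        true_counts.append(n_crimes)
        recorded_counts.append(recorded)

    def rel_diff(true_total, recorded_total):
        return 100.0 * (true_total - recorded_total) / true_total

    small = [rel_diff(t, r) for t, r in zip(true_counts, recorded_counts)]
    # Aggregate consecutive small areas into larger units.
    large = [
        rel_diff(sum(true_counts[i:i + group_size]),
                 sum(recorded_counts[i:i + group_size]))
        for i in range(0, n_small, group_size)
    ]
    return small, large

small, large = simulate()
sd_small = statistics.stdev(small)  # dispersion across small areas
sd_large = statistics.stdev(large)  # dispersion across large areas
```

Under these assumptions the mean relative difference is similar at both scales, while `sd_small` is several times `sd_large`, because area-level differences in reporting average out when areas are merged. Spatially correlated reporting rates, which are likely in real data, would weaken this averaging.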

Estimation of time-varying state correlation in state space models

Miss Caterina Schiavoni (Maastricht University) - Presenting Author
Professor Siem Jan Koopman (Vrije Universiteit Amsterdam)
Professor Franz Palm (Maastricht University)
Dr Stephan Smeekes (Maastricht University)
Professor Jan van den Brakel (Statistics Netherlands)

Statistics Netherlands uses a state space model to estimate Dutch unemployment from monthly series of the Labour Force Survey (LFS). More accurate estimates of this variable can be obtained by including auxiliary information in the model, such as the univariate administrative series of claimant counts and the high-dimensional series of Google searches related to unemployment. Legislative changes may affect the relation between unemployment and claimant counts. Additionally, the relevance of specific search terms, as well as internet search behaviour, might change over time. We propose four different methods to estimate the relations between unemployment and the above-mentioned auxiliary series as time-varying: a generalized autoregressive score (GAS) approach, cubic splines, an extended Kalman filter, and importance sampling. The first three estimation methods allow testing the null hypothesis of constant relations and building confidence intervals. We conduct a simulation study in order to assess the performance of all estimation methods.
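To give a feel for how a filter can track a relation that changes over time, the sketch below runs a standard linear Kalman filter on a scalar regression whose coefficient follows a random walk. This is a much simpler stand-in than the methods in the abstract: it is not an extended Kalman filter and does not model a correlation between state disturbances. The variances `q` and `r`, the initial values, and the data are hypothetical.

```python
def filter_time_varying_coefficient(y, x, q=0.1, r=1.0, b0=0.0, p0=1e6):
    """Kalman filter for y_t = b_t * x_t + e_t with b_t = b_{t-1} + u_t.

    q: assumed variance of the coefficient innovation u_t.
    r: assumed variance of the observation noise e_t.
    b0, p0: initial coefficient and its (diffuse) variance.
    Returns the filtered estimates of b_t.
    """
    b, p = b0, p0
    estimates = []
    for y_t, x_t in zip(y, x):
        # Prediction step: random-walk state, so only the variance grows.
        p_pred = p + q
        # Update step.
        innovation = y_t - b * x_t
        s = x_t * x_t * p_pred + r      # innovation variance
        k = p_pred * x_t / s            # Kalman gain
        b = b + k * innovation
        p = (1.0 - k * x_t) * p_pred
        estimates.append(b)
    return estimates

# Hypothetical noise-free data: the relation shifts from 2 to 3 halfway through.
x = [1.0] * 100
y = [2.0] * 50 + [3.0] * 50
est = filter_time_varying_coefficient(y, x)
```

The filtered coefficient sits near 2 before the break and converges to 3 afterwards. A larger `q` makes the filter react faster to such a change, at the cost of noisier estimates.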

Bayesian forecasting of voter intention from a pre-election survey with complex sampling: A case study from Uruguay

Mr Stephen Hornbeck (U.S. Department of State)
Dr Sarah Staveteig Ford (U.S. Department of State) - Presenting Author
Dr Matthew Williams (National Science Foundation)
Dr Terrance Savitsky (Bureau of Labor Statistics)

Most international face-to-face surveys employ a disproportionately stratified cluster sampling design. Bayesian estimation methods on these samples do not correctly incorporate the design error: the usual estimation approach utilizes a sampling-weighted pseudo posterior that produces overly optimistic confidence intervals, since it does not account for variation induced by the sampling design distribution. This paper presents a case study from a pre-election survey carried out in Uruguay in September of 2019, using a complex sampling design. The survey was fielded prior to the first round of the 2019 Uruguayan Presidential elections, during which no candidate obtained a majority. The top two winners of the first round, Daniel Martínez and Luis Lacalle Pou, faced off in a runoff contest in November of 2019. There was considerable uncertainty about how voters for defeated first-round candidates would coalesce around either of the two remaining candidates. We apply a Bayesian estimation approach to our pre-election survey, with a correction for posterior draws that produces correct uncertainty quantification, and compare our approach with usual practices for estimation of confidence intervals. Using our Uruguay example, we illustrate how expert knowledge may be combined with the posterior distributions for voter proportions to estimate the probabilities of which coalition of candidates will subsequently win an election. Our results show promise for other Bayesian forecasting applications with complex survey data.
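For readers unfamiliar with the sampling-weighted pseudo posterior, a minimal sketch for a single voter proportion follows. It assumes a Beta prior and a binary vote indicator, with weights normalised to sum to the sample size. It deliberately omits the correction for posterior draws discussed in the abstract, so its intervals would show the overly optimistic behaviour described there under a clustered design. The function name and data are hypothetical.

```python
def weighted_beta_pseudo_posterior(votes, weights, prior_a=1.0, prior_b=1.0):
    """Beta pseudo-posterior for a proportion under sampling weights.

    votes: list of 0/1 indicators (1 = intends to vote for the candidate).
    weights: sampling weights, one per respondent.
    Each respondent's likelihood contribution is raised to the power of
    their normalised weight, which keeps the Beta form conjugate.
    Returns the (a, b) parameters of the Beta pseudo-posterior.
    """
    n = len(votes)
    total = sum(weights)
    # Normalise so the weights sum to n (the sample's information content).
    norm = [w * n / total for w in weights]
    a = prior_a + sum(w for w, v in zip(norm, votes) if v == 1)
    b = prior_b + sum(w for w, v in zip(norm, votes) if v == 0)
    return a, b

# Hypothetical respondents: supporters were oversampled (smaller weights).
votes = [1, 1, 0, 0]
weights = [1.0, 1.0, 3.0, 3.0]
a, b = weighted_beta_pseudo_posterior(votes, weights)
posterior_mean = a / (a + b)
```

The unweighted sample proportion here is 0.5, whereas the weighted pseudo-posterior mean is pulled toward the under-sampled non-supporters.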