BigSurv20 program

Volume and/or value?! Improving data quality in the era of big data

Moderator: John Finamore (jfinamor@nsf.gov)

Friday 13th November, 10:00 - 11:30 (ET, GMT-5)
7:00 - 8:30 (PT, GMT-8)
16:00 - 17:30 (CET, GMT+1)

Big data: big claims or bigger calamities? Getting real about survey research in the fourth era

Dr Trent Buskirk (Bowling Green State University) - Presenting Author
Dr Antje Kirchner (RTI International)

In his 2011 landmark essay, Groves articulated three distinct eras of survey research. The first era saw a boom in the development of survey data collection methods and tools to produce statistical information. The second era saw an increase in and refinement of these tools, especially to support the production of official statistics. In the third era, declining response rates to more traditional methods of survey data collection drove innovation in alternative data collection modes, and rising internet penetration led to more interest and development in web surveys. The growing penetration of both internet service and smartphones has led to a new stream of data that are collected through meters, sensors and apps. These and other big data sources emerging in the third era began to be referred to collectively as organic or big data.

In this paper we argue that we are transitioning from the third era of survey research into a new, fourth era. While identification and exploration of organic data for producing estimates of public opinion and official statistics started in the third era, we posit that the use of big data and big data analytics will be taken to a new scale in the fourth era.

There are five main developments that define and differentiate the fourth era from the previous three. First, both big data sources and big data analytics will permeate virtually every part of the survey process, including study design, questionnaire design, sampling design, sample recruitment, data processing and estimation. As such, these organic data sources are not just being considered as an augment or replacement for surveys to produce estimates of public opinion or official statistics, but they are also being used as sample sources, as augments to existing sampling frames, or as sources for question construction. Second, in the fourth era, survey researchers will be expanding the fit-for-purpose paradigm to include big data sources in their own right. Third, survey researchers will continue to vastly expand the use of artificial intelligence for all aspects of survey research, including increased automation at scale using machine learning methods for processing, coding and imputation, among others. Fourth, survey researchers will not just be using big data sources in the fourth era, but instead will integrate their experience with designed data and contribute both data and methods to the big data ecosystem. We see particularly great potential for survey researchers to offer improvements in the sampling of big data sources and in data quality evaluation. Fifth, survey researchers will begin to leverage the “why” aspect of survey data and show how this aspect of survey information adds value to and enhances the predictive models widely used within the big data ecosystem.

In this paper we characterize the fourth era and its implications for survey research and data/computer science. Specifically, we provide examples of how various big data sources are used within each part of the survey research process, and discuss how survey research is adding value to the big data sources and data science methods in general.

How worried should we be? The implications of fabricated survey data for political analysis

Dr Oscar Castorena (Vanderbilt University) - Presenting Author
Dr Mollie Cohen (University of Georgia)
Dr Noam Lupu (Vanderbilt University)
Dr Elizabeth Zechmeister (Vanderbilt University)

Political scientists have become increasingly concerned about the reliability of data from enumerator-administered surveys. In many cases, fieldwork monitoring is insufficient to prevent enumerators from fabricating interviews instead of undertaking the laborious work of actually interviewing people. How worried should we be about this possibility? How likely are fabricated data to alter the inferences we draw from survey data? To answer these questions, we leverage a unique dataset from Venezuela. While conducting a national survey in 2016-17, the LAPOP Lab deployed an extensive protocol for fieldwork monitoring that discovered and replaced over 400 fabricated interviews – nearly one-third of the sample. We use this dataset to compare inferences based on the clean dataset to the compromised dataset that would have resulted had these cases not been replaced. We find that estimates of frequencies are sometimes affected by the existence of fraud, but that correlational results hold, even in a dataset with an unusually high proportion of fabricated cases. This is because enumerators largely seem to fabricate plausible data. We also find that post hoc efforts to uncover fraud fail to detect nearly all the fabricated cases in our data.
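
The comparison described in this abstract (a frequency estimate and a correlational result, each computed on the clean and on the compromised dataset) could be sketched as follows. This is a minimal illustration only, not the authors' analysis: the record layout as `(x, y)` pairs and the function names `proportion`, `pearson` and `compare_datasets` are assumptions made for this sketch.

```python
from math import sqrt


def proportion(values):
    """Share of 1s in a list of 0/1 indicators (a simple frequency estimate)."""
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def compare_datasets(clean, compromised):
    """Return compromised-minus-clean differences for both estimates.

    Each dataset is a list of (x, y) pairs, where x is a 0/1 indicator
    and y is a numeric score (hypothetical layout for this sketch).
    """
    def estimates(rows):
        xs = [x for x, _ in rows]
        ys = [y for _, y in rows]
        return proportion(xs), pearson(xs, ys)

    p_clean, r_clean = estimates(clean)
    p_comp, r_comp = estimates(compromised)
    return {
        "proportion_diff": p_comp - p_clean,
        "correlation_diff": r_comp - r_clean,
    }
```

A small `proportion_diff` or `correlation_diff` would indicate that the fabricated cases barely move the corresponding estimate; in practice one would also want standard errors for both differences.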

Exploring precision farming data: A valuable new data source for official statistics?

Dr Ger Snijkers (Statistics Netherlands) - Presenting Author
Mr José Gómez Pérez (Statistics Netherlands)
Dr Sofie De Broe (Statistics Netherlands)

Businesses are constantly innovating. New business processes, depending on the industry, are heavily data-driven nowadays. A good example is precision farming, where sensors aid farmers in their business operations. The generated data are full of information that might also be useful for National Statistical Institutes (NSIs), and could be used instead of collecting data via survey questionnaires. In that case, the response burden on these businesses could be greatly reduced, while at the same time improving data quality. In addition, these sensor data could contain information that traditionally was too detailed and technical to ask for in questionnaires. This might address highly relevant issues in modern society, such as the need for detailed and timely information on the use of crop protection, soil humidity, manure production, etc. Furthermore, these data may be used in (nearly) real-time benchmark indicators as a complement to statistics. In theory, sensor data thus might be a valuable new source for official statistics.
To investigate this potential, a number of studies have been conducted. First, a case study was carried out in which Statistics Netherlands (CBS) worked together with an innovative arable farmer and the Eindhoven University of Technology. The farmer made a selection of his (sensor) data available to CBS for the purpose of exploring the data for overlap with relevant surveys and other promising aspects. Within this case study, the data-generating process and the available data were first examined with regard to data quality issues. Next, the overlap between the generated data and the data asked for in questionnaires was studied to get an idea of the options.
Next, Farm Management Information Systems have been studied. In the Netherlands two major crop registration systems are used by arable farmers. We studied the content of one of them. The latest step is the exploration of data collected by one of the major producers of precision agriculture machines. These machines generate a lot of data that are stored in their cloud system. Instead of collecting data from individual farmers, here we contacted the machine manufacturers.
The conclusion is that these data sources may be valuable, but there is still a long way to go. Challenges that surfaced during this research included data quality issues and missing meta-information. For farmers, trust in the data user when sharing data is of utmost importance; for market parties, the question “what is in it for me?” is paramount. These challenges will be discussed. In addition, a number of other criteria that are relevant when using these new data sources as input for official statistics will be discussed. These include the ubiquity of these data among farmers, harmonisation of data definitions, data access, and stability of data definitions and delivery over time, to name a few. Even though there is still a long way to go, we feel that now is the time to start examining and discussing the challenges and options with regard to these new data sources in order to be ready for the future.

Multilevel Multiple Imputation in Big and Complex Administrative Data

Dr Amang Sukasih (RTI International) - Presenting Author
Dr Dan Liao (RTI International)
Ms Jeniffer Iriondo-Perez (RTI International)
Mr Philip Lee (RTI International)
Dr Marcus Berzofsky (RTI International)
Dr Alexia Cooper (Bureau of Justice Statistics)

Similar to other “Big Data” sources, administrative data are not free of item missingness or unknown values. When treatment of missing values is considered, such as statistical imputation, the size and complex structure of the data pose some challenges. For example, when an administrative data collection contains multivariate multilevel (hierarchical) data, imputation of its missing data needs to consider a multilevel modeling approach, because variables may correlate across different levels, and failing to account for the full structure of multilevel data may create bias in the imputed data. In addition, unlike item nonresponse caused by the reporting unit refusing to provide information for an item, item missingness in administrative data may be due to other mechanisms that should be incorporated into the treatment of missing values. In a multilevel database, failure to report an observational case at different levels (e.g. schools, classes, or students) will result in missing values for a set of related variables (block missingness), which could be a challenge for statistical modeling. Another mechanism that produces item missingness (in this case, “masked” missingness) involves the existence of an “unknown” response category. Several factors need to be considered when deciding whether a response of “unknown” should be treated as item-level missingness and imputed, or as a legitimate non-missing value. One such factor is whether the data item conflates a legitimate “unknown” with the reporter's inability to report the information. Furthermore, given the large size of many administrative data sources, it is not clear whether the effort of implementing complex imputation methods is justified by the impact it will have on the estimates produced.

In this presentation we describe dealing with missing and unknown values in the U.S. National Incident-Based Reporting System (NIBRS), an incident-based reporting system used by law enforcement agencies for collecting and reporting data on crimes. NIBRS contains more than 5 million incidents reported annually, featuring data elements in a complex multilevel structure that captures information on reporting agencies, details on each single crime incident reported by them, as well as on separate offenses, victims, and offenders within the same incident. The complexity of the data and its missingness patterns requires us to consider these factors and break down the imputation into several approaches. Because NIBRS data have a multilevel structure, our imputation work needs to use hierarchical modeling. In addition, imputing for block missingness requires us to first estimate the number of unreported incidents. Moreover, because benchmark totals are not available, and in order to account for uncertainty in our estimates, we use multiple imputation to fill in missing values. Last but not least, we need to be able to use open-source software or freeware for the computation.
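
The abstract does not say how estimates from the multiply imputed datasets are combined. A common choice is Rubin's rules, which add the average within-imputation variance to an inflated between-imputation variance so that the uncertainty due to imputation is reflected in the final estimate. The sketch below illustrates that pooling step only, under that assumption; the function name `pool_rubin` and the input numbers are hypothetical and are not taken from the NIBRS work.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances using Rubin's rules.

    estimates: one point estimate per imputed dataset
    variances: the squared standard error of each estimate
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")

    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates from m = 3 imputed datasets
pooled, total = pool_rubin([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(pooled, total)
```

The between-imputation term is what distinguishes multiple imputation from single imputation: a single filled-in dataset would report only the within-imputation variance and so understate the uncertainty.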

Both sides of the story: Combining student-level data on reading performance from administrative registers with data from a reading app

Professor David Reimer (Aarhus University, School of Education) - Presenting Author
Dr Stefan Oehmcke (Copenhagen University, Department of Computer Science)
Dr Bent Sortkær (Aarhus University, School of Education)
Dr Ida Gran Andersen (Aarhus University, School of Education)
Mr Emil Smith (Aarhus University, School of Education)

Most public schools in Denmark today work with digital learning platforms and digital teaching tools. One of these tools is a reading program (app) called Bookbites*, which offers students (grades 1-9) unlimited access to a large variety of e-books and also rewards students for frequent reading with stars and medals - similar to practices best known from fitness tracker applications. At the same time, Denmark has a large database of administrative registers (see Jensen & Rasmussen 2011), where all student data on educational progression and performance are recorded. The administrative registers also allow for the identification of parental characteristics such as parental income, profession and level of education. In this paper we take advantage of these features and report first results from an innovative merged dataset of reading-app data and administrative registers from a subset of Danish public schools that granted us access to the reading-app data - in full compliance with GDPR regulations. While the registers provide access to student test data on reading (based on item response theory) and rich measures of the students' social background, the reading app records reading speed and reading frequency as well as the difficulty and genre of the books that the students read. In a first step, the paper reports simple descriptive comparisons of student test-performance data for various grades and other subgroups with key user-generated reading-app data on reading speed and frequency. In a second step, we will consider the reading-app measures, primarily reading-speed and frequency patterns, as dependent variables and regress them on student background and school characteristics. That way we will gain a better understanding of the relative importance of individual- and school-level factors that affect app use and performance. In additional analyses we will also explore the selection into use of the reading app.
We conclude the paper with a discussion of what kind of story each of the two data sources tells us about student reading performance and behavior. Avenues for further research and analyses will be discussed.

References:
Jensen, V. M., & Rasmussen, A. W. (2011). Danish education registers. Scandinavian Journal of Public Health, 39(7_suppl), 91–94. https://doi.org/10.1177/1403494810394715
*https://bookbites.com/en/
