BigSurv18 program







Leveraging Big and Nontraditional Datasets to Reduce Burden, Increase Response, and Assess Survey Quality

Chair: Ms Nancy Bates (US Census Bureau)
Time: Saturday 27th October, 14:00 - 15:30
Room: 40.004

This session highlights research findings from a broad spectrum of Big Data applications across both household and establishment surveys. Within the context of establishment surveys, the presenters will share results from tests designed to reduce respondent burden and improve point estimates and geographic granularity when substituting survey reports with directly extracted data and point-of-sale transactions. Additionally, the use of machine learning to reduce the burden of reporting under complicated classification systems will be discussed. Within the context of household surveys, the presenters will share success stories of blending marketing data with more traditional variables to improve response propensity models for the purposes of adaptive survey designs and tailored and targeted nonresponse interventions. Finally, a two-way quality assessment will be reported in which Zillow.com data are linked to household survey data to better understand the quality of each source.

Alternative Approaches for Measuring the Movement of Goods in the United States

Ms Julie Parker (U.S. Bureau of Transportation Statistics) - Presenting Author
Ms Joy Sharp (U.S. Bureau of Transportation Statistics)


The U.S. Commodity Flow Survey (CFS) collects information from establishments about the goods they ship and the value, weight, mode of transportation and destination of their shipments. These data provide a comprehensive, multimodal picture of national freight flows for sampled industries. Furthermore, the CFS provides the core dataset used in constructing the Department of Transportation’s Freight Analysis Framework (FAF). The FAF integrates data from a variety of sources to create an even more comprehensive picture of freight movements by all modes of transportation.

To remain effective in measuring nationwide goods movements, it is imperative that CFS and FAF leverage improved data collection technologies and alternative data sources. This presentation will discuss two new approaches for potential implementation in the next CFS and FAF. The first approach explores the feasibility of extracting more timely and voluminous shipment information directly from companies using traditional business systems such as Enterprise and Distribution Resource Planning (ERP and DRP). Often records in these systems are quite different from the CFS in terms of content and structure. Requests to access these systems can also elevate concerns about privacy and use. However, successful implementation could greatly improve estimate precision, geographic granularity and frequency of estimates, while reducing respondent burden.

In addition, research is ongoing to determine whether machine learning algorithms can be implemented for commodity classification. The CFS currently asks respondents to classify commodities according to the Standard Classification of Transported Goods (SCTG). This is an especially burdensome task because respondents do not use these codes as part of their normal business functions and there are several hundred codes. Respondent burden could be significantly lessened if the SCTG code could be predicted and assigned based on the respondent’s reported description of the commodity.
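As a rough illustration of predicting a classification code from a free-text description, the sketch below trains a small multinomial naive Bayes classifier in plain Python. The abstract does not say which algorithm is under study, so naive Bayes is an assumption chosen for brevity, and the training descriptions and code labels (`CODE_GRAIN`, `CODE_FUEL`) are hypothetical placeholders rather than real SCTG codes.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a free-text description and split it into word tokens."""
    return text.lower().split()


def train(examples):
    """Fit a multinomial naive Bayes model on (description, code) pairs."""
    word_counts = defaultdict(Counter)  # code -> token counts
    code_counts = Counter()             # code -> number of training descriptions
    vocab = set()
    for description, code in examples:
        code_counts[code] += 1
        for token in tokenize(description):
            word_counts[code][token] += 1
            vocab.add(token)
    return word_counts, code_counts, vocab


def predict(model, description):
    """Return the code with the highest posterior log-probability."""
    word_counts, code_counts, vocab = model
    total_docs = sum(code_counts.values())
    best_code, best_score = None, -math.inf
    for code in sorted(code_counts):
        score = math.log(code_counts[code] / total_docs)
        total_tokens = sum(word_counts[code].values())
        for token in tokenize(description):
            # Add-one smoothing so an unseen word does not zero out a class.
            score += math.log(
                (word_counts[code][token] + 1) / (total_tokens + len(vocab))
            )
        if score > best_score:
            best_code, best_score = code, score
    return best_code


# Hypothetical training data: the labels are placeholders, not real SCTG codes.
TRAINING = [
    ("bulk wheat grain", "CODE_GRAIN"),
    ("corn and barley grain", "CODE_GRAIN"),
    ("gasoline fuel tanker", "CODE_FUEL"),
    ("diesel fuel drums", "CODE_FUEL"),
]

model = train(TRAINING)
predicted = predict(model, "pallets of wheat")
print(predicted)
```

In practice a classifier of this kind would need a large set of labeled descriptions and some way to route low-confidence predictions to manual review; neither is shown here.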


Using Alternative Data Sources to Reduce Respondent Burden in United States Census Bureau Economic Data Products

Final candidate for the monograph

Ms Rebecca Hutchinson (U.S. Census Bureau) - Presenting Author

Increased respondent burden has led to declining response rates for many of the United States Census Bureau’s economic data products, including the indicator programs. Many high-burden or non-reporting companies already provide comparable data to private sector companies for market research purposes. Can that same data be used by the Census Bureau to reduce respondent burden while maintaining or enhancing the quality of its published data?
In 2017, a pilot project was undertaken by an Economic Directorate retail big data team along with the NPD Group, Inc., a privately held market information and business solutions company that captures point-of-sale transaction data from major retailers at the store and product level. In this pilot, we compared aggregated sales data from NPD at the product, store, and national levels to sales data reported by the retailer to our retail data products. This effort found that the NPD sales data aligned well at the national level when compared to data reported to our Monthly Retail Trade Survey and the Annual Retail Trade Survey, and at the store level when compared to the 2012 Economic Census.

Building upon the success of the pilot, we are now expanding the effort to a production environment to include up to 100 more retailers with a focus on non-responding and high-burden retailers. This project has been identified as a pilot candidate for the new Census Bureau Data Lake, which would allow data sharing across the enterprise.
This paper focuses on the potential gains and challenges that this expansion effort brings with it. With this data, we have the potential to reduce respondent burden, improve data quality, and create additional data products. In order to achieve those goals, we need to develop both an efficient and effective quality review process of the data and an automated, seamless transfer of the data to survey infrastructure, all while operating under limited budgetary resources.


Using Linked Survey and Administrative Data to Assess the Quality of Each Contributing Data Source

Final candidate for the monograph

Dr Rupa Datta (NORC at University of Chicago) - Presenting Author
Ms Gabriel Ugarte (NORC at University of Chicago)
Mr Dean Resnick (NORC at University of Chicago)

This paper describes an effort to conduct address-level linkages of newly available restricted-use research data files from Zillow.com with data files from the nationally representative 2012 National Survey of Early Care and Education (NSECE). We use the linked administrative and survey data to learn about the quality of each – that is, what can the linked data tell us about the survey data quality and about the administrative data quality, and therefore about the usefulness of the linked data?

The NSECE sampled 100,000 housing units to conduct interviews with 11,629 households with children under age 13 in all 50 states and the District of Columbia. All available Zillow fields were linked to the survey data files, with the most recent record retained for each address. Nationally, 83 percent of NSECE sampled addresses matched to a Zillow record at the lot level (e.g., 123 Main St), with the lowest state-specific lot-level match being 59 percent. Unit-level matches (e.g., 123 Main St Apt 2B) were poorer, at 63 percent nationally and with two states falling below 40 percent of NSECE units matching to a Zillow unit. At both the unit and lot levels, match rates varied substantially by state.
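The distinction between lot-level and unit-level match rates can be sketched as follows. This is a minimal illustration under stated assumptions, not the linkage procedure used in the paper: the record layout (`state`, `street`, `unit`), the simple normalization rules, and all addresses are hypothetical.

```python
from collections import defaultdict


def lot_key(address):
    """Normalize to a lot-level key: the street address without any unit."""
    return " ".join(address["street"].upper().split())


def unit_key(address):
    """Normalize to a unit-level key: street address plus unit (empty if none)."""
    unit = (address.get("unit") or "").upper().replace(" ", "")
    return (lot_key(address), unit)


def match_rates(sample, admin):
    """Return {state: (lot_rate, unit_rate)} for the sampled addresses."""
    admin_lots = {(a["state"], lot_key(a)) for a in admin}
    admin_units = {(a["state"], unit_key(a)) for a in admin}
    counts = defaultdict(lambda: [0, 0, 0])  # state -> [sampled, lot hits, unit hits]
    for s in sample:
        tally = counts[s["state"]]
        tally[0] += 1
        tally[1] += (s["state"], lot_key(s)) in admin_lots
        tally[2] += (s["state"], unit_key(s)) in admin_units
    return {st: (lot / n, unit / n) for st, (n, lot, unit) in counts.items()}


# Hypothetical records, made up for illustration only.
SAMPLE = [
    {"state": "IL", "street": "123 Main St", "unit": "Apt 2B"},
    {"state": "IL", "street": "9 Oak Ave", "unit": None},
    {"state": "GA", "street": "50 Pine Rd", "unit": "Apt 1"},
    {"state": "GA", "street": "77 Elm St", "unit": None},
]
ADMIN = [
    {"state": "IL", "street": "123  main st", "unit": "Apt 3A"},
    {"state": "IL", "street": "9 Oak Ave", "unit": None},
    {"state": "GA", "street": "50 Pine Rd", "unit": "APT1"},
]

rates = match_rates(SAMPLE, ADMIN)
for state, (lot_rate, unit_rate) in sorted(rates.items()):
    print(state, f"lot={lot_rate:.0%}", f"unit={unit_rate:.0%}")
```

In this toy data the first Illinois address matches at the lot level but not at the unit level, which is the pattern that makes unit-level rates lower than lot-level rates.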

Zillow is a proprietary data and web services company that uses publicly available data to estimate the current market value of residential properties in the U.S. Data in the restricted-use research file are gathered from third-party vendors and public jurisdictions, and include real estate transactions, property tax records, dwelling characteristics, and other information.

We first assessed the coverage properties of the Zillow file by comparing the characteristics of the linked Zillow-NSECE questionnaire file with characteristics of the survey-only data file, for example, on survey data items such as self-reported household income, home ownership status, race/ethnicity, and household composition. We found that the linked file significantly over-represented high income households, which is consistent with Zillow records being more often available for single-family and owned homes vs multi-family or rental homes.
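One simple way to express this kind of coverage comparison is a ratio of a group's share in the linked file to its share in the full survey file. The sketch below is only illustrative: the field names, the `linked` flag, and the records are hypothetical, and it ignores survey weights and significance testing, both of which a real analysis would need.

```python
def share(records, field, value):
    """Proportion of records whose `field` equals `value`."""
    if not records:
        raise ValueError("no records to summarize")
    return sum(r[field] == value for r in records) / len(records)


def coverage_ratio(survey, field, value):
    """Share of `value` among linked records divided by its share in the full file.

    A ratio above 1 suggests the linked file over-represents that group.
    """
    linked = [r for r in survey if r["linked"]]
    return share(linked, field, value) / share(survey, field, value)


# Hypothetical unweighted survey records; "linked" marks a successful match.
SURVEY = [
    {"income": "high", "linked": True},
    {"income": "high", "linked": True},
    {"income": "high", "linked": False},
    {"income": "low", "linked": True},
    {"income": "low", "linked": False},
    {"income": "low", "linked": False},
]

ratio_high = coverage_ratio(SURVEY, "income", "high")
ratio_low = coverage_ratio(SURVEY, "income", "low")
print(f"high income: {ratio_high:.2f}, low income: {ratio_low:.2f}")
```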

We also examined the linked Zillow-NSECE sample file of all housing units that had been originally sampled for the NSECE, including those that were vacant or otherwise ineligible for screening, eligible for the survey but not interviewed, completed interviews, etc. We investigated sources of bias in the survey process, for example, whether or not addresses designated as ineligible for screening were more likely to be single-family homes or differed in market value from addresses identified as eligible for screening. We found almost no statistically significant patterns at any step in the data collection process among the Zillow variables.

The exercise was informative on how to use linked data to better understand the quality of each contributing data source and on the analytical potential of the linked data.


Leveraging Nontraditional Data to Improve Response Propensity Models and Design Tailored and Targeted Geographical Nonresponse Interventions

Dr Mary Mulry (U.S. Census Bureau) - Presenting Author
Ms Nancy Bates (U.S. Census Bureau)
Mr Matthew Virgile (U.S. Census Bureau)

This paper reports results from combining Big Data with traditional survey data sources, specifically the application of a dataset traditionally used for marketing purposes: Esri’s Tapestry lifestyle segments. We merge these marketing data with US Census Test data and a new metric used for identifying areas with hard-to-count populations (the Low Response Score). The merge produced a dataset suitable for studying relationships between actual census response, response propensity scores, and lifestyle segments. We use the merged dataset to examine whether lifestyle segments can provide insight into hard-to-survey populations, their response behavior, and interactions with social marketing communications. The paper also includes analyses with nationwide data that support the broader application of using segmentation variables in self-response propensity models and a discussion of potential applications of segment lifestyle information in tailored and targeted survey designs for hard-to-survey populations.

Tapestry is composed of lifestyle segments formed from aggregated datasets that include 5-year ACS estimates and the Decennial Census. Additionally, the segments use other consumer surveys that reveal lifestyle choices, such as purchasing behavior and how households spend their free time. Using lifestyles to segment audiences is commonplace as a means of targeting goods and services, but is not commonly applied in survey research. Using these geographic segments and the Low Response Score (LRS), we re-analyzed data from the 2015 Census Test conducted in Savannah, Georgia.

One goal of the research was to test whether Tapestry segments capture meaningful LRS and self-response variations and, if so, whether the segments might be applied for purposes of planning the 2020 Census communication campaign. A secondary goal was to test whether the addition of lifestyle variables might improve response propensity models, which are applied widely in survey research, including adaptive designs. In fact, we found a great deal of variation in LRS values between segments, and very high correlation between mean segment LRS and self-participation rates. We discuss the implications of these findings for planning the 2020 communications campaign, for response propensity models used in adaptive designs, and for tailored and targeted interventions to increase unit survey response.
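The segment-level comparison described here amounts to averaging the LRS and the self-response rate within each segment and then correlating the two sets of segment means. The sketch below shows that calculation under stated assumptions: the area records, segment labels, and values are invented for illustration, and the values are deliberately chosen so that a higher LRS goes with a lower response rate.

```python
import math
from collections import defaultdict


def segment_means(areas):
    """Average LRS and self-response rate within each lifestyle segment."""
    grouped = defaultdict(list)
    for area in areas:
        grouped[area["segment"]].append((area["lrs"], area["response_rate"]))
    return {
        seg: (
            sum(lrs for lrs, _ in vals) / len(vals),
            sum(rate for _, rate in vals) / len(vals),
        )
        for seg, vals in grouped.items()
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical small areas; segment names and values are made up.
AREAS = [
    {"segment": "Segment A", "lrs": 10.0, "response_rate": 0.80},
    {"segment": "Segment A", "lrs": 14.0, "response_rate": 0.76},
    {"segment": "Segment B", "lrs": 20.0, "response_rate": 0.65},
    {"segment": "Segment B", "lrs": 24.0, "response_rate": 0.61},
    {"segment": "Segment C", "lrs": 30.0, "response_rate": 0.50},
    {"segment": "Segment C", "lrs": 34.0, "response_rate": 0.44},
]

means = segment_means(AREAS)
segments = sorted(means)
r = pearson([means[s][0] for s in segments], [means[s][1] for s in segments])
print(f"correlation of segment means: {r:.3f}")
```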


Discussant for Organized Session Titled: Leveraging Big and Nontraditional Datasets to Reduce Burden, Increase Response, and Assess Survey Quality

Ms Nancy Bates (U.S. Census Bureau) - Presenting Author

Nancy Bates is the Senior Researcher for Survey Methods Research at the US Census Bureau. Nancy will serve as a discussant for the four papers in this organized session titled "Leveraging Big and Nontraditional Datasets to Reduce Burden, Increase Response, and Assess Survey Quality". Nancy is a Fellow of the American Statistical Association, Associate Editor for the Journal of Official Statistics, and Past-President of the Washington Statistical Society. Nancy will also serve as Chair for this session.