BigSurv18 program







New Approaches to Augment Sampling Frames II: Leveraging Data Science Methods for Sample Frame Construction

Chair: Dr David Dutwin (SSRS)
Time: Saturday 27th October, 16:00 - 17:30
Room: 40.008

Using Big Data to Improve Sampling Efficiency

Final candidate for the monograph

Ms Jamie Ridenhour (RTI International) - Presenting Author
Mr Joseph McMichael (RTI International)
Dr Karol Krotki (RTI International)
Mr Howard Speizer (RTI International)


The US Coast Guard National Recreation Boating Survey is designed to measure recreational boating in order to develop water safety exposure estimates for all 50 states, Puerto Rico, and the District of Columbia. In the past, the survey relied on state boat registries as its core sampling frame, which may have underrepresented exposure reports from owners of unregistered boats (e.g., canoes, kayaks, paddleboards). To mitigate this potential underrepresentation, we augment the registry frame with an address-based sample (ABS) from RTI’s Enhanced Frame, which contains all mail delivery points in the US along with auxiliary data from commercial vendors. To increase the efficiency of the ABS frame, we developed an eligibility propensity model based on patterns in auxiliary data sources, including registration data and 10+ million transaction-based records from commercial database sources. In this presentation we discuss combining RTI’s Enhanced ABS Frame, the registry frames, and the commercial databases to maximize the incidence rate while keeping the design effect reasonable and survey costs contained. We will discuss the various models used to identify our target population and how we leveraged the external data sources to improve the efficiency of our sample design.
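The trade-off between incidence rate and design effect can be illustrated with a minimal sketch. It is not the authors' model: the two propensity strata, their frame counts, and their eligibility rates are hypothetical numbers, and the design effect is Kish's approximation for unequal weighting, computed over the expected eligible addresses.

```python
def allocation_summary(strata, sample_sizes):
    """Expected incidence and Kish weighting design effect for one allocation.

    strata: list of dicts with 'frame_count' (addresses on the frame) and
            'p_eligible' (assumed share of eligible addresses; hypothetical).
    sample_sizes: number of addresses sampled from each stratum.
    """
    eligible = []  # expected eligible addresses found in each stratum
    weights = []   # base weight = frame count / sample size
    for stratum, n_h in zip(strata, sample_sizes):
        eligible.append(n_h * stratum["p_eligible"])
        weights.append(stratum["frame_count"] / n_h)

    total_eligible = sum(eligible)
    incidence = total_eligible / sum(sample_sizes)

    # Kish approximation: deff = m * sum(w^2) / (sum(w))^2 over eligible cases
    sum_w = sum(m * w for m, w in zip(eligible, weights))
    sum_w2 = sum(m * w * w for m, w in zip(eligible, weights))
    deff = total_eligible * sum_w2 / sum_w ** 2
    return incidence, deff


# Hypothetical strata formed from a predicted eligibility propensity score
strata = [
    {"name": "high propensity", "frame_count": 1000, "p_eligible": 0.50},
    {"name": "low propensity", "frame_count": 9000, "p_eligible": 0.05},
]

proportional = allocation_summary(strata, [100, 900])
oversampled = allocation_summary(strata, [500, 500])
print("proportional:", proportional)
print("oversampled: ", oversampled)
```

With these made-up inputs, oversampling the high-propensity stratum raises the expected incidence from about 0.095 to 0.275, while the weighting design effect grows from 1.0 to roughly 2.8, which is the tension the abstract describes.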


Machine Made Sampling Designs: Applying Machine Learning Methods for Generating Stratified Sampling Designs

Dr Trent Buskirk (Center for Survey Research, UMass Boston) - Presenting Author
Dr Todd Bear (University of Pittsburgh)
Mr Jeffrey Bareham (Marketing Systems Group)


Advances in technology and the growth of big data streams give researchers more opportunities to access information about members of diverse populations. But these same technologies have also created new taxonomies and subpopulations that are themselves of interest. For example, researchers looking to use a single cell RDD sampling frame are now interested in the population defined by its lack of adoption of cellular telephones. This so-called “landline only” population has been shown to differ on key health-related outcomes and tends to be older, and such differences may result in coverage biases for cell RDD designs. Health researchers are also interested in more specific subpopulations defined by health risks or by geographic boundaries related to disease outbreaks. In any case, constructing sampling frames that sufficiently track these new subpopulations of interest continues to be an active area of survey research. In terms of population coverage and the ability to link auxiliary information to sampling units in order to cover various subpopulations sufficiently, ABS samples offer the most flexibility and the highest initial potential. Extending this to telephone sampling designs is much more challenging, as general sampling frames of telephone numbers typically have less auxiliary information available. While a telephone number contains some high-level geographic information via the area code, the specificity of this information varies by whether or not the number is a landline. If levels of geography smaller than state or city are required, then one must go beyond the area code.

In this study we use an unsupervised machine learning method to develop a sampling design for a study of health-related risk factors among adult residents of Allegheny County, Pennsylvania. The study design called for stratifying the landline RDD sample by up to 13 relevant health districts whose known boundaries were not at all congruent with the telephone number banks (e.g., 1000-banks) within the landline frame. To inform the sampling design and to attend to the required geographic specificity, we leveraged information available for assigned landline telephone numbers within each of the 2,410 1000-banks. More specifically, for each 1000-bank, we determined the proportion of its assigned landline numbers that fell within each of the 13 health district boundaries, represented as a 13-dimensional vector. These vectors were then processed using a k-means clustering algorithm to determine groupings of 1000-banks that had similar geographic profiles with respect to the 13 health districts. Results indicated that grouping 1000-banks into two distinct clusters produced the least overlap. The corresponding sampling design then resulted in two strata: the first consisted of 1000-banks that covered 4 of the health districts and the second contained those 1000-banks that covered the remaining 9 health districts. In this presentation we discuss the approach and the translation from k-means clustering to stratification. We also explore the relationship between cluster overlap and coverage and sampling efficiency using actual location data collected.
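The step from district-share vectors to strata can be sketched as follows. This is a toy illustration under stated assumptions, not the study's implementation: the 50 banks are synthetic, each bank's shares are spread almost evenly over either the first 4 or the remaining 9 districts, and the centroids are seeded with a simple farthest-point rule (an assumption made here so the result is deterministic).

```python
import random

N_DISTRICTS = 13


def sq_dist(a, b):
    """Squared Euclidean distance between two share vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def kmeans(vectors, k, max_iter=100):
    """Plain Lloyd's algorithm with deterministic farthest-point seeding."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # next seed: the vector farthest from all centroids chosen so far
        centroids.append(list(max(
            vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels, centroids


def synthetic_bank(rng, districts):
    """Hypothetical 1000-bank whose numbers fall only in `districts`."""
    raw = [0.0] * N_DISTRICTS
    for d in districts:
        raw[d] = 1.0 + rng.uniform(0.0, 0.2)
    total = sum(raw)
    return [x / total for x in raw]  # shares sum to 1


rng = random.Random(2018)
banks = ([synthetic_bank(rng, range(0, 4)) for _ in range(20)]
         + [synthetic_bank(rng, range(4, 13)) for _ in range(30)])

labels, centroids = kmeans(banks, k=2)

# Treat each cluster as a stratum and list the districts it mainly covers
covered = [sorted(d for d, share in enumerate(c) if share > 0.05)
           for c in centroids]
print(covered)
```

Each cluster label then serves as the stratum identifier for the 1000-banks assigned to it. Real banks will straddle districts far less cleanly than this synthetic data, which is where the overlap between clusters mentioned above becomes relevant.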


Machine Made Sampling Frames: Creating Sampling Frames of Windmills and Other Non-Traditional Sampling Units Using Machine Learning with Neural Networks

Dr Adam Eck (Oberlin College) - Presenting Author
Dr Trent Buskirk (University of Massachusetts-Boston)
Dr Kenneth Fletcher (University of Massachusetts-Boston)
Mr Peter Stefek (Oberlin College)
Ms Han Shao (Oberlin College)
Mr Ki Park (University of Northern Iowa)
Dr Mary Losch (University of Northern Iowa)

Survey researchers rely on sample frames that cover the population in order to perform probability-based sampling. These sampling frames are typically available from third party vendors in the form of addresses that cover housing units, phone numbers that cover individuals, or email rosters that cover employees. However, not every population of interest can be covered with existing frames. Certainly with diversifying interests in surveying non-traditional populations—places, specific subsets of people that are not easily identifiable within existing sampling methods or frames, and sensors in the internet of things producing organic data—more effort is needed to identify plausible and appropriate sampling frames.

In some cases, population enumeration may be an extension of current frame creation methods, but in other cases, new methods for efficient enumeration are needed. For example, to generate an estimate of clean energy created via windmills within a given geographic region (e.g., a state), a sampling frame might first be constructed by field enumeration methods or compiling multiple public record files related to energy consumption, farm ownership and windmill purchases. However, since windmills are owned by a mix of public and private parties, it is unlikely that publicly available sources could be used to create a viable sampling frame. Moreover, since windmills are scattered throughout the entire state, traditional field enumeration methods may prove accurate but too expensive to be viable for the estimation task at hand.

In this research we explore new opportunities for applying big data, machine learning and other data science methods to create sampling frames of places, objects, and other collections of non-traditional sampling units. More specifically, we develop an application that systematically selects a random sample of satellite or aerial images from a desired geographic region and then processes those images using neural networks to identify the presence of sampling units of interest, such as windmills, playgrounds, or park benches. The user specifies the radius of image capture around a given location and the application proceeds to generate samples of images across the specified geography. The generated images are then systematically processed by convolutional neural network models to provide an indication of the number and location of sampling units of interest present in the images. The application then gathers this information and produces a sampling frame. As a byproduct, population estimates for the number of sampling units of interest in a given area can also be produced. This paper presents the overall application and an evaluation of its accuracy for creating a sampling frame of windmills for the state of Iowa as a case study. We also explore how the final coverage rate of the sampling frame relates to the misclassification error rates of the neural network models and to the user-specified sampling interval. Finally, we discuss additional uses and planned sampling enhancements for the application.
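The sampling and frame-building loop described above might be sketched roughly as follows. The sketch makes several simplifying assumptions that are not from the abstract: the region is divided into a rectangular grid of square image tiles rather than circles of a given radius, the neural network is replaced by a hypothetical lookup table (`stub_detector`), and detections are assumed to be error-free.

```python
import random


def systematic_tile_sample(n_rows, n_cols, interval, seed=0):
    """Select every `interval`-th tile in both directions from a random start.

    Each tile is selected with probability 1 / interval**2.
    """
    rng = random.Random(seed)
    row_start = rng.randrange(interval)
    col_start = rng.randrange(interval)
    return [(r, c)
            for r in range(row_start, n_rows, interval)
            for c in range(col_start, n_cols, interval)]


def build_frame(tiles, detect):
    """Run the detector on each sampled tile and list the units it finds."""
    frame = []
    for tile in tiles:
        for unit in detect(tile):
            frame.append({"tile": tile, "unit": unit})
    return frame


def estimate_total(frame, interval):
    """Expansion estimate of the number of units in the whole region."""
    return len(frame) * interval ** 2


# Hypothetical ground truth standing in for the CNN's detections
HYPOTHETICAL_UNITS = {
    (0, 1): ["wm-1"],
    (2, 3): ["wm-2", "wm-3"],
    (5, 5): ["wm-4"],
}


def stub_detector(tile):
    """Placeholder for a trained image classifier (assumption)."""
    return HYPOTHETICAL_UNITS.get(tile, [])


tiles = systematic_tile_sample(n_rows=6, n_cols=6, interval=2, seed=1)
frame = build_frame(tiles, stub_detector)
print(len(tiles), "tiles sampled;", len(frame), "units listed;",
      "estimated total:", estimate_total(frame, interval=2))
```

With an interval of 1 every tile is processed and the frame is a full enumeration; larger intervals reduce the number of images to process at the cost of coverage, and a real detector would add misclassification error on top of that.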


The View From Above - Virtual Listing Using GIS

Ms Michelle Amsbary (Westat)
Mr Richard Dulaney (Westat) - Presenting Author

On-the-ground field listing techniques have traditionally been used to generate sample frames for surveys in which a list of the population is not available or is determined to have inadequate coverage. However, traditional listing is costly, time-consuming and error-prone, as many in the survey research field have pointed out. In this paper, we explore virtual listing using GIS data as an alternative to traditional field listing, in order to construct a within-segment frame of commercial buildings.

Challenges posed by traditional listing – one or two listers physically canvassing an area to identify or validate buildings and addresses – are well-documented in the literature. Listing commercial buildings introduces challenges to field work not often encountered with residential listing. Examples include the need to determine building characteristics such as size or principal usage, or the need to include non-visible sampling criteria such as floor-to-ceiling walls and internal egresses. In some cases, a building may be part of a larger campus, such as a university or a hospital. In other cases a single building such as a strip mall may contain multiple establishments. Ambiguities in listing can create such challenges for data collection as the need to create or merge cases in the field, and can contribute to total survey error.

As the availability and strength of GIS tools and data increase, the opportunities to leverage big data to improve survey efficiency grow as well. We discuss the use of various geospatial resources, such as those provided by Google and other commercial vendors, to develop a Virtual Listing System (VLS) to list commercial buildings as part of frame construction for the 2018 Commercial Buildings Energy Consumption Survey (CBECS). Virtual listing uses a custom website that displays project geography (e.g., sampled PSUs, segments, Census tracts and blocks), satellite imagery, and maps, allowing listers to canvass a segment remotely. Listers identify a building, research it within the VLS and through Internet sources, determine eligibility, and create a building footprint by tracing the outline of the building’s roof. Locations and attributes of all listed buildings are stored within the VLS.

Virtual listing techniques offer several advantages over traditional field listing, including access to StreetView and 3D view, and ultimately afford greater efficiency than field listing. Augmenting the imagery with GIS layers such as point of interest (POI) data gives the virtual lister access to information about locations (e.g., businesses, restaurants, churches, stores, hospitals, schools), such as names, addresses, websites, and photographs, which helps the lister determine the type of building.

We will describe the Virtual Listing System and demonstrate several key features. Because this is the first use of the GIS technology for listing at Westat, and because accurate and comprehensive listing is critical to project success, we are undertaking a direct comparison of virtual and traditional on-the-ground listing in a subset of 50 CBECS segments. We will present analysis of the quality, cost and efficiency of each.