BigSurv18 program







Population Estimates for Small Geographic Areas: Can We Do a Better Job?

Chair: Dr Safaa Amer (RTI International)
Time: Friday 26th October, 11:30 - 13:00
Room: 40.035 S. Graus

Conducting surveys in developing countries is particularly challenging because many areas lack a complete sampling frame, have outdated census information, or have limited data available for designing and selecting a representative sample. With developments in GIS, remote sensing, and machine learning, tools have emerged to disaggregate and update the distribution of census data to support survey sampling methodology and improve population estimates. In this session, a mix of current developers and users of massive global population datasets will discuss the basis, scope, and limitations of these population estimates. We will also present tools and approaches for using these geo-referenced population estimates in complex household survey sampling (i.e., Geo-sampling and GridSample). Lastly, we will discuss challenges with assumptions about how population is distributed geographically within areas that are operationally relevant for conducting household surveys, along with approaches that help address them using deep neural network models and sample-based estimates.

Advances in Gridded Population Distribution Databases: LandScan HD

Final candidate for the monograph

Dr Amy Rose (Oak Ridge National Laboratory) - Presenting Author

Rapid advances in remote sensing, machine learning, and the ever-increasing availability of high-quality spatial datasets continue to provide opportunities to improve global gridded population distribution databases. These gridded population datasets demonstrate a considerable improvement over the conventional use of choropleth maps to represent population distribution and are essential for analysis and planning purposes, including humanitarian response, disease mapping, risk analysis, and evacuation modeling.

Taking advantage of these technological and methodological advancements, the LandScan population distribution project has continued to evolve over the last two decades. Using dasymetric mapping techniques to address spatial mismatch, the LandScan Global model adopts a “top-down” approach to population distribution by spatially disaggregating subnational census counts using ancillary datasets such as land cover, slope, proximity to roads, and settlement locations. Beyond this approach, it is also necessary to develop finer-resolution population distributions in areas of the world where subnational census data are coarse or non-existent. This is addressed through continued scaling of population, settlement, and “neighborhood” mapping algorithms using machine learning, computer vision, and spatiotemporal big data streams that exploit high-performance computing (HPC) infrastructure.

In particular, we are now taking advantage of the terabytes of very high-resolution satellite imagery collected every day, which have proven highly effective for generating accurate human settlement maps. This imagery allows potential characterization of population from settlement structures by exploiting image texture and spectral features. Our recent research utilizing machine learning and high-performance geocomputation has defined a new standard for rapid analysis of high-resolution imagery to map the spatial extent of human activity on Earth.

These settlement structures and accompanying neighborhood segmentations are now the foundation for LandScan HD, where modeling is tailored to the unique geography and data conditions of individual countries or regions by combining social, cultural, physiographic, and other information with novel geocomputation methods. We curate similarities among these geographic areas to leverage existing training data and machine learning algorithms and rapidly scale development.

LandScan HD adapts highly mature population modeling methods developed for LandScan Global, settlement mapping research and production in HPC environments, land use and neighborhood mapping through image segmentation, and facility-specific population density models. Adopting a flexible methodology to accommodate different geographic areas, LandScan HD accounts for the availability, completeness, and level of detail of relevant ancillary data. Beyond core population and mapped settlement inputs, these factors determine the model complexity for an area, so that for any given area a data-driven model may follow a simple top-down approach, a more detailed bottom-up approach, or a hybrid of the two.
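
The top-down disaggregation idea described in this abstract can be illustrated with a minimal sketch. This is not the LandScan model: the function name `disaggregate`, the cell identifiers, and the weights are hypothetical, and the weights are simply assumed to summarize ancillary data (land cover, slope, proximity to roads, settlement locations) as one non-negative number per grid cell.

```python
def disaggregate(census_total, cell_weights):
    """Allocate one administrative unit's census count across its grid cells
    in proportion to each cell's weight (a simple dasymetric allocation)."""
    total_weight = sum(cell_weights.values())
    if total_weight <= 0:
        raise ValueError("at least one cell must have a positive weight")
    return {cell: census_total * weight / total_weight
            for cell, weight in cell_weights.items()}


# Hypothetical weights: cell_a is treated as uninhabitable (weight 0),
# cell_c is assumed to be three times as likely to hold population as cell_b.
weights = {"cell_a": 0.0, "cell_b": 1.0, "cell_c": 3.0}
estimates = disaggregate(1000, weights)
```

By construction the cell estimates sum back to the census total, which is the defining property of a top-down approach; a bottom-up approach would instead build estimates from local observations without that constraint.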


GridSample.org: Generating Household Survey Sampling Units From Gridded Population Data

Final candidate for the monograph

Ms Dana Thomson (WorldPop, University of Southampton, and Flowminder Foundation) - Presenting Author


Household surveys are essential in the era of the Sustainable Development Goals (SDGs), in which all countries aim to achieve 17 goals by 2030, including zero poverty, whilst leaving no one behind. In low- and middle-income countries (LMICs), household surveys are often the only source of information about dozens of the 233 SDG indicators. Exclusion of the poorest and most vulnerable begins with the use of outdated or inaccurate census sample frames. The average survey is selected from seven-year-old census data, and census sample frames in at least 1 in 4 LMICs are not usable because they are more than 15 years old and/or major population displacement has occurred due to disaster or conflict. The continued use of census sample frames is a key contributor to biased survey results and limits the downstream survey design and implementation options.

We describe GridSample.org, a tool that supports the design of complex household surveys with gridded population data rather than census data. Gridded population datasets are either (a) generated from models that disaggregate census data to small grid squares (for example, WorldPop’s 100 m x 100 m datasets) based on dozens of satellite and geographic datasets, or (b) generated directly from satellite imagery in the absence of quality census data. GridSample.org is a free point-and-click sample selection tool that leverages existing gridded populations and other datasets, and supports one-stage or two-stage cluster designs, stratification, and oversampling in urban/rural areas. The user chooses whether to sample with probability proportional to size from grid squares of a specified geographic size (100 m x 100 m or larger), or from pre-defined areas built from grouped cells with a specified population total. Furthermore, the tool supports spatial oversampling, which may improve survey-based small area estimates. The output of GridSample.org is a shapefile and KML file of primary sampling units (i.e., clusters), which can be viewed in ArcGIS or Google Earth, and a pre-populated spreadsheet to calculate sample weights.
Gridded population sampling has been used in a number of household surveys since 2010. We show four approaches to implementation of gridded population samples that have been used by various teams in practice. We also recommend specific tools and protocols to increase the accuracy of gridded population samples (or census-based samples) in dynamic LMIC urban settings.
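
Sampling grid cells with probability proportional to size, as mentioned in this abstract, can be sketched as follows. This is a generic systematic PPS selection, not GridSample.org's implementation; the function name `pps_systematic`, the cell identifiers, and the population values are hypothetical.

```python
import random


def pps_systematic(populations, n, seed=None):
    """Systematic PPS selection of n grid cells.

    populations: list of (cell_id, estimated_population) pairs.
    Returns a list of (cell_id, inclusion_probability) pairs.
    A cell larger than the sampling interval can be hit more than once;
    this sketch does not treat such certainty units separately.
    """
    rng = random.Random(seed)
    total = sum(pop for _, pop in populations)
    interval = total / n
    start = rng.random() * interval          # random start in [0, interval)
    targets = [start + k * interval for k in range(n)]

    selected = []
    cumulative = 0.0
    t = 0
    for cell_id, pop in populations:
        cumulative += pop
        # Select this cell once for every target that falls inside its range.
        while t < n and targets[t] < cumulative:
            selected.append((cell_id, min(1.0, n * pop / total)))
            t += 1
    return selected


# Hypothetical gridded population estimates for four cells.
cells = [("g1", 120), ("g2", 30), ("g3", 250), ("g4", 100)]
sample = pps_systematic(cells, n=2, seed=1)
```

The base sampling weight for a selected cell is the reciprocal of its inclusion probability, which is the kind of quantity a sample-weight spreadsheet would carry forward.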


Residential Scene Classification for Geo-Sampling in Developing Countries Using Deep Convolutional Neural Networks on High-Resolution Satellite Imagery

Mr Robert Chew (RTI International) - Presenting Author
Dr Safaa Amer (RTI International)
Mr Kasey Jones (RTI International)
Ms Jennifer Unangst (RTI International)
Mr James Cajka (RTI International)
Ms Justine Allpress (RTI International)
Mr Mark Bruhn (Independent Researcher)


Nationally representative survey samples are needed for studies in low- and middle-income countries (LMICs) to support decision making in research areas ranging from international development to public health. To meet this demand, researchers are developing new methods that facilitate probability-based survey samples in developing countries at a reasonable cost. One such method, Geo-sampling, is a gridded population sampling method that uses geographic information systems (GIS) to partition areas of interest into logistically manageable grid cells for sampling. GIS grid cells are overlaid to partition a country's existing administrative boundaries into area units that vary in size from 50 m x 50 m to 150 m x 150 m. These smaller grid units are then sampled, and interviewers are sent to the selected areas to contact household respondents.
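
The grid overlay step can be pictured with a small sketch that tiles a rectangular bounding box with square cells. It assumes coordinates in metres in a projected coordinate system and uses a hypothetical function name, `make_grid`; clipping cells to real administrative boundaries would need a GIS library and is not shown.

```python
import math


def make_grid(min_x, min_y, max_x, max_y, cell_size):
    """Tile a bounding box with square cells of side cell_size.

    Returns a list of (col, row, x0, y0, x1, y1) tuples, ordered row by row.
    Cells on the upper and right edges may extend past the bounding box.
    """
    n_cols = math.ceil((max_x - min_x) / cell_size)
    n_rows = math.ceil((max_y - min_y) / cell_size)
    cells = []
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = min_x + col * cell_size
            y0 = min_y + row * cell_size
            cells.append((col, row, x0, y0, x0 + cell_size, y0 + cell_size))
    return cells


# A hypothetical 1000 m x 500 m area split into 100 m x 100 m cells.
grid = make_grid(0, 0, 1000, 500, 100)
```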

To avoid the costly procedure of sending interviewers to unoccupied areas, researchers manually classify grid cells as "residential" or "nonresidential" through visual inspection of aerial images prior to sending interviewers to the field. "Nonresidential" units are excluded from sampling and data collection. This process of manually classifying sampling units has drawbacks since it is labor intensive, prone to human error, and creates the need for simplifying assumptions during calculation of design-based sampling weights. In this paper, we discuss the development of a deep learning classification model to predict whether aerial images are residential or nonresidential, thus reducing manual labor and eliminating the need for simplifying assumptions.

On our test sets, the model performs comparably to a human-level baseline in both Nigeria (94.5% accuracy) and Guatemala (96.4% accuracy), and outperforms baseline machine learning models trained on crowdsourced or remote-sensed geospatial features. Additionally, our findings suggest that this approach can work well in new areas with relatively modest amounts of training data.

Gridded population sampling methods like geo-sampling are becoming increasingly popular in countries with outdated or inaccurate census data because of their timeliness, flexibility, and cost. Using deep learning models directly on satellite images, we provide a novel method for sample frame construction that identifies residential gridded aerial units. In cases where manual classification of satellite images is used to (1) correct for errors in gridded population data sets or (2) classify grids where population estimates are unavailable, this methodology can help reduce annotation burden with comparable quality to human analysts.


Household Detection Within Gridded Population Area Units: Producing Small Area Population Estimates in Geo-Sampling

Final candidate for the monograph

Dr Safaa Amer (RTI International) - Presenting Author
Mr James Cajka (RTI International)
Mr Rob Chew (RTI International)
Mr Kasey Jones (RTI International)
Ms Jennifer Unangst (RTI International)
Ms Justine Allpress (RTI International)

While gridded population data sets use information from aerial imagery to estimate population, the distribution of the population estimate within a grid unit is not necessarily uniform (i.e., the modifiable areal unit problem (MAUP)). To create units that are operationally relevant for conducting household surveys in LMICs, population estimates and household counts are sometimes needed for areas smaller than those typically found in gridded population datasets with global coverage. To further enhance gridded population sampling methods, machine learning models designed for object detection tasks are being developed to identify and count the number of structures per grid cell, with the goal of using these counts as a proxy for soft population estimates. Sampling buildings within smaller grid cells, rather than covering all households within the original sampled cells, will lead to lower levels of clustering. In this case, the size of the cluster is estimated from the number of buildings detected within each grid cell, the approximate building footprint of the detected structures, and assumptions about the number of households per building, allowing for the use of probability proportional to size and enhanced calculation of weights.
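
As a rough illustration of how detected building counts could feed probability proportional to size selection and weight calculation, the sketch below derives a measure of size per cell and a two-stage design weight. It is not the authors' procedure: the function names, the building counts, and the assumed 1.5 households per building are all hypothetical, and the building footprint term mentioned above is left out for brevity.

```python
def measure_of_size(n_buildings, households_per_building):
    """Estimated households in a cell: detected buildings times an assumed
    number of households per building."""
    return n_buildings * households_per_building


def design_weight(cell_size, total_size, n_cells_sampled,
                  households_listed, households_sampled):
    """Two-stage design weight: the inverse of the PPS cell selection
    probability times the within-cell household selection probability."""
    p_cell = n_cells_sampled * cell_size / total_size
    if p_cell > 1:
        raise ValueError("cell is a certainty unit; handle it separately")
    p_household = households_sampled / households_listed
    return 1.0 / (p_cell * p_household)


# Hypothetical detected building counts for three grid cells.
building_counts = {"c1": 40, "c2": 10, "c3": 50}
sizes = {cell: measure_of_size(count, 1.5)
         for cell, count in building_counts.items()}
total = sum(sizes.values())

# One cell sampled with PPS; 10 of the 50 households listed in c1 interviewed.
weight_c1 = design_weight(sizes["c1"], total, n_cells_sampled=1,
                          households_listed=50, households_sampled=10)
```

Here cell c1 has a selection probability of 60 / 150 = 0.4 and a within-cell sampling fraction of 10 / 50 = 0.2, so each interviewed household carries a weight of roughly 12.5.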

As a case study, we will implement this model in an area of Nigeria previously surveyed using the Geo-sampling methodology. Our approach will be evaluated in two ways: (1) to confirm that we counted buildings correctly, we will manually label validation images and test our accuracy, and (2) we will assess whether the number of buildings in a PSU correlates with the area’s population, using PSU-level population estimates provided by LandScan. In this paper, we will highlight the challenges and limitations faced while trying to produce small geographic area population estimates and discuss how we can use these estimates to enhance Geo-sampling. We will also discuss the implications for sampling weights and population estimates.