BigSurv18 program







Applying Machine Learning and Automation to Improve Imputation - Replicate II

Chair: Dr Mansour Fahimi (GfK)
Time: Saturday 27th October, 16:00 - 17:30
Room: 40.010

AI and Machine Learning Derived Efficiencies for Large Scale Survey Estimation Efforts

Final candidate for the monograph

Dr Steven Cohen (RTI International) - Presenting Author
Dr Jamie Shorey (RTI International)


A high degree of statistical rigor is essential to the integrity of the “end-product” analytic resources used to inform policy and action. In this vein, statistical and analytic staff devote substantial time and effort to estimation and the associated imputation tasks, which are essential components of the analytic databases derived from national or sub-national surveys and related data collections. These efforts require a substantial commitment of project funds, and significant lag times often exist between the completion of data collection and the release of the final analytic data file. This presentation focuses on the development and implementation of artificial intelligence (AI) and machine learning enhanced approaches to imputation for specific national survey efforts that achieve efficiencies in cost and time while satisfying well-defined accuracy requirements that preserve data integrity. Attention is given to enhanced processes that: serve as an alternative to manual, repetitive, or time-intensive tasks; operationalize decisions based on predefined outcome preferences and on access to input data that sufficiently informs those decisions; and facilitate real-time interpretation of, and interaction with, the AI-derived decisions so that the user can focus on higher-order thinking and problem resolution. Our approach frames the prediction of criterion variables and their distributions as a multi-task learning (MTL) problem; MTL jointly solves multiple learning tasks by exploiting the correlation structure across tasks. Consideration is also given to random forest methods, which use an ensemble of decision trees to generate predictions.
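One simple way to combine the multi-task and tree-ensemble ideas described above is a multi-output random forest, in which a single forest predicts several correlated expenditure targets jointly. The sketch below is illustrative only, using synthetic stand-in data; it is not the authors' implementation, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for MEPS-like data: X holds covariates available early
# in processing; Y holds several correlated, skewed expenditure targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
B = rng.normal(size=(10, 3))
Y = np.exp(0.3 * (X @ B) + rng.normal(scale=0.5, size=(5000, 3)))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# A single random forest fitted to all targets at once shares its tree
# structure across tasks -- one way to exploit the correlation between them.
forest = RandomForestRegressor(n_estimators=300, random_state=0)
forest.fit(X_tr, Y_tr)
print("average held-out R^2 across targets:", round(forest.score(X_te, Y_te), 3))
```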

Examples are provided with applications to national survey efforts, including the Medical Expenditure Panel Survey (MEPS). MEPS is a large-scale annual longitudinal national survey that collects data on health care use, expenditures, sources of payment, and insurance coverage for the U.S. civilian noninstitutionalized population. This research effort focuses on harnessing AI/ML techniques to yield MEPS expenditure data and estimates closely aligned with the actual results, which require several months to produce and are provided in the MEPS final analytic files. The methods' performance is evaluated against the medical expenditure data sets released as public use files, which serve as the reference standard in the evaluation phase of this study.


Sequential Imputation of Missing Data in High-Dimensional Data Sets

Mr Micha Fischer (University of Michigan) - Presenting Author


Multiple imputation with sequential regression models is often used to impute missing values in data sets and leads to unbiased results if the data are missing at random and the models are correctly specified. However, in data sets where many variables are affected by missing values, proper specification of the sequential regression models can be burdensome and time consuming, as a separate model needs to be developed for every variable by a human imputer.

Even available software packages for automated imputation procedures (e.g. MICE) require a model specification for each variable containing missing values. Additionally, relying on their default models can lead to imputation bias when variables are non-normally distributed, as the sketch below illustrates.
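As a concrete illustration of the default-model issue, the sketch below uses scikit-learn's IterativeImputer, an implementation in the spirit of MICE whose default per-variable model is a linear (BayesianRidge) regression; for a strongly skewed variable this default can produce implausible imputations. The data and variable names are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)
x2 = np.exp(x1 + rng.normal(scale=0.5, size=5000))    # right-skewed, strictly positive
data = np.column_stack([x1, x2])
data[rng.random(5000) < 0.3, 1] = np.nan              # 30% of x2 missing

# The default per-variable model is a linear BayesianRidge regression; for the
# skewed x2 it can impute values outside the observed (positive) range.
imputed = IterativeImputer(random_state=0).fit_transform(data)
print("min observed x2:", round(np.nanmin(data[:, 1]), 2),
      "min imputed x2:", round(imputed[:, 1].min(), 2))
```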

This research aims to automate sequential imputation of missing values in high-dimensional data sets while accounting for potential non-normality in the data. The proposed algorithm performs model specification by selecting the best imputation model from several parametric and non-parametric candidates in each step of the sequential imputation procedure. The best imputation model for an outcome variable is the one that achieves the highest similarity between imputed and observed values after conditioning on the response propensity score for that variable.
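A highly simplified sketch of this selection step is given below: for a single outcome variable, a response propensity score is estimated, candidate models are fitted to the observed cases, and the model whose imputations are most similar to the observed values within propensity strata (here measured with a Kolmogorov-Smirnov distance, one possible choice) is retained. This is not the proposed algorithm itself; the candidate set, similarity measure, and data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor

def choose_imputation_model(X, y, candidates, n_strata=5):
    """Pick the candidate whose imputed values look most like the observed
    values within response-propensity strata (simplified selection criterion)."""
    miss = np.isnan(y)
    # Response propensity: P(y observed | X), estimated by logistic regression.
    prop = LogisticRegression(max_iter=1000).fit(X, ~miss).predict_proba(X)[:, 1]
    cuts = np.quantile(prop, np.linspace(0, 1, n_strata + 1)[1:-1])
    strata = np.digitize(prop, cuts)

    best = (None, np.inf, None)
    for name, model in candidates.items():
        model.fit(X[~miss], y[~miss])
        y_hat = model.predict(X[miss])
        # Discrepancy: average KS distance between imputed and observed values,
        # computed stratum by stratum.
        dists = []
        for s in np.unique(strata):
            obs = y[~miss & (strata == s)]
            imp = y_hat[strata[miss] == s]
            if len(obs) > 1 and len(imp) > 1:
                dists.append(ks_2samp(obs, imp).statistic)
        score = np.mean(dists) if dists else np.inf
        if score < best[1]:
            best = (name, score, y_hat)
    return best  # (model name, discrepancy, imputed values for the missing rows)

# Toy usage: a skewed outcome with covariate-dependent missingness.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
y = np.exp(1 + X[:, 0] + 0.5 * rng.normal(size=2000))
y[rng.random(2000) < 1 / (1 + np.exp(-X[:, 1]))] = np.nan

candidates = {"linear": LinearRegression(),
              "forest": RandomForestRegressor(n_estimators=200, random_state=0)}
name, dist, imputed = choose_imputation_model(X, y, candidates)
print(name, round(dist, 3))
```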

A simulation study investigates in which situations this automated procedure can outperform the usual approaches (e.g. MICE, IVEware). This method is also assessed on survey data linked to administrative records by comparing results from imputed survey data with gold standard estimates from complete administrative data sources.


Approximate Nearest Neighbour Imputation

Dr Maciej Beręsewicz (Poznań University of Economics and Business / Statistical Office in Poznań) - Presenting Author
Mr Tomasz Hinc (Poznań University of Economics and Business / Statistical Office in Poznań)


No data source is ideal, and the problem of missing data is one of the most common challenges that need to be addressed. Censuses, surveys, administrative records as well as new data sources, such as the Internet or big data, suffer from unit or item non-response. One popular method designed to create a complete dataset is k-nearest neighbor (KNN) imputation, in which a vector of auxiliary variables is used to determine the nearest neighbor. The nearest neighbor is then used as a donor for hot deck imputation.
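For concreteness, a minimal sketch of KNN hot-deck imputation along these lines is shown below (synthetic data and hypothetical variable names, not the authors' code): the k nearest donors are found in the space of auxiliary variables and the observed value of one randomly drawn donor is transferred to the recipient.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Donors: units with the target item observed; recipients: units with it missing.
donor_aux = rng.normal(size=(10000, 4))        # auxiliary variables (e.g. age, income)
donor_y = rng.gamma(2.0, 10.0, size=10000)     # observed target values
recipient_aux = rng.normal(size=(300, 4))

# Find the k nearest donors in auxiliary-variable space and draw one donor at
# random from the k candidates; its observed value is transferred (hot deck).
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(donor_aux)
_, idx = nn.kneighbors(recipient_aux)
chosen = idx[np.arange(len(idx)), rng.integers(0, k, size=len(idx))]
imputed = donor_y[chosen]
```

In practice this step would typically be repeated separately within imputation classes, as the next paragraph notes.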

Currently used KNN imputation methods involve ways of determining the distance between a unit with missing data and similar units, and of imputing missing values from a selected set of k donors. Often, for computational reasons, imputation classes are created based on sex, age group or locality, and imputation is then performed within these classes.

The massive size of new data sources requires computationally efficient techniques, and most methods currently used for imputation are unfortunately not suitable at this scale. It is therefore necessary to turn to alternative methods, one of which is approximate nearest neighbours (ANN). ANN can be regarded as a family of heuristics that transform the underlying data structure in order to find the units closest to a given unit within a certain radius, together with a set of metrics for quantifying the degree of similarity between them.
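The sketch below replaces the exact donor search with an approximate one, assuming the Annoy library (a random-projection-tree ANN index) as one possible implementation; the library choice, parameters, and data are illustrative assumptions rather than part of the paper.

```python
import numpy as np
from annoy import AnnoyIndex   # assumption: the Annoy ANN library is available

rng = np.random.default_rng(0)
donor_aux = rng.normal(size=(200_000, 8))        # auxiliary variables, large donor pool
donor_y = rng.gamma(2.0, 10.0, size=200_000)     # observed values of the target item
recipient_aux = rng.normal(size=(500, 8))

# Build a forest of random-projection trees; the search is approximate, which
# keeps it feasible when exact k-NN over the full donor pool is too costly.
index = AnnoyIndex(donor_aux.shape[1], "euclidean")
for i, v in enumerate(donor_aux):
    index.add_item(i, v.tolist())
index.build(20)                                  # more trees -> higher accuracy, slower build

# Hot-deck imputation from the (approximately) nearest donor.
imputed = np.array([donor_y[index.get_nns_by_vector(v.tolist(), 1)[0]]
                    for v in recipient_aux])
```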

However, the ANN method is not used in survey practice, even though nearest neighbor imputation is a popular technique for handling item non-response in survey sampling. One possible reason is that these methods are new to statisticians; moreover, the properties of estimators based on data imputed by means of ANN have not been analysed. In this article, we seek to fill these two gaps.

In particular, we review state-of-the-art implementations of ANN algorithms and imputation methods based on k-nearest neighbours. Further, we analyse the properties of ANN-based imputation by means of simulation studies based on real data sets.