BigSurv18 program


Are the Machines Making the Mark? Applications to Compare Traditional and Machine Learning Models

Chair: Dr Curtis Signorino (University of Rochester)
Time: Friday 26th October, 09:45 - 11:15
Room: 40.035 S. Graus

Empirical Comparison of Time Series Data Mining Algorithms

Dr Sakinat Folorunso (Olabisi Onabanjo University) - Presenting Author
Dr Abass Taiwo (Olabisi Onabanjo University)
Dr Timothy Olatayo (Olabisi Onabanjo University)

A time series is a sequence of observations ordered in time. Data mining is a well-researched subfield of computer science. Time series data mining applies the principles and techniques of data mining to the analysis of time series.

The aim of this research is to apply data mining techniques to forecasting time series data. Monthly electric power consumption in Nigeria from January 2001 to September 2017 is considered. Experiments are conducted with four data mining techniques: Random Regression Forest (RRF), Linear Regression (LR), Support Vector Regression (SVR) and Artificial Neural Network (ANN), evaluated on their forecasting errors: Mean Squared Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE) and prediction accuracy. These regressors were chosen to perform regression analysis and make predictions for particular values of the data set.

All experiments were performed in the open-source data mining suite Waikato Environment for Knowledge Analysis (WEKA).

For each regressor, the combination of parameters yielding the best results on the predefined performance criteria was chosen as optimal. A comparative analysis of the regressors' performance was then conducted.

All tested regressors demonstrated their best prediction quality over short time horizons. SVR produced the best results in terms of both error values and computation time.
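The authors ran their comparison in WEKA; as a rough illustration of the same workflow, the sketch below builds lagged features from a synthetic monthly series (a stand-in for the power consumption data, which is not reproduced here), fits the four regressor types with scikit-learn, and reports MSE, RMSE and MAPE on a chronological hold-out. All parameter settings are illustrative, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

# Synthetic monthly series (trend + annual seasonality + noise) standing in
# for the Nigerian power-consumption data
rng = np.random.default_rng(0)
t = np.arange(201)
series = 50 + 0.1 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)

# Frame forecasting as regression on the previous 12 observations
lags = 12
X = np.array([series[i - lags:i] for i in range(lags, series.size)])
y = series[lags:]
split = int(0.8 * len(y))  # chronological train/test split, no shuffling
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

models = {
    "LR": LinearRegression(),
    "RRF": RandomForestRegressor(n_estimators=200, random_state=0),
    "SVR": SVR(C=10.0),
    "ANN": MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mse = mean_squared_error(y_te, pred)
    results[name] = {"MSE": mse, "RMSE": np.sqrt(mse),
                     "MAPE": mean_absolute_percentage_error(y_te, pred)}
for name, errs in results.items():
    print(name, {k: round(v, 3) for k, v in errs.items()})
```

On real data, the per-model parameter search the abstract describes would be wrapped around the `fit` call (e.g. with a grid search over each regressor's settings).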


Real-Time Estimation of Unemployment With Dynamic Factor and Time-Varying State Space Models

Professor Franz Palm (Maastricht University)
Dr Stephan Smeekes (Maastricht University)
Dr Jan van den Brakel (Statistics Netherlands)
Ms Caterina Schiavoni (Maastricht University) - Presenting Author

Estimation of unobserved components is considered in high-dimensional state space models using a dynamic factor approach. Our method allows variables to be observed at different frequencies and updates the estimates when new information becomes available. In addition, we account for potential time variation in the parameters of the model. We apply the methodology to unemployment estimation as done by Statistics Netherlands, which uses a multivariate state space model to produce monthly figures for the unemployed labour force from series observed with the Labour Force Survey (LFS). We extend the model by including auxiliary series on job search behaviour from Google Trends and claimant counts, partially observed at higher frequencies. Our factor model allows for nowcasting the variable of interest, providing unemployment estimates in real time before LFS data become available. In addition, our method accounts for time-varying correlations between the LFS and auxiliary series.
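The class of models described can be sketched in generic linear Gaussian state space form (notation here is illustrative, not the authors'):

```latex
% Observation equation: observed series loaded on latent factors
y_t = \Lambda_t f_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \Sigma_\varepsilon)
% Transition equation: latent factors evolve over time
f_t = \Phi f_{t-1} + \eta_t, \qquad \eta_t \sim N(0, \Sigma_\eta)
```

Here $y_t$ would stack the LFS, Google Trends and claimant-count series (with missing entries for series not observed at time $t$), $f_t$ the latent factors including the unemployment component, and time-varying $\Lambda_t$ the changing correlations with the auxiliary series; Kalman filter updates then deliver the real-time nowcasts as each series arrives.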


Machine-Learning Techniques for Family Demography: An Application to the Divorce Determinants in Germany

Dr Bruno Arpino (Universitat Pompeu Fabra) - Presenting Author
Dr Marco Le Moglie (Bocconi University)
Professor Letizia Mencarin (Bocconi University)

Demographers often analyze the determinants of life-course events with parametric, regression-type approaches. Here, we present a class of nonparametric approaches, broadly defined as machine learning (ML) techniques, and discuss advantages and disadvantages of a particular type known as random forest. We argue that random forests can be useful as a substitute for or a complement to more standard parametric regression modeling. We illustrate their implementation by analyzing the determinants of divorce with GSOEP data (for women entering a marriage from 1984 to 2015). The algorithm classifies divorce determinants according to their importance, highlighting the most powerful ones, which in our data are the partners' subjective well-being, their age, and certain personality traits (i.e., the partner's extroversion and, though with less power, the woman's conscientiousness, agreeableness and openness). We are also able to draw partial dependence plots for the main predictors of survival of the relationship.
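The variable-importance ranking the abstract describes can be illustrated with scikit-learn on synthetic data; the feature names below echo the covariates mentioned in the abstract but the data, model settings and resulting ranking are placeholders, not the study's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for couple-level covariates (well-being, age,
# personality traits); the GSOEP data itself is not reproduced here
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
feature_names = ["wellbeing_w", "wellbeing_p", "age_w", "age_p",
                 "extroversion_p", "conscientiousness_w",
                 "agreeableness_w", "openness_w"]

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Rank predictors by impurity-based importance, as in the
# variable-importance ranking the abstract describes
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda p: p[1], reverse=True)
for name, imp in ranking:
    print(f"{name:20s} {imp:.3f}")
```

The partial dependence plots mentioned in the abstract could then be drawn for the top-ranked predictors, e.g. with scikit-learn's `PartialDependenceDisplay.from_estimator`.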


Comparison of Simple and Complex Predictive Models Applied to the National Surveys on Drug Use and Health: Example of Multiple Visits to Emergency Departments

Dr Georgiy Bobashev (RTI International) - Presenting Author
Dr Li-Tzy Wu (Duke University)

Repeated cross-sectional national surveys provide rich resources to calibrate and validate predictive models. We examined how well demographics, substance use, mental health and other health indicators from multiple years of the National Surveys on Drug Use and Health (NSDUH) predict individual propensity for multiple visits (≥3 per year) to Emergency Departments (ED). We considered the years 2007 to 2015. Models were developed on some years' data, then validated and tested on the others. We compared the performance of backward stepwise logistic regression, LASSO, classification trees, combinations of trees and regressions, neural networks and random forest models.

The area under the curve (AUC) on a test set was 0.79, which is good for a national survey. The models were consistent in selecting predictors across multiple independent datasets, but sensitive to the variable selection method. Different variable selection criteria can be used either to emphasize overall performance or to identify small but high-risk subpopulations ("hotspots"). We examined the complexity of the models by calculating effective degrees of freedom using a Monte Carlo approach. The low complexity of the best-performing models bears on minimal sample size considerations for model prediction and variable selection. While consistency in the choice of top predictors was not affected by increasing the sample size above a certain level, the identification of critical population subgroups benefits from larger samples. Sensitivity analysis showed low sensitivity to the use of sampling weights.
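The model comparison by test-set AUC can be sketched with scikit-learn; the synthetic, class-imbalanced data and all model settings below are illustrative stand-ins for the NSDUH analysis, not the authors' specification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: a rare outcome (frequent ED visits) predicted
# from many covariates, standing in for the survey indicators
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "lasso": LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=5),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "neural_net": MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000,
                                random_state=0),
}

# Fit each model on the training years and score AUC on the held-out data
aucs = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in models.items()}
for name, auc in sorted(aucs.items(), key=lambda p: -p[1]):
    print(f"{name:12s} AUC = {auc:.3f}")
```

The develop-on-some-years, test-on-others design in the abstract corresponds to splitting by survey year rather than the random split used here; survey weights, whose influence the authors found to be low, could be passed via `sample_weight` where supported.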