BigSurv20 program




Official statistics in the era of big data

Moderator: Sofie de Broe (smmg.debroe@cbs.nl)

Friday 13th November, 11:45 - 13:15 (ET, GMT-5)
8:45 - 10:15 (PT, GMT-8)
17:45 - 19:15 (CET, GMT+1)

Updating the paradigm of official statistics: New quality criteria for integrating new data and methods in official statistics

Dr Sofie De Broe (Centre for Big Data Statistics, Statistics Netherlands) - Presenting Author
Dr Peter Struijs (Statistics Netherlands)
Dr Piet Daas (Statistics Netherlands)
Dr Arnout van Delden (Statistics Netherlands)
Dr Joep Burger (Statistics Netherlands)
Dr Jan van den Brakel (Statistics Netherlands)
Dr Olav ten Bosch (Statistics Netherlands)
Dr Kees Zeelenberg (Statistics Netherlands)
Mr Winfried Ypma (Statistics Netherlands)

This paper aims to elicit a discussion on whether a paradigm shift is taking place in official statistics through the emergence of new (unstructured) data sources and methods that may not adhere to established statistical practices and quality frameworks. The paper discusses the strengths and weaknesses of several data sources, and the methodological, technical and “cultural” barriers (as in the culture that reigns in an area of expertise or approach) in dealing with new data and methods in data science, and concludes with suggestions for updating the existing quality frameworks. Statistics Netherlands takes the position that there is no paradigm shift, but that the existing production processes should be questioned and that existing quality frameworks should be updated in order for official statistics to benefit from the fusion of data, knowledge and skills among survey methodologists and data scientists.

Evaluating and improving a text classifier for subpopulations: the case of cyber crime

Dr Arnout van Delden (Statistics Netherlands) - Presenting Author
Dr Dick Windmeijer (Statistics Netherlands)
Ms Carlijn Verkleij (Statistics Netherlands)

With increasing digitalization, reliable figures about cybercrime intensity and its changes over time are very relevant for society. At Statistics Netherlands (CBS), figures about cybercrime are currently based on ‘the safety monitor’. In this sample survey, people are asked whether they were a victim of certain crime types during the past 12 months, including cyber-related crimes. A disadvantage of the survey is that not all victims recall the occurrence of a cybercrime. A potential additional source is the registration of crimes reported to the police (PRD). Within the PRD, the police classified the reports by their main offence, and some of them concern forms of cybercrime. The remaining main offences may also contain cyber-related aspects. CBS was asked by the police whether it is possible to use text mining to estimate those cyber-related aspects, using the story of the declarants in the PRD as well as a clarification field filled in by the police. In 2019, CBS developed a first version (beta product) of a text mining model, which is based on a support vector machine with a bag-of-words approach. With this base model, CBS estimated that about 10% of the reported crimes in the Netherlands contain cyber-related aspects. This base model had a micro F1 score of 0.98, tested on a random test sample of 300 crime reports based on the PRD of 2016.
With this base model, a prediction of the outcome variable can be made for all units in the population. CBS is therefore interested in linking background variables of the victim, such as gender, date of birth and education level, to the PRD in order to publish a cross table of cyber intensity by crime type and by background variables of the victims. The question, however, is whether such output is reliable. In the current paper we limit ourselves to cyber intensity by crime type.
We address three issues. First, we identify which quality aspects are needed to determine whether the model can be used in official statistics to estimate cybercrime intensity by crime category. We limit ourselves to the following quality aspects: interpretability, stability, micro-level prediction performance (F1 and entropy), bias and variance of estimated totals, and domain specificity. Each of those aspects will be explained further in the presentation. Second, we test how well the beta product performs on each of those aspects. Third, we test the effect of a number of adaptations to the beta product on the score of the derived quality measures. Among other things, we test the effect of retraining the base model on different combinations of crime categories. We end the paper by discussing which next steps are needed to move from a beta product to official statistics.
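
As an illustration of the micro-level prediction performance mentioned above, the following minimal sketch computes a micro-averaged F1 score by pooling true positives, false positives and false negatives over all classes. The labels and the function name `micro_f1` are hypothetical and are not taken from the CBS model; the sketch only shows how such a score can be derived from true and predicted labels.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for four crime reports (illustration only).
y_true = ["cyber", "other", "other", "cyber"]
y_pred = ["cyber", "other", "cyber", "cyber"]
print(micro_f1(y_true, y_pred))  # 0.75
```

When every report has exactly one label, the micro-averaged F1 equals plain accuracy, which is why a score computed per crime category can be more informative than a single overall figure.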

Detecting innovative companies via the text on their website

Dr Piet Daas (Statistics Netherlands) - Presenting Author
Dr Suzanne van der Doef (Statistics Netherlands)

Getting an overview of the innovative companies in a country is a challenging task. Traditionally, this is done by sending a questionnaire to a sample of companies. This approach, however, puts a burden on companies, may result in considerable non-response and usually focuses only on large companies. We therefore investigated an alternative, big-data-oriented approach: determining whether a company is innovative by studying the text on the main page of its website. For this task, a logistic regression model was developed based on the texts of the websites of companies included in the Community Innovation Survey of the Netherlands. The latter is a standardized survey carried out every two years in a whole range of European countries that focuses on the detection of innovative companies with 10 or more working persons, i.e. the large companies. We developed a text-based model that can be used to classify the websites of both large and small companies. Before these findings could be adequately checked, two major issues needed to be solved. The first was model stability, which was dealt with by updating the model with newly classified cases followed by retraining. This increased the number of words included in the model and enabled its application over a longer period of time (more on this in the presentation). The second issue was correcting for the bias in the model-based results. This was needed because: i) not all innovative companies have a website, ii) not all websites contained sufficient words to be classified, and iii) the model-based approach introduced a bias because the ratio between the false positive and false negative cases was unequal. Both issues were solved and, after scraping nearly all web pages of all Dutch companies, the findings for both large and small companies in the Netherlands were determined. With the model-based approach, the result from the Community Innovation Survey could be reproduced, i.e. a similar number of large innovative companies was detected.
The results of the model on small innovative companies were checked by validating the model on the websites of a list of startups and on a large set of small companies from the Business Register. The findings of the latter were manually checked, double-blinded, for a random sample of 1,000 companies. Detailed results for companies with large and small numbers of employees, examples of new products including some visualizations, and the issues encountered during the preparation of the first official numbers are discussed in the presentation.
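
The bias caused by unequal numbers of false positives and false negatives can be illustrated with a standard misclassification correction. The sketch below is not the authors' method; it assumes that the classifier's sensitivity and specificity are known (for example, estimated from a manually checked sample) and inverts the relation p_obs = sens * p + (1 - spec) * (1 - p). The function name and the numbers are hypothetical.

```python
def corrected_share(observed_share, sensitivity, specificity):
    """Correct a classifier-based share for misclassification.

    Solves p_obs = sens * p + (1 - spec) * (1 - p) for the true share p.
    Assumes sensitivity + specificity > 1 (classifier better than chance).
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    p = (observed_share + specificity - 1.0) / denominator
    # Clip to the valid range, since sampling noise can push p outside [0, 1].
    return min(1.0, max(0.0, p))


# Hypothetical example: 41% of websites are classified as innovative by a
# classifier with sensitivity 0.9 and specificity 0.8.
print(corrected_share(0.41, sensitivity=0.9, specificity=0.8))
```

In this made-up example the classifier produces more false positives than false negatives, so the uncorrected share overstates the proportion of innovative companies.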



Understanding the difference in freight transport estimates with and without road sensor data

Mr Jonas Klingwort (University of Duisburg-Essen & Statistics Netherlands) - Presenting Author
Dr Joep Burger (Statistics Netherlands)
Dr Bart Buelens (Vlaamse Instelling voor Technologisch Onderzoek)
Professor Rainer Schnell (University of Duisburg-Essen)

Capture-recapture (CRC) is currently considered a promising method for using big data in official statistics. There are first applications, but further research on the validity of the developed estimators is required. We previously applied CRC to estimate road freight transport with survey data as first capture and road sensor data as second capture, using license plate and time-stamp to identify re-captured trucks. A considerable difference was found between the design-based survey estimate and the model-based CRC estimate. One of the possible explanations is underreporting in the survey, which is conceivable given the high response burden of diary questionnaires. In this paper, we study alternative explanations and quantify their effect on the CRC estimates to make a stronger case for the use of CRC in assessing survey underreporting.

First, truck owners are asked to report the day of loading, whereas the sensors measure the day of driving. To study the effect of this mismatch, we simulated that the truck was driving on all days following the reported day of loading. Although this decreased the estimated difference between the survey and CRC estimate, the difference would not drop below about 10%. Correcting for the opposite error, overreporting, by collapsing multiple days of loading to the first reported day only increased the difference.

Second, putative underreporting by survey respondents could instead be overdetection by sensors. Detected trucks may not have to be reported, for instance, because they are empty or are driving for maintenance. Implausibly high proportions of false positives can completely explain the difference, but the difference remains substantial at more reasonable, albeit unknown, rates. The difference can be shown to be robust against linkage errors and sensor failure, although precision is compromised.

Third, trucks reported as not owned are considered a frame error, assuming they have been scrapped or exported. If they are considered nonresponse, and none of them is detected by a sensor, the estimated underreporting would shrink by about 6 percentage points, still leaving about 12% to 17% unexplained. If none is detected by a sensor, however, it makes more sense to treat them as frame error. If all are detected, the difference would not change.

Fourth, if the difference were caused by underreporting in the survey, this would only be apparent in the manual response modes (web and paper) and not in the automatic response mode (XML). Estimating the difference by response mode indeed showed that the difference is much lower in XML than in the manual modes. The remaining difference in XML, however, suggests that 12%-15% can be attributed to underreporting. The remaining difference could, for instance, be a nonresponse error not corrected for by poststratification. The CRC model selection supports this hypothesis by choosing variables currently not included in the post-strata.

In conclusion, alternative hypotheses are unlikely to fully explain the difference between a design-based survey estimate and a model-based CRC estimate combining survey and sensor data. Underreporting remains a likely explanation, although the precise amount is hard to estimate.
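
The basic idea behind capture-recapture can be sketched with the simplest two-source estimator, the Lincoln-Petersen estimator N = n1 * n2 / m. This is a simplification under strong assumptions (independent captures, perfect linkage, a closed population) and not the model-based CRC estimator used in the paper. The license plates, days and function name below are hypothetical.

```python
def lincoln_petersen(first_capture, second_capture):
    """Two-source capture-recapture estimate of the population size.

    first_capture, second_capture: sets of unit identifiers, here
    (license plate, day) pairs. Units found in both sets are the recaptures.
    """
    n1 = len(first_capture)
    n2 = len(second_capture)
    m = len(first_capture & second_capture)
    if m == 0:
        raise ValueError("no recaptured units; the estimate is undefined")
    return n1 * n2 / m


# Hypothetical truck-days reported in the survey (first capture)
# and detected by road sensors (second capture).
survey = {("AB-12-CD", 1), ("EF-34-GH", 1), ("IJ-56-KL", 2), ("MN-78-OP", 2)}
sensors = {("AB-12-CD", 1), ("IJ-56-KL", 2), ("QR-90-ST", 3)}
print(lincoln_petersen(survey, sensors))  # 6.0
```

In this toy example, four truck-days are reported in the survey while the estimator suggests six, which is the kind of gap between a survey count and a CRC estimate that the alternative explanations above try to account for.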



Inferring a transport network from road sensor data without a sampling design

Dr Jonas Klingwort (University of Duisburg-Essen)
Dr Joep Burger (Statistics Netherlands) - Presenting Author
Dr Bart Buelens (VITO)
Professor Rainer Schnell (University of Duisburg-Essen)

In official statistics, big data can potentially enable statistics to be produced more cheaply, faster, and at a higher level of detail. In contrast to traditional sample surveys, however, big data typically lacks a sampling design: not every element in the population has a known and positive probability of being observed. Correcting for missing data using design-based inference methods is therefore not possible. New methodology is needed to use big data for population inference in official statistics.

In our application, we aim to infer the truck traffic distribution in the Dutch road network from sensors. Not all road segments have sensors, and road segments with sensors have not been randomly sampled. We therefore model the probability of detecting a truck on a road segment as a function of features of the network, truck and truck owner using logistic regression. The modeled relationship is used to predict the detection probability for all trucks in the vehicle register for all road segments. The modeled probabilities are multiplied by the number of trucks registered in the national vehicle register, resulting in edge counts and corresponding shipment weights. An important assumption is that data are missing at random, i.e., missing data can be explained by the features of the network, truck and owner.
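
The step from modeled detection probabilities to edge counts can be sketched as follows. This is a minimal illustration, assuming a logistic model that has already been fitted; the coefficients, the features and the function names are made up and are not estimates from the paper. Summing the predicted probabilities over all registered trucks gives the expected number of trucks on one edge.

```python
import math


def detection_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum_k b_k * x_k)))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


def expected_edge_count(truck_features, coefficients, intercept):
    """Expected number of registered trucks on one edge:
    the sum of the predicted detection probabilities over all trucks."""
    return sum(
        detection_probability(x, coefficients, intercept) for x in truck_features
    )


# Hypothetical feature rows, one per registered truck:
# (scaled traffic intensity of the edge, truck age in decades).
trucks = [[0.8, 0.5], [0.8, 1.2], [0.8, 0.1]]
coefficients = [1.5, -0.7]  # made-up values for illustration
intercept = -2.0            # made-up value for illustration
print(round(expected_edge_count(trucks, coefficients, intercept), 3))
```

If all trucks shared the same predicted probability, the sum would reduce to that probability multiplied by the number of registered trucks.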

The Dutch transport network was constructed by web scraping interchange road junctions (the vertices) and their connecting freeways (the directed edges) from www.wegenwiki.nl. Six vertex features were computed: degree (number of incoming and outgoing edges), strength (total weight of incoming and outgoing edges), betweenness (number of shortest paths passing through), closeness (inverse of the mean length of the shortest paths to all other vertices), vulnerability (loss of efficiency when removed) and clustering coefficient (degree to which neighboring vertices are interconnected), using the inverse haversine distance between vertices as the edge weight. More realistic vertex feature values were obtained by expanding the network with neighboring freeways in Belgium and three Northwest-German states. The most important edge feature is general traffic intensity from an extensive road sensor network generating aggregate counts.
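
Two of the vertex features described above, degree and strength, can be sketched in plain Python with the inverse haversine distance as edge weight. The junction names, coordinates and edges are hypothetical, and an Earth radius of 6371 km is assumed; the other features (betweenness, closeness, vulnerability, clustering coefficient) would normally come from a graph library and are not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def degree_and_strength(coords, edges):
    """Degree (incoming plus outgoing edges) and strength (sum of their weights)
    per vertex, with weight = 1 / haversine distance between the end points."""
    degree = {v: 0 for v in coords}
    strength = {v: 0.0 for v in coords}
    for source, target in edges:
        weight = 1.0 / haversine_km(*coords[source], *coords[target])
        for vertex in (source, target):
            degree[vertex] += 1
            strength[vertex] += weight
    return degree, strength


# Hypothetical junctions (latitude, longitude) and directed freeway edges.
coords = {"A": (52.37, 4.90), "B": (52.09, 5.12), "C": (51.92, 4.48)}
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
degree, strength = degree_and_strength(coords, edges)
print(degree)
```

With inverse distances as weights, short connections contribute more to a vertex's strength than long ones.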

In the Dutch network, the ministry of infrastructure has installed a Weigh-in-Motion sensor system to detect overloaded trucks. The system consists of 18 sensors: nine locations with two sensors in either direction. Each sensor was assigned to an edge of the graph using its geolocation. When a truck passes a sensor station, it is weighed and classified, and a photograph of the front license plate is taken, which is used to identify Dutch trucks and link information about the truck and truck owner from the vehicle and enterprise registers.

Preliminary results using only vertex features and a single week show a strong correlation between actual and predicted counts. Cross-validation shows, however, that the model does not yet generalize well to other edges. Potential improvements, including more features and more weeks, are currently under study. Our approach yields new statistics and can potentially speed up the production of current road freight statistics.