BigSurv18 program







Big Data Applications to Enterprise Statistics: Businesses, Employers, and Consumers

Chair: Professor Mark Trappmann (IAB, University of Bamberg)
Time: Saturday 27th October, 11:00 - 12:30
Room: 40.063

Synthesising Big Data and Business Survey Data

Final candidate for the monograph

Mr Matthew Greenaway (Office for National Statistics (U.K.)) - Presenting Author

At the UK Office for National Statistics (ONS) we are looking to expand our use of alternative data sources, including 'big' data, to improve how we measure the modern economy and society.

This presentation will summarise the experience of the ONS Big Data team in bringing together web data about businesses – for example, data from business websites or from jobs portals – and traditional business survey data to produce new or improved statistics about the UK business population. We will present three case studies:

Case Study 1 - Combining survey data and business website data to improve how we measure business characteristics which are currently captured using a survey - such as the adoption of e-commerce. We will outline the different steps in the pipeline we’ve used to obtain the web data – including extracting information from websites and matching websites to businesses - and the methodologies we’ve investigated at each of these steps, including the use of supervised machine learning. We’ll then discuss the utility of the resulting web data and how it may be used alongside survey data.
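The abstract does not include the pipeline code itself, so the following is only a minimal sketch, assuming a Python scraping step: it fetches a business homepage and flags a few crude e-commerce indicators of the kind that could later feed a supervised model. The indicator terms and the example URL are invented.

```python
# Minimal, hypothetical sketch of the "extract information from websites" step.
# The indicator terms and URL are invented; the real ONS pipeline is not shown
# in the abstract.
import requests
from bs4 import BeautifulSoup

ECOMMERCE_HINTS = {"basket", "checkout", "add to cart", "paypal", "shop online"}

def ecommerce_signals(url: str) -> dict:
    """Scrape a homepage and return crude e-commerce indicator features."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    return {hint: hint in text for hint in ECOMMERCE_HINTS}

print(ecommerce_signals("https://example.com"))  # placeholder URL
```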

Case Study 2 - Combining business register data and business website data to produce research into new types of business activity to supplement standard business classification systems such as NACE. Again, we will outline the steps in our pipeline and elaborate on the techniques we have used, which include natural language processing and clustering to identify ‘topics’, and discuss how the results may be used alongside survey data.
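The specific NLP and clustering techniques are not named in the abstract; as a rough illustration only, the sketch below uses TF-IDF features and k-means (one common choice) to group website texts and label each cluster by its most characteristic terms. The example texts and the number of clusters are placeholders.

```python
# Hedged illustration of topic-style clustering of business-website text.
# TF-IDF + k-means is one common approach; not necessarily the one used by ONS.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

site_texts = [
    "handmade cheese delivered to your door",
    "machine learning consultancy for retail analytics",
    "wedding photography and event planning",
]  # placeholder for millions of scraped business-website texts

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(site_texts)

km = KMeans(n_clusters=2, n_init=10, random_state=0)  # far more clusters in practice
labels = km.fit_predict(X)

# Describe each cluster by its most characteristic terms, as a rough "topic".
terms = np.array(vec.get_feature_names_out())
for k, centre in enumerate(km.cluster_centers_):
    print(k, terms[centre.argsort()[::-1][:5]])
```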

Case Study 3 - Combining survey data and jobs-portal data to produce more timely estimates of UK job vacancies. Here, we will particularly focus on our research and evaluation into issues related to the representativeness of the jobs-portal data.

We will then elaborate on the challenges and opportunities that cut across these case studies, briefly discuss technical challenges such as those involved in linking and matching millions of records, and finally touch on how we have navigated the ethical issues associated with this research.


Consumer Expenditure Statistics From Retail Transaction Data

Mr Sverre Amdam (Statistics Norway) - Presenting Author
Mr Henning Holgersen (Statistics Norway)
Dr Bart Buelens (Statistics Netherlands)

The potential of micro-level transaction data from retail chains as a basis for household budget statistics is evaluated. Currently, such statistics are predominantly survey-based in many countries. However, many statistical agencies are struggling with increasing non-response and high survey costs, creating the need to look for alternative sources for producing household budget statistics.

An explorative analysis of transaction data is presented from a methodological perspective, using a sample of receipt data, including information on loyalty card holders, from one of the largest grocery retail chains in Norway. The sample contains over 121 million transactions, about half of them conducted by card holders. Expenditure patterns of card holders and non-card holders are compared with the Norwegian consumption total, which is known from Consumer Price Index (CPI) statistics. Card holders have a spending pattern that differs both from that of non-card holders and from that of the general population as represented by the CPI. Not only are the card holders a specific subpopulation when it comes to spending; there are also demographic differences: on average, card holders are substantially older and slightly more often female than the general population of Norway.

The observed skewness of the distribution of card holders with respect to these demographics suggests that an adjustment for these variables may remove some of the selection bias seen in the expenditure pattern. We find that this is not the case: adjusting for age and gender has only a very small effect on the resulting spending pattern. We conclude that the card holders are a specific subpopulation with a spending pattern unlike that of the average Norwegian, and that this cannot be explained by age and gender. Other factors must be at play that are not available in the data used in this analysis, and more is required to render these data suitable for unbiased estimation of consumer expenditure. We outline ideas to expand upon the present results.
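The abstract does not spell out how the age and gender adjustment was implemented; the sketch below shows one simple form such an adjustment could take, reweighting cardholder spending so that the age-by-gender distribution matches the population. All file and column names are hypothetical.

```python
# Hypothetical sketch of a post-stratification-style adjustment for age and
# gender; not necessarily the method used by Statistics Norway.
import pandas as pd

tx = pd.read_csv("cardholder_transactions.csv")   # hypothetical: amount, coicop_group, demographics
pop = pd.read_csv("population_shares.csv")        # hypothetical: age_group, gender, pop_share

# Share of cardholders in each age-by-gender cell.
card_share = (
    tx.drop_duplicates("cardholder_id")
      .groupby(["age_group", "gender"]).size()
      .pipe(lambda s: s / s.sum())
      .rename("card_share")
      .reset_index()
)

# Weight each cell so that cardholders mirror the population distribution.
w = pop.merge(card_share, on=["age_group", "gender"])
w["weight"] = w["pop_share"] / w["card_share"]

# Weighted expenditure shares by spending category.
tx = tx.merge(w[["age_group", "gender", "weight"]], on=["age_group", "gender"])
tx["weighted_amount"] = tx["amount"] * tx["weight"]
weighted_shares = tx.groupby("coicop_group")["weighted_amount"].sum()
print((weighted_shares / weighted_shares.sum()).round(3))
```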


Fuzzy Identification, from Raw Survey Data to a Structured Register: An Example from Official Statistics Finding the Employer Declared During the Census in the Companies Register

Mr Benjamin Sakarovitch (INSEE) - Presenting Author
Mrs Julie Djiriguian (INSEE)


Identifying a survey answer in a register has many applications, from record linkage to automatic coding. Finding the closest record in the register from a free-text declaration remains a challenging task. For instance, during the census, surveyed people are invited to declare where they work by providing three elements: the name of the company, its address and its field of activity. From that information it is not easy to find the exact employer in the companies register. It is also a very costly process: in the case of the French national statistical institute (INSEE), more than half of the work is presently done manually.
Hence the methodology presented in this communication aims at improving automatic identification in the companies register. The process consists of three main phases: data enrichment, querying the register, and finally scoring the results.
The first step is to add information both to the survey answer and to the companies register. The different addresses are geocoded so that they can be reconciled geographically, regardless of approximations in the declaration or of administrative boundaries. This enrichment relies on an external structured database linking addresses to coordinates. A complementary approach to enriching the data consists in retrieving information from the web through web scraping. In this case we benefit from other company registers, for example phone directories, that are available online through a search engine; these give access to a normalised name for the company. Web crawling also provides possible alternative names for the firm from the unstructured information present on the internet. A third method for enriching the data is concept extraction: by projecting the census declaration about firm activity and profession onto a limited vocabulary, it becomes possible to link to firms through broader concepts.
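As a small illustration of the geographical-reconciliation idea (with made-up coordinates and an arbitrary 1 km tolerance), once both the declaration and the register entries are geocoded, candidate employers can be filtered by distance rather than by comparing address strings:

```python
# Illustrative only: geocoded inputs, identifiers and the distance threshold are invented.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

declared = {"name": "BOULANGERIE DUPONT", "lat": 48.8534, "lon": 2.3488}  # geocoded declaration
candidates = [
    {"id": "A", "name": "DUPONT SARL", "lat": 48.8540, "lon": 2.3470},  # Paris
    {"id": "B", "name": "DUPONT SARL", "lat": 45.7640, "lon": 4.8357},  # Lyon
]

nearby = [c for c in candidates
          if haversine_km(declared["lat"], declared["lon"], c["lat"], c["lon"]) < 1.0]
print(nearby)  # only the Paris candidate survives the geographic filter
```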
Once the data has been cleaned and enriched, the next phase for massive identification is to index the register so that it can feed a search engine such as Solr or Elasticsearch. These technologies have the advantage of being finely tunable: the string distance can be customised, for example by adapting the Damerau-Levenshtein distance. The query used to find an element through the search engine may use a varying number of attributes, successively adding more and more.
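A minimal sketch of such a query, assuming an Elasticsearch index of the register (index and field names are invented, and the call uses the elasticsearch-py 8.x client signature), is shown below; the engine's relevance scores then feed the scoring phase described next.

```python
# Hedged sketch: index name, field names and the successive-attribute strategy
# are illustrative, not INSEE's actual configuration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def query_register(name, city=None, activity=None):
    """Query the indexed register, successively adding more attributes."""
    must = [{"match": {"company_name": {"query": name, "fuzziness": "AUTO"}}}]  # edit-distance tolerance
    if city:
        must.append({"match": {"city": {"query": city, "fuzziness": 1}}})
    if activity:
        must.append({"match": {"activity_label": activity}})
    # elasticsearch-py 8.x style; older clients take body={"query": ...}
    return es.search(index="companies_register", query={"bool": {"must": must}})

hits = query_register("BOULANGERIE DUPONT", city="PARIS")
for h in hits["hits"]["hits"]:
    print(h["_score"], h["_source"]["company_name"])
```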
Finally, the last part of the process is to score the quality of the hits returned by the search engine. This is an important step, as the goal is not only to find the most likely element in the register but also to know how trustworthy the identification procedure has proved.


Identifying Innovative Companies From Their Website

Miss Suzanne van der Doef (Statistics Netherlands)
Dr Piet Daas (Statistics Netherlands) - Presenting Author
Mr Dick Windmeijer (Statistics Netherlands)

Getting an overview of the innovative companies in a country is a challenging task. One way of doing this is to set up a survey and contact a sample of companies, for instance by phone or via a questionnaire. The responses can be used to derive how many innovative companies there are in a country or area. This approach, however, puts a burden on companies and may result in considerable non-response. Another downside is that such surveys usually focus on large companies and less on smaller ones. We therefore looked for an alternative approach and came up with the idea of determining whether a company is innovative by studying the text on the main page of its website. To enable this, the following steps were applied:

I) Selecting a set of known innovative and non-innovative companies;
II) Making sure that for each company the corresponding URL of their website is available;
III) Scraping the main page of each website and pre-processing the text displayed;
IV) Developing a model to determine whether a company is innovative or not, based on the pre-processed texts.

We started with a sample of 3000 innovative and 3000 non-innovative companies according to the Community Innovation Survey of Statistics Netherlands. The first thing observed was that two-thirds of the URLs of the companies selected were absent from the business register. These URLs were added via the URL-finding approach developed in WP2 of the ESSnet Big Data (Deliverable 2.2). Since the companies included in the survey all had 10 or more employed persons, we decided to additionally add a considerable number of smaller innovative companies. Here, Dutch companies listed in the yearly SME-innovation top-100 for the years 2008-2017 were used; any duplicates were removed. Next, the text displayed on the main web page of each company was scraped with Python. After language detection (usually Dutch or English), punctuation marks and stop words were removed and the remaining words were stemmed. This was used as input for the model.

Here, it was found that logistic regression with an L1 norm performed well. With a 70%-30% training and test split, the trained model was able to determine whether a company was innovative or not with 93% accuracy. However, special attention needed to be paid to two-character words: excluding them decreased the model's accuracy to 63%. Web pages that displayed a lot of email addresses and URLs were found to produce large amounts of these words, so including two-character words would make the approach very sensitive to such features. It was therefore decided to focus on an approach solely using words of three or more characters. Here the combination of unigrams and word embeddings performed well, resulting in an accuracy of 91%. More details will be described in the paper and in the presentation.
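The classification step lends itself to a short sketch. The following is a rough, hypothetical reconstruction of what is described above (unigram features restricted to words of three or more characters, and L1-penalised logistic regression), not the authors' actual code; file and column names are made up, and the word-embedding part is omitted.

```python
# Hypothetical reconstruction of the classification step; not the authors' code.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

df = pd.read_csv("company_homepages.csv")  # hypothetical: pre-processed text + innovative label

X_train, X_test, y_train, y_test = train_test_split(
    df["homepage_text"], df["innovative"], test_size=0.3, random_state=1
)

model = make_pipeline(
    # Keep only tokens of three or more characters, mirroring the decision
    # to drop one- and two-character words.
    TfidfVectorizer(token_pattern=r"(?u)\b\w{3,}\b"),
    LogisticRegression(penalty="l1", solver="liblinear"),
)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```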


Requirements in Job Advertisements: Automated Detection and Classification Into a Hierarchical Taxonomy of Work Equipment (Tools)

Mr Manuel Schandock (Federal Institute for Vocational Education and Training (BIBB)) - Presenting Author


In labour market research we face a serious lack of empirical information about ongoing trends in employers' demand for competences, skills and experience with technical equipment. In Germany we can use a wide range of labour-market-related surveys and process-induced data, collected and provided by federal institutions such as the Federal Statistical Office (DESTATIS) or the Institute for Employment Research (IAB). But these data either suffer from a long time lag between data collection and data access (surveys), or they miss deep information about job requirements (process-induced data). Job advertisements, in contrast, are a rich source of information: they are very up to date, and they are convenient and quite cheap to collect, at least when online sources are used. But there is also a challenge: the vast amount of information in job advertisements is completely unstructured. Hence we have to deal with natural language texts and dig for the information before we can analyse the data (data mining).

We will present a complex workflow for mining the experience with technical equipment (tools) that employers require of employees and state in job ads (information extraction). We developed a hierarchical taxonomy with almost 60 items and present a framework to classify the detected tools into this taxonomy. The data we use is based on a large number of job ads hosted by the German Federal Employment Agency from 2011 up to 2017, amounting to more than 3,000,000 ads in total. In addition, we have access to more than 10,000,000 web-crawled job ads from 2014 until now.

We aimed to

1. develop an empirically based classification of tools,
2. enrich that classification with a large number of exactly named tools, for
3. productive use in CATI or CAPI questionnaires (automated coding).

The presentation will cover

1. a short introduction to the data we use,
2. a look into the different phrases used to state individual employers' needs,
3. the algorithm to detect all these different phrases in thousands of job ads,
4. the algorithm to collapse different phrases into canonical phrases,
5. the machine learning approach to classify these canonical phrases into a taxonomy of tools (see the sketch after this list) and
6. the quantity structure in our data.
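A minimal sketch of steps 3-5, with invented phrase patterns, variant mappings and taxonomy labels (the actual BIBB algorithms are not detailed in the abstract):

```python
# Invented example phrases, variant map and taxonomy labels; illustrative only.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Step 3: detect tool phrases in an ad with simple patterns.
TOOL_PATTERN = re.compile(r"\b(ms[- ]?excel|excel|sap(?: r/3)?|autocad)\b", re.IGNORECASE)

# Step 4: collapse surface variants into canonical phrases.
CANONICAL = {"ms excel": "Microsoft Excel", "ms-excel": "Microsoft Excel",
             "excel": "Microsoft Excel", "sap r/3": "SAP", "sap": "SAP",
             "autocad": "AutoCAD"}

ad = "Wir erwarten sichere Kenntnisse in MS-Excel und SAP R/3."  # "We expect solid skills in MS Excel and SAP R/3."
found = {CANONICAL[m.group(0).lower()] for m in TOOL_PATTERN.finditer(ad)}
print(found)  # {'Microsoft Excel', 'SAP'}

# Step 5: classify canonical phrases into the tool taxonomy with a model
# trained on already-coded examples (labels here are hypothetical).
train_phrases = ["Microsoft Excel", "LibreOffice Calc", "SAP", "AutoCAD", "CATIA"]
train_labels = ["spreadsheet software", "spreadsheet software",
                "ERP software", "CAD software", "CAD software"]
clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    KNeighborsClassifier(n_neighbors=1))
clf.fit(train_phrases, train_labels)
print(clf.predict(["Excel 2016"]))
```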

Short abstract:
In labour market research we face a serious lack of information about ongoing trends in employers' demand for competences, skills and experience with technical equipment (tools). We will present a complex workflow for mining the tool experience that employers require of employees and state in job ads. A taxonomy has been developed, and a machine learning framework to classify the detected tools into this taxonomy will be presented. We aim to enrich that classification with a large number of exactly named tools for productive use in CATI or CAPI questionnaires (automated coding).