BigSurv18 program


Putting Text into Context: Exploring Classification and Automation of Textual Survey Data

Chair: Professor Matthias Schonlau (University of Waterloo)
Time: Friday 26th October, 11:30 - 13:00
Room: 40.154

A comparison of automatic algorithms for occupation coding

Mr Malte Schierholz (Institute for Employment Research (IAB)) - Presenting Author


Occupation is a core organizational principle in our society. Yet, the measurement of occupation in surveys is error-prone and time-consuming. The commonly recommended approach is to ask respondents two or three open-ended questions about their jobs. The answers are coded afterwards into an occupational classification such as the International Standard Classification of Occupations (ISCO-08) or the German Classification of Occupations (KldB 2010).

The workload for coding hundreds of thousands of answers after data collection is considerable, and its automation is highly desirable. One approach is computer-assisted coding, where a computer program suggests candidate job categories and a human coder selects the most appropriate one. Further automation is achieved when the computer is allowed to choose the final category without human intervention.

However, high-quality coding is not always possible since some verbal answers are vague and ambiguous. As a remedy, Schierholz et al. (2018) proposed coding answers at the time of the interview, making use of technology commonly used in Big Data analysis. As in computer-assisted coding, an algorithm suggests candidate job categories, but now the respondent is prompted to select the most appropriate category, giving her the opportunity to specify her occupation. For further improvement, the authors recommended that the algorithm should account better for short and noisy input texts. The current paper presents a new algorithm to pursue this idea.
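The new algorithm is not described in detail in this abstract. Purely as an illustration of the underlying idea of suggesting candidate categories for short, noisy answers, the following Python sketch ranks toy categories by character n-gram similarity between an answer and example job titles; all titles, codes, and parameter choices below are hypothetical, not the authors' method.

# Hypothetical sketch: suggest candidate occupation categories for a short,
# noisy free-text answer by character n-gram similarity to known job titles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy coding index: example job titles mapped to fictitious category codes.
index_titles = ["nurse", "software developer", "truck driver", "primary school teacher"]
index_codes  = ["322",   "251",                "833",          "234"]

# Character n-grams are more robust to misspellings than whole words.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
title_matrix = vec.fit_transform(index_titles)

def suggest(answer, k=3):
    """Return the k most similar categories for a verbal answer."""
    sims = cosine_similarity(vec.transform([answer]), title_matrix).ravel()
    ranked = sims.argsort()[::-1][:k]
    return [(index_codes[i], index_titles[i], float(sims[i])) for i in ranked]

print(suggest("sofware developr"))  # misspelled input still ranks code 251 first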

Several researchers have developed algorithms to automate occupation coding, either during or after the interview. This includes algorithms from Creecy et al. (1992) and Gweon et al. (2017), who find memory-based reasoning and nearest neighbor approaches to work well. Measure (2014) concludes that Support Vector Machines and multinomial logistic regression perform better than naïve Bayes classifiers. In general, few comprehensive comparisons have been made and little is known about optimal algorithms for occupation coding.

Against this background, this paper makes two major contributions:

1. We compare different algorithms for automatic occupation coding, including regularized logistic regression, gradient boosting, nearest neighbors, and memory-based reasoning.
2. We introduce a new algorithm that is based on string similarities and uses Bayesian ideas. Preliminary results suggest that it compares favorably with other methods.

Predictive algorithms can be used for fully automated coding, computer-assisted coding and interview coding. Our comparison acknowledges that these three tasks are different and each task requires its own metrics for evaluation. We use data from four different surveys, but in principle the same algorithms can be used to classify texts from online job portals and other data sources as well.
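As a rough illustration of what such a task-specific comparison can look like (not the paper's actual data or setup; the toy answers, codes, and model settings below are placeholders), the sketch fits two of the named classifier types and scores them both by top-1 accuracy, relevant for fully automated coding, and by whether the true code appears among the top three suggestions, closer to computer-assisted or interview coding.

# Illustrative comparison sketch; data, features, and settings are placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

texts = ["nurse on a hospital ward", "develops mobile apps", "drives a delivery truck",
         "teaches maths at a primary school"] * 25          # toy verbal answers
codes = ["322", "251", "833", "234"] * 25                   # toy occupation codes
X_tr, X_te, y_tr, y_te = train_test_split(texts, codes, test_size=0.3, random_state=0)

models = {
    "regularized logit": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    "nearest neighbors": make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=5)),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    top1 = model.score(X_te, y_te)                          # fully automated coding
    proba = model.predict_proba(X_te)
    top3_idx = np.argsort(proba, axis=1)[:, -3:]            # candidate lists of size 3
    classes = model.classes_
    top3 = np.mean([y in classes[idx] for y, idx in zip(y_te, top3_idx)])
    print(f"{name}: top-1 accuracy {top1:.2f}, true code in top-3 {top3:.2f}")

Gradient boosting or other classifiers could be slotted into the same dictionary and evaluated with the same metrics.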

References

Creecy, Masand, Smith, and Waltz (1992). Trading MIPS and memory for knowledge engineering. Communications of the ACM 35(8).

Gweon, Schonlau, Kaczmirek, Blohm, and Steiner (2017). Three methods for occupation coding based on statistical learning. Journal of Official Statistics 33(1).

Measure (2014). Automated coding of worker injury narratives. Proceedings of the Government Statistics Section, American Statistical Association.

Schierholz, Gensicke, Tschersich, and Kreuter (2018). Occupation coding during the interview. Journal of the Royal Statistical Society: Series A 181(2).


How to Make AI do Your Job for Statistical Classification of Industry and Occupation

Mr Jukka Kärkimaa (Statistics Finland) - Presenting Author
Ms Liisa Larja (Statistics Finland)


Coding a person's occupation and their workplace's industry according to the statistical classifications (NACE, ISCO) can be very labour intensive. The classifications are often so complex that neither the survey respondent nor the interviewer can choose the right code alone. Instead, interviewees are frequently asked various auxiliary questions, and coding to the classification is done manually at the statistical office after the survey interview. In the Finnish Labour Force Survey (LFS), 15% of statisticians' working time is devoted solely to this task. In this presentation we show how to make artificial intelligence do this work for you using machine learning methods.

We present classification results from applying ensemble methods, such as random forest classifiers, to combined register and survey data. We built several models based on demographic and socioeconomic register variables, survey data, and free-text survey responses. Our models predicted industry and occupation from these factors with accuracies exceeding 90%. We also present work on assessing the reliability of the underlying methods.

By using natural language processing, we can train probabilistic and algorithmic predictors of industry code based on free-text input from the Labour Force Survey. Combining the free-text-based model with the register-data-based model, we can improve on the results of either model individually.
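The abstract does not specify how the two models are combined. One simple possibility, sketched below with toy data, illustrative variable names, and an assumed equal weighting, is to average the class probabilities of a free-text model and a register-variable model.

# Hypothetical combination sketch; not Statistics Finland's actual setup.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: register variables plus a free-text job description.
register = pd.DataFrame({"age": [34, 51, 29, 45] * 25,
                         "education_level": [3, 5, 2, 6] * 25})
free_text = ["builds wooden houses", "treats patients at a clinic",
             "drives a city bus", "teaches at a university"] * 25
industry = ["F41", "Q86", "H49", "P85"] * 25                # toy NACE-style codes

text_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
text_model.fit(free_text, industry)

register_model = RandomForestClassifier(n_estimators=200, random_state=0)
register_model.fit(register, industry)

# Combine the two models by averaging their predicted class probabilities
# (both expose classes in the same sorted order).
p_text = text_model.predict_proba(free_text)
p_reg = register_model.predict_proba(register)
p_combined = (p_text + p_reg) / 2
predicted = text_model.classes_[p_combined.argmax(axis=1)]
print((predicted == np.array(industry)).mean())             # in-sample agreement on toy data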


Topic Modeling and Status Classification Using Data From Surveys and Social Networks

Final candidate for the monograph

Mr Suat Can (University of Bremen (Germany), Social Science Methods Centre) - Presenting Author
Professor Uwe Engel (University of Bremen (Germany), Social Science Methods Centre)
Ms Jennifer Keck (University of Bremen (Germany), Social Science Methods Centre)

Is it possible to achieve comparable population estimates of topical clusters using wholly different kinds of data access? Against the background of an emerging computational social science, the contrast between text mining and survey methods appears especially interesting. Three basic aspects deserve particular attention: (1) comparability is likely to be achievable only at the latent level (e.g., by techniques of probabilistic classification); (2) it should be possible to control for common sources of error (e.g., frame, selection, and measurement effects); and (3) approximately the same type of content should be available across the different kinds of data (e.g., regarding social status and socio-demography, respectively).

In line with this, the paper deals with data and related metadata. First, it discusses topic modeling of textual data. This technique belongs to the unsupervised methods of machine learning and is thus completely automatic. We employ this method and compare its results with a parallel content analysis based on manual coding. Second, the paper addresses the possibility of inferring educational level from textual social-network data – for example, data from Facebook, Xing or Twitter – using a readability index. Such an index represents a form of vocabulary analysis and measures the ease with which a native speaker understands textual documents. Third, the paper is concerned with the possibility of inferring gender and immigration background from the first names and surnames of comments' authors, by reference to relevant databases and the development of probabilistic models for the automatic recognition of these status characteristics. Whereas surveys collect data on social status on a regular basis, there is a strong need to enrich analyses of textual social-network data in this regard, in order to fill the gap that would otherwise remain. Besides assessing comparability, the paper thus aims to evaluate the practicability of the latter two techniques using the example of the present study.
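As a minimal illustration of the first step only, and assuming a standard latent Dirichlet allocation model (the abstract does not name the specific topic model used), toy documents can be decomposed into latent topics whose per-document proportions are the kind of latent-level quantity that could be compared across data sources.

# Illustrative topic-modeling sketch; documents and parameters are toy placeholders.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["rents are rising faster than wages in the city",
        "new bike lanes make commuting safer",
        "affordable housing policy dominates the council debate",
        "cycling infrastructure needs more funding"]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)                        # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)                  # per-document topic proportions

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top)}")
print(doc_topics.round(2))                           # latent-level output for comparison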

Study design: The paper applies topic modeling to (a) relevant comments posted in social networks and (b) responses to open-ended survey questions on the same publicly discussed topic(s). To achieve temporal comparability, all data are collected contemporaneously within a short period of time. While web scraping techniques are used to extract the comments, two web surveys are conducted to collect the accompanying survey data. The surveys are held comparable in terms of content; one is probability-based and the other non-probability-based, in order to assess and control for effects of (self-)selection.


Machine Learning and Verbatim Survey Responses: Classification of Criminal Offences in the Crime Survey for England and Wales

Mr Peter Matthews (Kantar Public) - Presenting Author
Mr George Kyriakopoulos (Kantar Public)
Miss Maria Holcekova (Kantar Public)


Since 1981, the Crime Survey for England and Wales (CSEW) has been used alongside police records to measure crime in England and Wales. Around 35,000 interviews are conducted each year regarding individuals’ experiences of crime. Respondents who have been a victim of a crime in the last year are asked to describe the incident. These descriptions, along with other data collected in the interview, are used to code incidents under particular categories of offence (‘Robbery’, ‘Attempted assault’, ‘Theft of car/van’ etc.). The intention is that this categorisation should be as consistent as possible with the way crimes are classified by the police.

The CSEW is used to estimate the extent and nature of crime in England and Wales. As such, it is critical that the coding of offences is consistent and highly accurate. Currently, trained human coders review each case and, using a specially developed coding questionnaire, assign a particular offence code. Supervisors check a random subset of all cases, as well as any cases where the original coder was uncertain of the correct classification. Further checks are then conducted by researchers at the UK's Office for National Statistics. As a whole, this process is time-consuming and resource-intensive.

Using 16 years of CSEW data, we investigate the possibility of using machine learning techniques to reduce the amount of human coding required. We test a number of different machine learning classifiers, assessing their ability to accurately predict the final offence coding. We also build an ensemble model to improve our overall confidence in predictions. We then propose a set of coding rules whereby some cases may be automatically coded with high confidence, while others are referred to human coders for verification or further manual coding. Finally, we discuss the risks and challenges of using automated classification procedures in such a high-profile and sensitive subject area.
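The exact coding rules are not given in the abstract. A minimal sketch of one such rule, with an illustrative confidence threshold and made-up ensemble outputs, routes each case by the model's highest predicted class probability: auto-code it if the probability clears the threshold, otherwise refer it to a human coder.

# Minimal illustration of a confidence-based coding rule; threshold and
# probabilities are illustrative assumptions, not the paper's actual values.
import numpy as np

def route_cases(probabilities, class_labels, threshold=0.9):
    """Split cases into auto-coded and referred-to-human based on confidence."""
    decisions = []
    for p in probabilities:                          # p: class probabilities for one case
        best = int(np.argmax(p))
        if p[best] >= threshold:
            decisions.append(("auto", class_labels[best]))
        else:
            decisions.append(("human review", None))
    return decisions

labels = ["Robbery", "Attempted assault", "Theft of car/van"]
proba = np.array([[0.96, 0.03, 0.01],
                  [0.55, 0.40, 0.05],
                  [0.10, 0.05, 0.85]])
print(route_cases(proba, labels))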