BigSurv18 program







Classifieds: Coding Open-Ended Responses Using Machine Learning Methods

Chair: Mr Malte Schierholz (Institute for Employment Research (IAB))
Time: Friday 26th October, 09:45 - 11:15
Room: 40.004

Automated Topic Modeling for Trend Analysis of Open-Ended Response Data

In review process for the special issue

Mr Reuben McCreanor (SurveyMonkey) - Presenting Author
Ms Laura Wronski (SurveyMonkey)
Dr Jack Chen (SurveyMonkey)

Web surveys make it possible to collect open-ended responses in a format that was not possible in traditional (telephone or paper-and-pencil) surveys. Past survey research has shown that survey respondents are more open and honest, and less susceptible to social desirability bias, when responding to surveys conducted on the web rather than in person or over the phone. Rather than relying on interviewers to transcribe and code every single response, often by hand, web surveys can make use of new machine learning techniques to analyze data that respondents provide in their own words.

However, with the increased traffic generated by web surveys, we are confronted with new challenges in analyzing the resulting volumes of data. While 100,000 responses pose no challenge for summary statistics, if 100,000 respondents answer an open-ended question with a 50-word response, we are left with roughly five million words, or approximately 12,500 pages of text data at around 400 words per page. At this scale it is no longer practical to analyze open-ended response data with traditional labor-intensive techniques, so we instead turn to machine learning for automated ways of analyzing large volumes of open-ended response data.

Topic modeling with Latent Dirichlet Allocation (LDA) is a well-known text-mining technique for discovering the hidden semantic structure in a body of text. Given a set of open-ended responses, an LDA model assigns to each response a distribution over topics and to each topic a distribution over words. This, in turn, allows us to create quantitative summaries of the key topics mentioned by respondents.
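As a point of reference, a minimal sketch of this step with the gensim library is given below; the toy responses and the choice of two topics are illustrative assumptions, not the authors' data or settings.

```python
# A minimal sketch of fitting an LDA topic model to open-ended responses with gensim.
# The responses and the number of topics are toy choices for illustration only.
from gensim import corpora, models

responses = [
    "the economy is doing well and jobs are coming back",
    "his tweets are embarrassing and divisive",
    "strong on immigration and border security",
]
tokenized = [r.lower().split() for r in responses]

dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# The number of topics still has to be chosen by the analyst (see the next paragraph).
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=0)

print(lda.get_document_topics(corpus[0]))  # distribution of topics within one response
print(lda.show_topic(0, topn=5))           # distribution of words within one topic
```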

However, implementing LDA models on survey data is still a highly manual process. To fit the model, we must first choose the number of topics, and once the model is fitted, we must use the distribution of words in each topic to manually create a label for that topic, which requires subjective judgment from the person running the model. Instead, we propose a novel technique that automatically assigns pre-defined labels to the topics of an LDA model. After running the LDA model, we obtain a list of topics with their associated words. Using the Word2vec algorithm, we create neural embeddings for each of these topics by training a one-layer neural network on the response data. Finally, using the same trained model, we convert the pre-defined topic labels to word vectors and map the labels to our topics using a simple cosine distance metric.
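A rough sketch of this label-mapping step is shown below, assuming gensim (version 4 or later) and numpy; the tokenized responses, topic word lists, and candidate labels are invented for illustration and do not reproduce the authors' implementation.

```python
# Hedged sketch: embed the top words of each LDA topic and each pre-defined label with a
# Word2vec model trained on the responses, then assign each topic the closest label by
# cosine similarity. All data below are toy examples.
import numpy as np
from gensim.models import Word2Vec  # requires gensim >= 4 ("vector_size" was "size" before)

tokenized = [
    "the economy is doing well and jobs are coming back".split(),
    "his tweets are embarrassing and divisive".split(),
    "he has no character and just lies".split(),
    "strong on immigration and border security".split(),
]
# Top words per LDA topic, e.g. as returned by lda.show_topic() -- illustrative values
topic_top_words = [["economy", "jobs"], ["tweets", "divisive"], ["immigration", "border"]]
# Pre-defined labels we want to attach to the topics -- illustrative values
labels = ["economy", "character", "immigration"]

w2v = Word2Vec(sentences=tokenized, vector_size=50, window=5,
               min_count=1, epochs=100, seed=0)

def embed(words):
    # Average the word vectors of the words the model knows.
    vecs = [w2v.wv[w] for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for i, top_words in enumerate(topic_top_words):
    topic_vec = embed(top_words)
    best = max(labels, key=lambda lab: cosine(topic_vec, embed(lab.lower().split())))
    print(f"topic {i} {top_words} -> {best}")
```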

This approach not only automates the labeling process but also allows a topic model to be applied to survey data across multiple time periods to compare trends in open-ended responses. Using SurveyMonkey data, we apply this model to the same open-ended question on why respondents approve or disapprove of Donald Trump across two distinct time periods in order to analyze how the topics assigned to open-ended responses changed over time.


Natural Language Processing for Open-Ended Survey Questions

Dr Cong Ye (American Institutes for Research)
Dr Rebecca Medway (American Institutes for Research)
Ms Claire Kelley (American Institutes for Research) - Presenting Author

Would you consider “Certified Nursing Assistant” and “Cert. Nurse Asst.” to be the same qualification? If you ask a human, the answer is clearly yes, but if you ask a computer, your results may vary. When you add in spelling mistakes, abbreviations, and alternate names, the task becomes even more complicated. It may be obvious to you that “CNA” stands for Certified Nursing Assistant. It may also be clear to you that “Nures” is probably meant to be “Nurse”. However, creating an algorithm to handle these distinctions remains a challenge, and the problems are further compounded when dealing with tens or hundreds of thousands of responses. Automated understanding of natural language is therefore particularly relevant to survey researchers seeking to make use of open-ended survey responses.

Using data from a large national survey, we examine ways in which natural language processing can be used to classify open-ended survey responses. In this self-administered survey, respondents who reported having a certification or license were asked to answer an open-ended question about the field in which they were certified or licensed. In previous years, these responses have been manually coded into about 40 field codes. Our paper examines ways in which both supervised and unsupervised natural language processing methods can be used to potentially increase the efficiency of this categorization process.

Our first approach is to use data from previous years that have already been classified manually as a labeled data set from which supervised algorithms can learn the appropriate patterns. After basic preprocessing (including stemming and removal of punctuation), responses were transformed into document-term matrices, and regression tree methods were trained on the pre-classified data to create an automated classification mechanism. This method was implemented using single words as well as n-grams of size 2 and 3. While this supervised method is methodologically attractive because it makes use of an existing classification scheme, it also means that we must rely on the categories created by human classifiers rather than potentially discovering new categories based on patterns in the data.
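A minimal sketch of this kind of pipeline, assuming scikit-learn, might look as follows; the toy responses and field codes are invented, and a classification tree stands in for the authors' exact regression tree implementation.

```python
# Sketch of the supervised approach: hand-coded responses from previous years serve as
# labeled training data, responses become a document-term matrix (unigrams through
# trigrams here), and a tree classifier predicts the field code of new responses.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

train_text = ["certified nursing assistant", "cert nurse asst",
              "commercial drivers license", "cdl class a license"]
train_code = ["nursing", "nursing", "driving", "driving"]  # toy field codes

vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 3))
X_train = vectorizer.fit_transform(train_text)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, train_code)

new_text = ["licensed commercial driver"]
print(clf.predict(vectorizer.transform(new_text)))
```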

Another approach is to use clustering to detect patterns in certifications without the need for labelled training data. This approach is attractive because it can make use not only of the language itself but also of other background characteristics reported by the respondent. This allows us to create a “topic model” that may be sensitive to regional variation in certification names, variation in respondent background characteristics, and changes over time, among other things. By creating classification clusters, we are also potentially able to detect new occupational groupings that had not already been identified by researchers.
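A rough sketch of such a clustering step, assuming scikit-learn and invented toy data, is shown below; combining TF-IDF text features with background variables and using k-means are illustrative choices rather than the authors' method.

```python
# Sketch of the unsupervised alternative: text features from the certification field are
# combined with respondent background characteristics and clustered without labels.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

text = ["certified nursing assistant", "cert nurse asst",
        "commercial drivers license", "cdl class a"]
# Toy background characteristics, e.g. region code and years of experience
background = np.array([[1, 2.0], [1, 5.0], [3, 10.0], [2, 8.0]])

X_text = TfidfVectorizer().fit_transform(text)
X = hstack([X_text, csr_matrix(background)])   # text features plus background features

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)   # cluster assignment per respondent
```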

While there are some unique challenges to applying natural language processing to open-ended survey responses, including the need for large amounts of data to build strong models and for sufficient computing capacity to handle the volume of data collected, extending natural language processing techniques to this area offers promise both in efficiency and in the potential to uncover new patterns in the data.


Automatic Classification of Open-Ended Questions: Check-All-That-Apply Questions

In review process for the special issue

Professor Matthias Schonlau (University of Waterloo) - Presenting Author
Dr Hyukjun Gweon (University of Illinois)


Text data from open-ended questions in surveys are challenging to analyze and are often ignored. Open-ended questions are important, though, because they do not constrain respondents’ answer choices. Where open-ended questions are necessary, human coders often code the answers manually. When data sets are large, it is impractical or too costly to manually code all answer texts. Instead, text answers are converted into numerical variables, a statistical/machine learning algorithm is trained on a subset of manually coded data, and this model is used to predict the codes of the remainder.

In this paper we consider open-ended questions where the answers are coded into multiple labels. “Multi-label” is a computer science term; in the survey literature such questions are called check-all-that-apply questions, which allow respondents to make multiple choices. The open-ended question in our Swedish example asks about things that make the respondent happy, and respondents are explicitly told they may list multiple things that make them happy. By contrast, multiple-choice questions have a single outcome variable where one of several categories must be chosen.

To our knowledge, there is no published research on the use of multi-label algorithms to code answers to check-all-that-apply open-ended survey questions. Algorithms for multi-label data take into account the correlation among the answer choices and may therefore give better prediction results. For example, when giving examples of civil disobedience, respondents talking about “minor non-violent offenses” were also likely to talk about “crimes”. We compare the performance of multi-label algorithms (RAkEL and classifier chains, CC), using support vector machines as a base classifier, with binary relevance (BR), which applies a single-label algorithm to each label separately. Our performance metrics are average per-label error (“Hamming loss”) and “all labels are correctly predicted” vs. “at least one label is incorrect” (“0/1” loss). Performance is evaluated using 5-fold cross-validation on data from three open-ended questions: 1) things that make you happy (Swedish data), 2) the meaning of the word immigrant (German data), and 3) examples of civil disobedience (German/Spanish data).
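The following is a small sketch of the BR-versus-CC comparison, assuming scikit-learn and synthetic stand-in data rather than the coded survey answers; RAkEL lives in the separate scikit-multilearn package and is omitted, and a single train/test split replaces the 5-fold cross-validation for brevity.

```python
# Sketch: binary relevance (one SVM per label) versus a classifier chain, evaluated with
# Hamming loss and subset 0/1 loss. Features and labels are synthetic stand-ins for the
# document-term matrix and the manually coded answers.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.metrics import hamming_loss, zero_one_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                                           # stand-in text features
Y = (X[:, :3] + rng.normal(scale=0.5, size=(200, 3)) > 0).astype(int)    # three binary labels

X_train, X_test = X[:150], X[150:]
Y_train, Y_test = Y[:150], Y[150:]

br = MultiOutputClassifier(LinearSVC()).fit(X_train, Y_train)            # binary relevance
cc = ClassifierChain(LinearSVC(), random_state=0).fit(X_train, Y_train)  # classifier chain

for name, model in [("BR", br), ("CC", cc)]:
    pred = model.predict(X_test)
    print(name,
          "Hamming loss:", round(hamming_loss(Y_test, pred), 3),
          "0/1 loss:", round(zero_one_loss(Y_test, pred), 3))
```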

We find weak bivariate label correlations in the Swedish data (r <= 0.10), and stronger bivariate label correlations in the immigrant (r <= 0.36) and civil disobedience (r <= 0.46) data. For the data with stronger correlations, both multi-label methods performed substantially better than BR in terms of 0/1 loss, while the choice of method had little effect on Hamming loss. For the data with weak label correlations, we found no difference in performance between the multi-label methods and BR.

We conclude that automatic classification of open-ended questions that allow multiple answers may benefit from multi-label algorithms when 0/1 loss is the relevant metric. The degree of correlation among the labels may be a useful prognostic tool.


Democracy in Writing: Comparing the Meaning of Democracy in Open-Ended Survey Responses and in Big Online Text Data

Professor Jonas Linde (University of Bergen) - Presenting Author
Dr Stefan Dahlberg (University of Bergen and University of Gothenburg)
Dr Magnus Sahlgren (RISE SICS)

It has been argued that online text data, which can be used to systematically analyze people’s communicative behavior, could become a complement to traditional opinion polls and surveys. From this perspective, the application of language technology to online text data has provided social science scholars with innovative ways to deal with research questions that were previously unapproachable, and to address the changing landscape of public opinion research and research methodology. However, vast amounts of text data are not enough to answer research questions in social science, and problems of selection and representativeness cannot be solved by merely adding more data. So far, great effort has been put into the development of analytical tools, while less attention has been paid to the actual data. For whom are data collected from the Internet representative? To what extent are they comparable across countries, languages, and demographic groups?

Focusing on the representativeness of online text data, this paper presents an attempt to compare the meaning of the concept of democracy as it is used online and by ordinary citizens in open-ended responses from Norwegian and Swedish surveys based on country-representative samples. We thus use two types of data. First, we have collected responses to open-ended survey questions in which respondents are asked to write freely about what they associate with the word “democracy”. Second, we use text data, both social media and edited media, collected from the Internet. In order to compare the meaning of democracy in the survey data and the online data, we develop a conceptual framework consisting of eight categories, or dimensions, of democracy (community, ideology, principles, procedures, performance, condition, and actors). By applying this framework, we are able to investigate whether ordinary citizens emphasize other dimensions of democracy than those prevalent in the online text data.

In order to analyze the two types of text data we use a commercial tool called Gavagai Explorer, a tool for qualitative analysis of text data such as open-ended survey responses. The tool works by first clustering the terms used in the data and then grouping the texts according to the term clusters. The user can interactively refine the various clusters and assign labels to them.
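Gavagai Explorer itself is proprietary, so the sketch below only illustrates the general idea of clustering terms and then grouping texts by those clusters; the toy texts, term features, k-means, and the voting rule are all assumptions and not the tool's actual method.

```python
# Rough illustration: cluster the terms by their document co-occurrence profiles, then
# group each text by the term cluster it draws on most heavily.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

texts = ["democracy means free and fair elections",
         "democracy is freedom of speech and the press",
         "democracy as an ideology of the west",
         "liberal ideology and democracy go together"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(texts)                  # documents x terms

# Step 1: cluster the terms using their document co-occurrence profiles
term_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X.T.toarray())

# Step 2: assign each text to the term cluster it uses most
counts = X.toarray()
for text, row in zip(texts, counts):
    votes = np.bincount(term_clusters, weights=row, minlength=2)
    print(int(np.argmax(votes)), "<-", text)
```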

Initial analyses conducted on the Norwegian data reveal interesting differences in the meaning of democracy. We find differences between the citizen panel data and the online data. The most striking difference is that the framing of democracy in terms of ideology that we find in the online data is completely missing from the survey data (where notions of democracy in terms of principles are most frequent). We also observe differences within the online text data: in social media, democracy is most frequently discussed in terms of ideology, while democracy in terms of principles dominates in editorial media.