BigSurv18 program







Translations Across Nations: Exploring Natural Language Processing in Multicultural Applications

Chair: Dr Diana Zavala-Rojas (Universitat Pompeu Fabra)
Time: Saturday 27th October, 09:00 - 10:30
Room: 40.010

Creating Synergies Between Survey Research and Machine Learning: A Road Map for Applying Tools From Computational Linguistics in the Translation of Survey Questionnaires

Dr Diana Zavala-Rojas (Universitat Pompeu Fabra) - Presenting Author

We present the results of a research project conducted between 2016 and 2018 under the Synergies for Europe's Research Infrastructures in the Social Sciences (SERISS) project.

We investigate the feasibility of applying tools from computational linguistics and linguistic corpora to survey translation, i.e., the translation and translation-assessment processes of survey questionnaires. Computational linguistics is a broad interdisciplinary research field that applies statistical modelling and computational methods to natural language processing. After reviewing the state of the art of methods and tools in this discipline, i.e., the adoption of deep learning techniques for language modelling (algorithms that are not programmed for a single task but learn representations from data), we focus on one of its prominent applications, neural machine translation, and on the resources that have contributed most to its development: parallel corpora (multilingual databases of aligned texts) and translation memories.

Neural machine translation has changed the paradigm of current development in machine translation.
We then contextualise these applications for survey researchers, pointing out areas where windows of opportunity exist to incorporate tools that are commonly used in other areas of translation and that can potentially benefit the translation and translation assessment of survey questionnaires.

We argue that establishing a road map for applying approaches from computational linguistics in survey translation should start with a systematic assessment of how the adoption of translation technologies would affect the quality of the translation output, and the translation process and assessment workflow. We show that the assessment can be conducted along three dimensions: theoretical assessment, practical assessment, and evaluation of the feasibility of adaptation.


Country Comparative Surveys Using Word Embeddings

Dr Magnus Sahlgren (RISE SICS) - Presenting Author
Dr Stefan Dahlberg (University of Gothenburg)


The field of survey-based country comparative research has traditionally relied solely on survey questionnaires administered to controlled populations. One major obstacle to this methodology is the increasing difficulty and cost of finding sufficient numbers of representative respondents in the various languages. Decreasing response rates are not only devastating for the representativeness of the surveys in relation to a defined population; they have also forced researchers to minimize the length of the survey questionnaires, which has made it more difficult to measure complex concepts that ideally demand longer question batteries.

This paper explores a complementary approach to country comparative surveys that uses Natural Language Processing techniques applied to unsolicited data harvested from the Internet. Such data have the advantages of being contemporary, freely available, vast in amount, and available in many, if not most, of the languages covered in traditional country comparative studies. We use the web data to build representations of word-usage patterns in continuous vector-space models (so-called word embeddings). The word representations can be used to compile data-driven thesauri, which rank semantically related terms. By analyzing the semantic neighbors of a term, we can get a good understanding of how the term is used in the data.
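As an illustration only (the authors train embeddings on large-scale web data; the toy corpus, the count-based vectors, and the window size below are assumptions for the sketch), the neighbor-ranking idea can be approximated with simple co-occurrence vectors and cosine similarity:

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build count-based co-occurrence vectors: one sparse vector per
    word, indexed by the words it appears near."""
    vecs = defaultdict(Counter)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vecs[w][tokens[j]] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbors(vecs, term, n=3):
    """Rank other words by similarity of usage to `term` --
    a toy data-driven thesaurus."""
    sims = [(w, cosine(vecs[term], v)) for w, v in vecs.items() if w != term]
    return sorted(sims, key=lambda x: -x[1])[:n]

# Tiny invented corpus: "democracy" and "liberty" occur in
# near-identical contexts, so they end up as close neighbors.
corpus = [
    "citizens value democracy and freedom".split(),
    "citizens value liberty and freedom".split(),
    "markets value growth and trade".split(),
]
vecs = cooccurrence_vectors(corpus)
print(nearest_neighbors(vecs, "democracy"))
```

Real applications would replace the raw counts with trained dense embeddings, but the ranking step over the resulting vectors is the same.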

We suggest that such analyses can act as a complement to traditional survey methods in cases where the question concerns how citizens conceive of certain concepts, such as happiness, corruption, or the meaning of democracy. The research question in such comparative studies is often cast as a categorization problem, in which the task is to determine which of a predefined number of categories are more salient for citizens of different countries. As an example, a survey question about the satisfaction with democracy can concern different categories such as performance, principles, or actors. By categorizing the semantic neighbors of terms that represent the concepts in question (e.g. the term "democracy") with respect to these different categories, we arrive at a measurement of how prevalent these categories are in unsolicited language use in the various languages. An attractive byproduct of such analyses is that the categorized data can be used as training data for machine learning models that can automate the classification, which enables both large-scale and extremely resource-effective surveys.
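The categorization step described above reduces to counting how a term's semantic neighbors distribute over the predefined categories. A minimal sketch, in which the neighbor list and the category labels are invented for illustration (in practice they would come from the embedding model and from annotators or a trained classifier):

```python
from collections import Counter

def category_prevalence(neighbors, labels):
    """Given a list of semantic neighbors of a concept term and a
    category label for each, estimate how salient each predefined
    category is as a share of the labeled neighbors."""
    counts = Counter(labels[w] for w in neighbors if w in labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# Hypothetical neighbors of "democracy" in one language's web data,
# hand-labeled with the example categories from the abstract.
neighbors = ["election", "parliament", "freedom", "rights", "government"]
labels = {
    "election": "performance",
    "parliament": "actors",
    "government": "actors",
    "freedom": "principles",
    "rights": "principles",
}
print(category_prevalence(neighbors, labels))
```

Comparing these shares across languages is what yields the country-level measurement; the labeled neighbors can then double as training data for an automatic classifier, as the abstract notes.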

This paper introduces the general methodology, and provides examples of studies using survey items such as satisfaction with the working of democracy, the presence of corruption, and happiness. From these studies we are able to rank countries according to their average levels of democratic satisfaction, presence of corruption and happiness, but we still do not know what these concepts actually mean to people. By combining survey data with online language data, we are able to capture both conceptual meaning as well as overall differences in levels. We also discuss challenges with the proposed approach, such as how to find relevant, comparable, and representative online data, and how to account for typological differences when building word embeddings for different languages.


The Meaning of Democracy: Using a Distributional Semantic Lexicon to Collect Co-Occurrence Information From Online Data Across Languages

Ms Sofia Axelsson (Department of Political Science, University of Gothenburg) - Presenting Author
Dr Stefan Dahlberg (Department of Political Science, University of Gothenburg)

The literature on public support for democracy has revealed significant cross-country differences in people's attitudes towards democracy. Explanations for such variations can be found in the survey literature on "diffuse" versus "specific" support for democracy; whereas the former refers to support for democratic principles in a more abstract sense, and is generally found in consolidated democracies, the latter concerns more specific support for the political performance of democracies, which is more prevalent in new democracies (see Easton 1975; Norris 1999; Dahlberg & Holmberg 2012; Linde & Ekman 2003). Yet, how are we to know what democracy means to the people answering surveys, and thus be able to identify what they are expressing support for? Some scholars have further disentangled survey batteries in order to capture different notions of democracy among citizens living under different cultural and institutional settings, arguing that the concept of democracy becomes distorted in authoritarian settings (Welzel & Kirch 2017; Welzel 2013). However, the literature lacks systematic comparative studies outside the realm of surveys that take institutional, cultural and, importantly, linguistic variations into account.

Cross-cultural survey research rests upon the assumption that if survey features are kept constant to the maximum extent, data will remain comparable across languages, cultures and countries (Diamond 2010). Yet translating concepts across languages, cultures and political contexts is complicated by linguistic, cultural, normative or institutional discrepancies. Recognizing that language, culture and other social and political aspects affect survey results has been equated with “giving up on comparative research”, and, consequently, the most commonly used “solution” to equivalence problems has been for researchers to simply ignore the issue of comparability across languages, cultures and countries (Hoffmeyer-Zlotnik & Harkness 2005; King et al. 2004).

This paper contributes to the debate, using a distributional semantic lexicon, which is a statistical model for collecting co-occurrence information from large text data (Turney & Pantel 2010). The lexicon represents terms as vectors in a multi-dimensional context space, where relative similarity between vectors indicates similarity of usage, which is often equated with semantic similarity. The method is motivated by a structuralist meaning theory known as the "distributional hypothesis", which states that words with similar meanings tend to occur in similar contexts, and that the contexts shape and define the meanings of the words (Sahlgren 2006). Compared to other methodological approaches aimed at identifying and measuring cross-cultural discrepancies, this approach has the advantage of enabling us to analyze how the concept of democracy is used in its "natural habitat" (Wittgenstein 1958). Collecting geo-tagged language data from editorial and social online media thus allows us to explore the variation in understandings of democracy across different languages and countries, and to map the ways in which democracy is used among populations and societies worldwide, also across different institutional settings and regime types.


Lost in Translation - How Differences in Word Intensity Affect Citizens' Satisfaction With the Working of Democracy

Dr Stefan Dahlberg (Department of Political Science, University of Bergen) - Presenting Author
Dr Magnus Sahlgren (RISE SICS)
Professor Jonas Linde (Department of Political Science, University of Bergen)

Survey-based country comparative research rests on the assumption that all languages have words that are cross-culturally and semantically equivalent, despite the obvious problems associated with this assumption. This paper discusses a specific aspect of survey translation that concerns word intensity. We do this by focusing on one specific survey question, citizens' satisfaction with the working of democracy, taken from modules 3 and 4 of the Comparative Study of Electoral Systems (CSES) dataset. We use social and editorial media data from 49 different language-country units, for which we have measured term frequency and semantic diversity using the translated terms for the word "satisfied" taken from the original CSES questionnaires for all participating countries.

Word intensity has been studied in the field of Natural Language Processing in the context of Sentiment Analysis, where attitudinally loaded terms are used to gauge the sentiment towards specific targets of interest. In its simplest form, Sentiment Analysis is performed by counting occurrences of sentiment terms in the data. It has been shown that such a simple lexical approach can be improved by proper weighting of the sentiment terms, which essentially quantifies the importance of a term for a specific sentiment class. As an example, it seems intuitive to assume that a term such as "excellent" should have a higher impact on a positive sentiment score than a term such as "ok", due to the higher attitudinal intensity of the former word. In the same way, we will probably get different survey results if we ask respondents whether they are "happy", "satisfied", or "content" with democracy. Recent evidence from Denmark suggests that Denmark's consistently high levels of self-reported happiness and life satisfaction are partly due to language effects (Lolle and Goul Andersen 2013; 2015). It is "easier" to be satisfied in Danish than in English, while it requires something "more" to be happy in Danish than in English.
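The weighted lexical approach described above can be sketched in a few lines. The lexicon and its weights below are invented for illustration; production systems derive them from annotated data:

```python
def sentiment_score(tokens, lexicon):
    """Lexical sentiment analysis: sum the signed weights of any
    sentiment terms found in the text. Words not in the lexicon
    contribute nothing."""
    return sum(lexicon.get(t, 0.0) for t in tokens)

# Hypothetical weights: "excellent" carries more attitudinal
# intensity than "ok", as the abstract argues.
lexicon = {"excellent": 2.0, "good": 1.0, "ok": 0.3, "bad": -1.0}

print(sentiment_score("the debate was excellent".split(), lexicon))
print(sentiment_score("the debate was ok".split(), lexicon))
```

An unweighted count would score both sentences identically (one sentiment term each); the weights are what encode the intensity difference.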

The question is: how can we quantify a term's intensity? Our starting point is the "semantic bleaching" hypothesis, which states that the semantic richness (in our case, attitudinal intensity) of a word correlates inversely with its frequency and with the number of possible meanings of the word. That is, the higher a word's frequency, and the more possible meanings it can have, the less intense (or semantically rich) it is. This hypothesis suggests a very simple way to measure and compare the attitudinal intensity of words across languages: we simply compare their relative frequencies, and some measure of semantic diversity, in their respective languages.
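A minimal sketch of the two proposed proxies, assuming a tokenized corpus. The toy sentence is invented, and context-word entropy is used here as one possible operationalization of semantic diversity (the abstract does not commit to a specific measure):

```python
import math
from collections import Counter

def relative_frequency(term, tokens):
    """Share of corpus tokens that are `term`."""
    return tokens.count(term) / len(tokens)

def context_entropy(term, tokens, window=1):
    """Semantic-diversity proxy: Shannon entropy of the words seen
    immediately around `term` -- more distinct contexts, higher entropy."""
    contexts = Counter()
    for i, t in enumerate(tokens):
        if t == term:
            for j in (i - window, i + window):
                if 0 <= j < len(tokens):
                    contexts[tokens[j]] += 1
    total = sum(contexts.values())
    return -sum((c / total) * math.log2(c / total) for c in contexts.values())

# Under the semantic-bleaching hypothesis, a frequent word used in many
# different contexts ("ok" here) is less intense than a rarer, more
# specialized one ("thrilled").
tokens = ("it was ok and the food was ok but the service was ok "
          "while she felt thrilled").split()
print(relative_frequency("ok", tokens), relative_frequency("thrilled", tokens))
print(context_entropy("ok", tokens) > context_entropy("thrilled", tokens))
```

Comparing these quantities for the translations of "satisfied" across languages is then a matter of computing them on comparable corpora per language.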

When it comes to satisfaction with the working of democracy, the results indicate that word intensity, in terms of term frequency and semantic diversity, matters. We find a non-negligible positive correlation between our intensity measures and satisfaction with democracy across countries, one that we are unable to correct for by including various control variables at different levels. The language of survey administration still matters.