BigSurv18 program







Posters 2 (actively presented from 10.30 to 11.00 and 15.30 to 16.00)

Chair: Dr Antje Kirchner (RTI)
Time: Saturday 27th October, 08:30 - 17:30
Room: 30.S02 S. Expo

These posters are the result of the Barcelona Dades Obertes Data Challenge organized by the city of Barcelona. For more information on the institutions and the data challenge please see:

Poster Title | High School
Investigating Complaints in Gracia | Institut Vila de Gràcia
Social Cohesion and Type of Neighborhood | Institut Ferran Tallada
Free WI-FI Points in Barcelona | Institut Juan Manuel Zafra
Access to Housing in Barcelona | Institut Joan Brossa
A Study of Traffic Accidents in Barcelona | Institut J. Serrat i Bonastre
WI-FI Points | Institut Josep Comas i Solà

Indirect Sampling Applied to Dual Frames

Professor Manuela Maia (Católica Porto Business School) - Presenting Author

Under-coverage of sampling frames is one of the main problems in survey sampling. To deal with this problem we applied a strategy called Multiple Frames, which consists of combining several frames in order to provide complete or nearly complete coverage of the target population. In most cases, however, the frames overlap, which complicates the computation of sample weights for estimation. To overcome this problem, Lavallée (2007) proposed an alternative method, called Indirect Sampling, to deal with the effect of overlapping sampling frames on survey estimates (Figure 1). To select the necessary samples probabilistically, this approach relies on the available sampling frames (i.e., lists of units meant to represent the target populations). Unfortunately, there may be no sampling frame for the desired target population. In that case, we can choose a sampling frame that is indirectly related to this target population. More precisely, we can use two populations, UA and UB, that are related to each other. If we want to produce an estimate for UB but only have a sampling frame for UA, we can select a sample from UA and produce an estimate for UB using the links between the two populations.

In this paper, the classical estimators of multiple frames sampling, the Domain Membership estimator and the Unit Multiplicity estimator, are adapted to the context of indirect sampling. Additionally, the Optimal Deville and Lavallée estimator is translated to the context of multiple frames surveys.

Using the Optimal Deville and Lavallée estimator, we deduce a new class of indirect sampling estimators that can be applied to multiple frames surveys, and more specifically to dual frame surveys, where we find eight different kinds of links.
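
As a rough illustration of how links between UA and UB can carry weights from one population to the other, the following Python sketch implements a simplified weight share estimator: each sampled UA unit passes its design weight to the UB units it is linked to, divided by the total number of links each UB unit receives from the whole of UA. This is a minimal sketch and not the estimators derived in the paper; the function name, the unit labels and the toy values are hypothetical, and it assumes every UB unit has at least one link and that links are unweighted.

```python
from itertools import combinations


def weight_share_total(sample_a, incl_prob, links, y_b):
    """Estimate the total of y over U^B from a sample drawn in U^A.

    sample_a  : iterable of sampled U^A unit labels
    incl_prob : dict, U^A label -> inclusion probability
    links     : dict, U^A label -> list of linked U^B labels (whole frame)
    y_b       : dict, U^B label -> value of the study variable
    """
    # Total number of links each U^B unit receives from all of U^A.
    total_links = {k: 0 for k in y_b}
    for linked_units in links.values():
        for k in linked_units:
            total_links[k] += 1

    estimate = 0.0
    for j in sample_a:
        design_weight = 1.0 / incl_prob[j]
        for k in links[j]:
            # Share the design weight over all links pointing at unit k.
            estimate += design_weight * y_b[k] / total_links[k]
    return estimate


# Hypothetical toy populations: three U^A units, three U^B units.
links = {"a1": ["b1", "b2"], "a2": ["b2"], "a3": ["b3"]}
y_b = {"b1": 10.0, "b2": 20.0, "b3": 30.0}

# Simple random sampling of 2 out of 3 units: inclusion probability 2/3.
incl_prob = {j: 2.0 / 3.0 for j in links}
estimates = [
    weight_share_total(s, incl_prob, links, y_b)
    for s in combinations(sorted(links), 2)
]
average_estimate = sum(estimates) / len(estimates)
print(average_estimate, sum(y_b.values()))
```

Averaging over all possible samples reproduces the true UB total in this toy case, which is the property that makes the overlap between frames tractable: a UB unit reachable through several UA units is not counted more than once in expectation.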


Estimation of Selection Error and Bias in Internet Data Sources by Linking With Register Data

Dr Maciej Beręsewicz (Poznań University of Economics and Business / Statistical Office in Poznań) - Presenting Author

Big data and the Internet as a data source have become an important issue in statistics, particularly in official statistics. There are several multinational initiatives (e.g. ESSnet on Big Data) that focus on the quality and suitability of estimates based on new data sources to complement or supplement existing statistical information. However, before these data can be used for official statistics, it is essential to identify potential sources of non-representativeness.

In situations where there is no information about coverage or selection bias, one can try to detect and estimate bias by comparing new data sources with auxiliary sources already used in statistics, i.e. surveys or registers. Data can be linked at the level of units or domains to investigate bias and to obtain information about the selection mechanism and about items that are not advertised online. It is crucial to discover the underlying selectivity because it may be linked to the effects of non-response and will call for specific methods of dealing with it.

Therefore, in the article we describe an attempt to link Internet data sources with register data at the unit level in order to detect selection error and estimate selection bias. The study will focus on residential properties offered by two leading online advertisement services in Poland, as well as those listed in the Register of Real Estate Prices and Values, which contains all sold properties that have an established ownership. We believe that this independent source containing a different but related population to that advertised online will provide information about properties that are not advertised online but are put up for sale.
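
The idea of unit-level linkage can be sketched as follows: once register records are flagged as linked or not linked to an online advertisement, the difference between the mean of a variable among linked records and its mean over the whole register gives a simple estimate of selection bias. This is a minimal sketch under stated assumptions, not the authors' procedure; the field names (`id`, `price_per_m2`), the function name and the toy records are hypothetical.

```python
from statistics import mean


def selection_summary(register, advertised_ids, variable):
    """Compare register units found online with the full register.

    register       : list of dicts, one per register record
    advertised_ids : set of register ids that were linked to an online advert
    variable       : name of the numeric field to compare
    Assumes at least one register record was linked.
    """
    linked = [rec[variable] for rec in register if rec["id"] in advertised_ids]
    everything = [rec[variable] for rec in register]
    return {
        # Share of register units that also appear online.
        "linked_share": len(linked) / len(everything),
        # Mean among linked units minus mean over the whole register.
        "bias": mean(linked) - mean(everything),
    }


# Hypothetical toy register with four sold properties.
register = [
    {"id": 1, "price_per_m2": 100.0},
    {"id": 2, "price_per_m2": 200.0},
    {"id": 3, "price_per_m2": 300.0},
    {"id": 4, "price_per_m2": 400.0},
]
summary = selection_summary(register, {3, 4}, "price_per_m2")
print(summary)
```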

This research has been supported by the National Science Center, Preludium 7 grant no. 2014/13/N/HS4/02999.


Old Problems, New Approaches: The Appearance of Suicide and Depression in the Online Social Media - A Study of Instagram

Dr Júlia Koltai (Hungarian Academy of Sciences)
Dr Zoltán Kmetty (Eötvös Loránd University, Budapest) - Presenting Author
Dr Károly Bozsonyi (Károli Gáspár University, Budapest)

Research on suicide in the social sciences goes back to Durkheim’s famous book, and scientific interest in the phenomenon has not declined since then. The increasing suicide rate in some developed countries, such as the US, makes this topic even more important. There is a specialized literature explaining the different spatial and temporal trends in suicides. It is well known in suicide research that the incidence of suicide follows a clear seasonal pattern. There are minor differences between countries, but the main trend is that the suicide rate is high in the spring and summer, and low in the winter. There is also a strong weekly pattern: the risk of suicide is high on Mondays and much lower on weekends.

In recent years, many studies have been published on how suicide can be studied using Big Data and social media data, but only a few have focused on the spatial and temporal appearance of suicide in digital data.
In our paper we address this research gap and try to answer the following question: Is it possible that a general negative social climate exists which, on the one hand, manifests itself in the number of suicides committed and, on the other hand, appears in the content of posts, messages and hashtags in social media?

In our analysis, we will focus on the time trends of suicide-related expressions and compare them with the well-researched and well-documented time trends of committed suicides. We do not necessarily suppose that the people who commit suicide are the same people who show negative emotions on social media; we only hypothesise that there is a general social and emotional climate that changes in space and time. This climate affects both social media posts and comments and the number of committed suicides. Therefore, we do not think that the correlation between the two phenomena implies a cause-and-effect relationship; instead, we apply a causal scheme in which a common cause drives both the emotions that appear in social media data and the number of suicides.

To make sure that the observed trends are not just the trends of social network activity, we will normalize our data with the estimated activity of the online social network for the examined period. We will use Instagram data of a given time period for the analysis focusing on English language posts and hashtags.
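
The normalization step can be sketched in a few lines of Python: counts of suicide-related posts per period are divided by the estimated overall activity in the same period, so that busy days do not automatically look like high-risk days. This is a minimal sketch; the function name, the per-1,000 scaling and the toy counts are hypothetical and do not come from the study.

```python
def normalised_rates(keyword_counts, activity_counts, per=1000):
    """Express keyword posts per `per` posts of overall activity.

    keyword_counts  : dict, period -> number of suicide-related posts
    activity_counts : dict, period -> estimated total number of posts
    Periods with no recorded activity are skipped.
    """
    rates = {}
    for period, total in activity_counts.items():
        if total > 0:
            rates[period] = per * keyword_counts.get(period, 0) / total
    return rates


# Hypothetical counts: the same raw number of posts on both days,
# but twice as much overall activity on Saturday.
keyword_counts = {"Mon": 30, "Sat": 30}
activity_counts = {"Mon": 10000, "Sat": 20000}
rates = normalised_rates(keyword_counts, activity_counts)
print(rates)
```

In this toy case the raw counts are identical, while the normalized rate on Monday is twice the Saturday rate, which is the kind of weekly contrast the comparison with suicide statistics relies on.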


Merits and Limits of Measuring the Total Acceleration of Smartphones in Mobile Web Surveys Using SurveyMotion

Mr Stephan Schlosser (University of Göttingen) - Presenting Author
Mr Jan Karem Höhne (University of Göttingen)

In recent years, the use of mobile devices, such as smartphones, to respond to web surveys has increased markedly. The reasons for this trend seem to be twofold: First, the number of people who own a smartphone has increased. Second, access to high-speed mobile Internet has increased. This technology-driven trend in survey responding allows researchers to collect JavaScript-based paradata that can be used to describe respondents’ response behavior (e.g., response times, scrolling events, screen taps, and the in/activity of web survey pages). Smartphones also have a large number of built-in sensors (e.g., accelerometers and gyroscopes), all of which collect data that recognize user actions. Similar to other types of paradata, sensor data can be passively collected by means of JavaScript and provide information about respondents’ physiological states (e.g., movement and speed).

We now propose “SurveyMotion (SM),” a JavaScript-based tool for measuring the motion level of mobile devices, in general, and smartphones, in particular, to explore completion conditions and to draw conclusions about the context of mobile web survey completion. Technically speaking, SM gathers the total acceleration of mobile devices. We conducted a usability study with n = 1,452 smartphone respondents to explore the technical potentials of measuring acceleration in mobile web surveys. The study contains data from 29 different smartphone manufacturers, 208 different smartphone models, and 13 different Internet browsers.
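
To make the measured quantity concrete, the sketch below computes a total acceleration value from three-axis accelerometer readings, assuming that "total acceleration" means the Euclidean norm of the x, y and z components (the abstract does not state the formula). SurveyMotion itself is JavaScript-based and would typically read such values from the browser's motion events; the Python below only illustrates the arithmetic, and the function names and sample readings are hypothetical.

```python
import math


def total_acceleration(ax, ay, az):
    """Euclidean norm of the three axis readings (assumed definition)."""
    return math.sqrt(ax * ax + ay * ay + az * az)


def mean_motion_level(readings):
    """Average total acceleration over a list of (ax, ay, az) tuples."""
    values = [total_acceleration(*reading) for reading in readings]
    return sum(values) / len(values)


# Hypothetical readings in m/s^2 collected while a survey page is open.
readings = [(3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
print(mean_motion_level(readings))
```

A single summary value per survey page, such as the mean above, is one simple way to compare completion conditions across respondents.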

The data analyses reveal that acceleration could not be gathered for only approx. 3% of the mobile respondents. A closer look at the user-agent strings of these respondents sheds light on the matter. In sum, three reasons for unsuccessful measurement could be identified: deactivated JavaScript, device-related issues (e.g., comparatively old or low-budget devices), and browser-related issues (e.g., comparatively old browser versions).

All in all, it seems that the collection of JavaScript-based sensor data (i.e., acceleration) in mobile survey research is an achievable and promising new way to research respondents’ response behavior and completion conditions.


Supplementing Probability-Based Surveys With Nonprobability Surveys to Reduce Survey Errors and Survey Costs

Dr Joseph Sakshaug (German Institute for Employment Research) - Presenting Author
Dr Arkadiusz Wisniowski (University of Manchester)
Mr Diego Perez-Ruiz (University of Manchester)
Dr Annelies Blom (University of Mannheim)

Scientific surveys based on random probability samples are ubiquitously used in the social sciences to study and describe large populations. They provide a critical source of quantifiable information used by governments and policy-makers to make informed decisions. However, probability-based surveys are increasingly expensive to carry out and declining response rates observed over recent decades have necessitated costly strategies to raise them. Consequently, many survey organizations have shifted away from probability sampling in favor of cheaper non-probability sampling based on volunteer web panels. This practice has provoked significant controversy and scepticism over the representativeness and usefulness of non-probability samples. While probability-based surveys have their own representativeness concerns, comparison studies generally show that they are more representative than non-probability surveys. Hence, the survey research industry is in a situation where probability sampling is the preferred choice from an error perspective, while non-probability sampling is preferred from a cost perspective.

Given the advantages of both sampling schemes, it makes sense to devise a strategy to combine them in a way that is beneficial from both a cost and error perspective. We examine this notion by evaluating a method of integrating probability and non-probability samples under a Bayesian inferential framework. The method is designed to utilize information from a non-probability sample to inform estimates based on a parallel probability sample. The method is evaluated through a real-data application involving two probability and eight non-probability surveys that fielded the same questionnaire simultaneously. We show that the method reduces the variance and mean-squared error (MSE) of a variety of survey estimates, with only small increases in bias, relative to estimates derived under probability-only sampling. The MSE/variance efficiency gains are most prominent when a small probability sample is supplemented by a larger non-probability sample. Using actual cost data we show that the Bayesian data integration method can produce cost savings for a fixed amount of error relative to a standard probability-only approach.
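
The general mechanism, a non-probability sample informing the prior for an estimate from a probability sample, can be illustrated with a textbook normal-normal conjugate update. This is a deliberately simplified sketch and not the model evaluated in the paper: the function name, the choice of a normal prior and all numbers are hypothetical.

```python
def posterior_normal(prior_mean, prior_var, sample_mean, sample_var_of_mean):
    """Combine a prior and a sample estimate by precision weighting.

    prior_mean, prior_var           : prior built from the non-probability sample
    sample_mean, sample_var_of_mean : estimate and its sampling variance
                                      from the probability sample
    Returns the posterior mean and posterior variance.
    """
    prior_precision = 1.0 / prior_var
    data_precision = 1.0 / sample_var_of_mean
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (
        prior_precision * prior_mean + data_precision * sample_mean
    )
    return post_mean, post_var


# Hypothetical inputs: a vague prior from a non-probability panel and
# an estimate from a small probability sample.
post_mean, post_var = posterior_normal(
    prior_mean=0.50, prior_var=0.04,
    sample_mean=0.40, sample_var_of_mean=0.01,
)
print(post_mean, post_var)
```

The posterior variance is smaller than the variance of the probability-only estimate, while the posterior mean is pulled towards the prior. That trade of a small bias for a lower variance is the pattern the abstract describes.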


Exploring Random Respondent Matching With Simulated Multi-Wave Survey Data

Mrs Angela Ulrich (D3 Systems) - Presenting Author
Mr David Peng (D3 Systems)
Mr Ethan Beaman (D3 Systems)

Surveys provide important population insights. However, are large-scale multi-wave surveys potentially reaching the same respondents? Privacy for respondents is an important aspect of surveys, but with such privacy protections and with technological limitations in certain countries, the question remains: whom are we reaching? We assume that, with a large enough sample size and an appropriate design, the opinions collected reflect the overall opinions of the population. However, are we collecting the same opinions wave after wave, and thereby limiting our coverage? Usually, we look at similar responses within surveys, but now we want to look at similarities across waves of the same stream. We will explore simulated data based on real data and examine potential matching techniques. While this work remains exploratory, the findings could be used to monitor interviewer fieldwork across multiple projects, potentially providing another way to assess the data quality of interviewers.
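
One very simple matching technique is to compare every respondent in one wave with every respondent in the next and flag pairs whose answers agree on nearly all items. The abstract does not say which techniques are explored, so the sketch below is only one possible starting point; the function names, the 0.9 threshold and the toy answer vectors are hypothetical.

```python
def share_identical(answers_a, answers_b):
    """Share of items on which two respondents gave the same answer."""
    matches = sum(1 for a, b in zip(answers_a, answers_b) if a == b)
    return matches / len(answers_a)


def likely_repeat_respondents(wave1, wave2, threshold=0.9):
    """Return (id_wave1, id_wave2, share) for highly similar pairs.

    wave1, wave2 : dict, respondent id -> list of answers to the same items
    """
    pairs = []
    for id1, answers1 in wave1.items():
        for id2, answers2 in wave2.items():
            share = share_identical(answers1, answers2)
            if share >= threshold:
                pairs.append((id1, id2, share))
    return pairs


# Hypothetical simulated waves with five shared items each.
wave1 = {"w1_001": [1, 2, 3, 4, 5], "w1_002": [5, 5, 1, 2, 2]}
wave2 = {"w2_001": [1, 2, 3, 4, 5], "w2_002": [2, 1, 4, 4, 3]}
pairs = likely_repeat_respondents(wave1, wave2)
print(pairs)
```

With real data the threshold would need to reflect how often two different people give identical answers by chance, which depends on the number of items and their response distributions.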


Reimagining Survey Research: Transforming a Traditional Survey Program Through Advanced Analytics

Mr Ryan Cristal (Peace Corps) - Presenting Author


While the utilization of advanced analytics to inform decision-making is well established in the private sector and at larger public sector/non-profit organizations, the implementation of these capabilities is at a nascent stage in many smaller institutions. There are multiple reasons for this lag, relating to both technical capacity and organizational will.

Run by the US government, the Peace Corps volunteer program has sent over 230,000 Americans to 141 countries to provide technical assistance and foster intercultural understanding. While global in scope, the Peace Corps is a relatively small US government agency with limited resources devoted to data collection and institutional research. Despite these challenges, the Peace Corps has, in a short period of time, significantly improved the data quality and analytic utility of its survey insights program through the judicious application of data linkage and quantitative modeling techniques.

Transforming the Peace Corps survey program required careful cultivation of organizational support by demonstrating the value and promise of new techniques. This presentation will discuss the conditions that made this initiative more likely to succeed, the challenges faced, and the analyses produced. Prior to this, several necessary pieces were already in place: an institutional leadership with a strategic mandate to acculturate evidence-based decision making within the organization, an established large-N annual survey program, and multiple sources of administrative and program performance data.

Building upon this base, there were opportunities to begin socializing the organization to the possibilities of treating multiple data sources holistically and leveraging advanced quantitative approaches to decision-making analysis. One strategy was to utilize what was already available in order to generate new insights about the organization and actionable recommendations for improvement. In this case, multivariate regression modeling of large-N survey data was employed to investigate drivers of Peace Corps Volunteer effectiveness, an approach adapted from program effectiveness and return-on-investment analysis techniques more common in the for-profit context.

Concurrently, approval was secured for technical improvements to survey methodology, for example, using administrative data capabilities to directly build and administer survey distribution frames. Care was taken to proactively expose stakeholders to the immediate benefits. Improved data quality and analytic value were showcased, with an emphasis on organizational business issues that could not be addressed without respondent-level linkage of survey data with administrative, demographic, training, programmatic and other performance data.

The progress made in reinvigorating the Peace Corps’ survey program has produced higher data quality, fresh insight, and greater stakeholder confidence, setting the stage for further capabilities growth. The presentation will close with current efforts to use modeling analytics to explain and predict survey non-response, to identify and address agency metric deficiencies, and to enable future analytic decision-making initiatives.


Calibrating Key Performance Indicators for an Eye Tracking Attention Panel

Ms Emelie Löfdahl (Tobii Pro)
Mrs Karin Nelsson (Inizio) - Presenting Author

Eye tracking is a unique method for objectively measuring consumers' attention and spontaneous responses to marketing messages. A growing trend in market research is to capture subconscious and unbiased data through implicit methods. Eye tracking is among the most effective of these techniques.

Eye tracking allows the market researcher to study how consumers react to different marketing messages and to understand their cognitive engagement. It mitigates recall errors and social desirability bias while revealing information that conventional research methods normally miss. Eye tracking studies also close the gap between in-screen and action (e.g. click, completed view). By studying whether an ad is actually seen or ignored, the advertiser is able to understand how best to communicate with the busy consumer.

Eye tracking can help market researchers to discover, for example:

- What marketing elements capture the eye of the consumer?
- Is your advertising getting attention?
- How effective is the advertisement?
- What does the consumer journey look like?

Tobii is the global leader in eye tracking, with a vision of a world where all technology works in harmony with natural human behavior. Inizio is a research company with a digital focus that helps its clients recruit, build and maintain high-quality panels for research purposes. In this case, Inizio is helping Tobii recruit members for the Tobii attention panel. This paper will be presented jointly by the two companies to show how sampling and calibration using census data can validate big data generated from eye tracking, what opportunities a passive panel equipped with eye trackers offers, and how data can be combined and used to validate users' online behaviour.

Tobii's Attention Panel is based on participants equipped with eye trackers mounted on their computers, and all data is passively gathered. The panel provides continuous real-time data representing a panel member's online behavior and ad attention. The frame is determined by the technological limitations of the current devices. Currently, the panel has more than 500 members and is growing, with a goal of reaching 1,500 panel members in Sweden during 2018.

The data is currently used by media houses and media buyers in Sweden, and the aim is to expand globally soon. Important key performance indicators are ad fixation and fixation time. Whether a user fixates their gaze on the ad and, if so, for how long is critical information for understanding the effect of advertising, especially brand advertising, as there are no relevant efficiency metrics available on the market today.

In this paper Inizio and Tobii aim to present:

- The challenges and solutions of building a panel equipped with hardware (in this case eye trackers).
- How to work with user on-boarding.
- How to calibrate and weight the panel.
- How to handle a passive panel and what the differences are compared with a traditional, active, panel in terms of results and insights.
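
As an illustration of the calibration point in the list above, the sketch below computes simple post-stratification weights: each panel member in a demographic cell receives the ratio of the cell's census share to its panel share. The abstract does not specify the weighting method, so this is an assumed, minimal approach; the function name, the age cells and all numbers are hypothetical.

```python
def poststratification_weights(panel_counts, census_shares):
    """Weight per panel member in each cell: census share / panel share.

    panel_counts  : dict, cell -> number of panel members in the cell
    census_shares : dict, cell -> population share of the cell (sums to 1)
    Assumes every cell has at least one panel member.
    """
    panel_size = sum(panel_counts.values())
    weights = {}
    for cell, count in panel_counts.items():
        panel_share = count / panel_size
        weights[cell] = census_shares[cell] / panel_share
    return weights


# Hypothetical panel that over-represents the younger age group.
panel_counts = {"18-39": 300, "40+": 200}
census_shares = {"18-39": 0.4, "40+": 0.6}
weights = poststratification_weights(panel_counts, census_shares)
print(weights)
```

After weighting, the weighted cell sizes match the census distribution while the total weighted panel size stays equal to the number of members.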


The Classification of Comments About Mobile Phones in the Online Shops

Mrs Natalia Kharchenko (Kiev International Institute of Sociology) - Presenting Author
Mrs Maryna Shpiker (National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”)
Mr Oleksandr Pereverziev (National University of "Kyiv-Mohyla Academy")

Online purchases of gadgets are increasing each year. Many usage and attitude surveys are carried out to learn people's views about the products. However, much information about the usage of gadgets is also available online through discussions on thematic websites, social media, and other resources. The aim of this research is to identify what kind of information about consumer preferences and behavior can be obtained from an open source, namely the comments about products in an online store. The following aspects of the issue were investigated:

- The possibility of obtaining an accurate comparative assessment of the product from the customers' point of view by automatically classifying comments about the product as positive or negative;
- The reliability of identifying high-priority product selection criteria formulated by customers in natural language on the website, as compared with online survey data;
- The ratio of the rational and emotional components of the customer's choice identified on the basis of the comments, as compared with online survey data.

We collected 53,500 user comments from the website of an online shop in the category "Mobile Phones", each record comprising product name, product category, date, nickname, and the comment itself. On this basis, a classification model was created that determines the positive/negative modality and the characteristics of goods mentioned in a particular comment. We used the R software environment for statistical computing and the RTextTools package for natural language processing and text classification. To compare the results obtained by different methods, we conducted an online survey of customers who had purchased a mobile phone in the last 12 months, asking about important selection criteria, the priority of these criteria, their experience of leaving comments in online shops, etc.

The classification model classifies the positive/negative modality of comments about the different brands of mobile phones left on the website of the online shop with a precision of about 55%. The lack of precision can be partially attributed to the fact that many comments are not value judgments (for example, questions or neutral information about the purchase) or contain both positive and negative statements. Methods for improving the accuracy of modality estimation have been proposed.
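
To make the classification and evaluation steps concrete, the following sketch trains a small bag-of-words naive Bayes classifier on labelled comments and computes precision for the positive class. The study used R with RTextTools; this Python version is a stand-in for illustration only, and the function names, the choice of naive Bayes and the toy comments are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_comments):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_comments:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                smoothed = word_counts[label][word] + 1
                score += math.log(smoothed / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision(predicted, actual, positive_label):
    """Share of comments predicted positive that really are positive."""
    flagged = [a for p, a in zip(predicted, actual) if p == positive_label]
    return sum(1 for a in flagged if a == positive_label) / len(flagged)


# Hypothetical labelled comments.
training = [
    ("great battery and screen", "positive"),
    ("love this phone great camera", "positive"),
    ("terrible battery bad screen", "negative"),
    ("bad phone broke quickly", "negative"),
]
model = train_naive_bayes(training)
print(classify("great camera", model), classify("bad screen", model))
```

Comments that are questions or mix praise and criticism have no clear label in such a two-class setup, which is consistent with the explanation given above for the modest precision.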

In addition, we identified the product characteristics that are important to consumers when choosing a mobile phone or evaluating its performance. These characteristics were then grouped into rational and emotional groups, and their representation in the data set of comments was evaluated. The results were verified using data from the online survey of consumers.

The proposed approach makes it possible to obtain up-to-date information about consumer preferences and underlying selection criteria in a faster, cheaper and automated way on the basis of public data generated by users themselves. It can be used to classify online feedback on other groups of products, and it can be applied as a supplement to or replacement for traditional marketing research.


The Generations & Gender Survey: The Future of a Cross-National Survey Online

Dr Thomas Emery (NIDI) - Presenting Author

The Generations & Gender Survey (GGS) is a large cross-national survey focusing on demographic behavior and family dynamics (www.ggp-i.org), and in 2020 a new round of the survey will be conducted in countries across the world. In preparation for this, the GGS has run a field experiment utilizing a push-to-web (P2W) framework, which encourages users to fill out the GGS online (CAWI) rather than through a face-to-face (F2F) interview. The experiment includes several sub-experiments designed to test the optimal parameters of a P2W approach. These include a test of incentive levels, a test of reminder strategies and a test of invitation letters in a random route setting. Furthermore, these field experiments offer an opportunity to examine not only differences in response rates and patterns between F2F and CAWI, but also how CAWI can be leveraged to allow for cross-pollination with other web data such as paradata and general web activity. The field experiment includes data from 1,000 respondents in each of Croatia, Germany and Portugal (3,000 total), and the paper presents the design and results of the field experiment.
The centralized CAPI and CAWI system also opens up the possibility of integrating a number of different types of data with the traditional Generations and Gender Survey to better explore specific research areas such as internet usage and its role in shaping respondents' family life. Specifically, the centralized system allows more detailed paradata to be collected on how individuals answer the survey (Callegaro 2013). This includes not only timing and keystroke information, but also information such as the device and browser used and the geo-location of respondents. Taken together, the paradata of a respondent and the type of device provide insights into what type of web user the respondent is, and the geo-coding of respondents’ residence can be used to derive local internet speeds, providing a multi-dimensional view of respondents (Cesare et al. 2016). Using the data from the pilot, we seek to explore the ability of such data to provide insights over and above the responses in the questionnaire itself.

This paper will present the preliminary findings from the field experiment and their implications for the new round of data collection in 2020. Given the cross-national nature of both the GGP and the experiment itself, the results presented will be of broad interest to survey researchers as we adapt to a changing data environment.


Predicting Political Behavior and Attitudes Using Digital Trace Data

Dr Ruben Bach (University of Mannheim) - Presenting Author
Dr Christoph Kern (University of Mannheim)
Dr Ashley Amaya (RTI International)
Professor Florian Keusch (University of Mannheim)
Professor Frauke Kreuter (University of Mannheim)
Mr Jan Hecht (SINUS Institut)
Mr Jonathan Heinemann (respondi AG)


For many years, surveys were the standard tool to measure attitudes and behavior in social science research. In recent years, however, researchers have shifted their focus to new sources of data, especially in the online world. For instance, researchers have analyzed the potential of replacing or supplementing survey data with information collected from Twitter and other social media platforms (e.g., Conover et al. 2011), personal devices such as smartphones (e.g., Blumenstock et al. 2015) and data from other places where people leave digital traces (e.g., Goel et al. 2012).
Using data from social media to supplement or replace survey data may produce biased results, however, as individuals often promote a favorable, yet incomplete picture of themselves on social media (van Dijck, 2013). Other records of digital traces, such as browsing histories (i.e., domains visited), may provide a more complete picture of individuals, as they do not build on individuals' self-presentation in the online world. Browsing histories may reveal individual attitudes and behaviors because people tend to consume news that reinforces their existing views (e.g., Barbera et al. 2015).
In this paper, we explore the feasibility of using individuals’ online activities to measure political attitudes and behavior. Specifically, we explore the potentials of using browsing histories and app usage to substitute traditional survey data by predicting individuals' political behavior and political attitudes from their online behavior.
Members of a German commercial non-probability panel gave permission to track their browser and app usage over the four-month period leading up to the 2017 German federal election. Panelists also participated in three waves of a panel survey in which they were asked about the various ways they consume political news and information about politicians in both the offline and online world. We use these survey records to supplement the online behavioral records. Using machine learning methods, we compare the predictive performance of models containing basic socio-demographic information only with that of models supplemented with records of online behavior. This approach allows us to learn whether political behavior and attitudes can be inferred from digital trace data, especially online browsing behavior.
Preliminary results indicate that (1) complementing standard sociodemographic features with online behavioral records improves prediction performance and (2) demographic characteristics can be substituted with digital trace data without a substantial loss in accuracy for selected political outcomes. The explanatory power of digital trace data thereby depends on which part of the political spectrum is studied, where predicting populist party affiliation benefits most from including information from individuals' online behavior. This result supports the notion that online passive data may help to infer the prevalence of attitudes or behaviors that are prone to social desirability bias.
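
The comparison of feature sets described above can be sketched with a leave-one-out evaluation of a 1-nearest-neighbour classifier, run once with a demographic feature only and once with an added browsing feature. The abstract does not name the learning methods used, so the classifier here is a stand-in; the function name, feature names and the six synthetic records are hypothetical and are not results from the study.

```python
import math


def nearest_neighbour_accuracy(rows, feature_names, label_name):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, target in enumerate(rows):
        best_distance, predicted = math.inf, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # never use the record being predicted
            distance = math.sqrt(
                sum((target[f] - other[f]) ** 2 for f in feature_names)
            )
            if distance < best_distance:
                best_distance, predicted = distance, other[label_name]
        if predicted == target[label_name]:
            correct += 1
    return correct / len(rows)


# Synthetic toy panel: scaled age, share of visits to a set of domains,
# and a binary political outcome.
panel = [
    {"age": 0.2, "domain_share": 0.90, "outcome": 1},
    {"age": 0.2, "domain_share": 0.10, "outcome": 0},
    {"age": 0.5, "domain_share": 0.80, "outcome": 1},
    {"age": 0.5, "domain_share": 0.20, "outcome": 0},
    {"age": 0.8, "domain_share": 0.85, "outcome": 1},
    {"age": 0.8, "domain_share": 0.15, "outcome": 0},
]
accuracy_demographics = nearest_neighbour_accuracy(panel, ["age"], "outcome")
accuracy_combined = nearest_neighbour_accuracy(
    panel, ["age", "domain_share"], "outcome"
)
print(accuracy_demographics, accuracy_combined)
```

The toy data are constructed so that age alone carries no signal, which exaggerates the gain from the browsing feature. With real panel data the comparison would use held-out samples and more suitable learners.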