BigSurv18 program


How Do You Like Those Likes? Exploring the Validity of Measures Derived from Social Media Data

Chair: Dr Lars Lyberg (Inizio)
Time: Saturday 27th October, 14:00 - 15:30
Room: 40.006

Can Facebook "Likes" Measure Human Values?

Under review for the special issue

Dr Daniel Oberski (Utrecht University) - Presenting Author

Human values - the importance of broad goals in life - are of interest to social scientists because they are thought to predispose people to certain attitudes, opinions, and behaviors. They are commonly measured using sets of predesigned survey questions, such as the Schwartz Values Scale (SVS), the Schwartz Portrait Values Questionnaire (PVQ) implemented in the European Social Survey, or Inglehart's partial ranking task implemented in the World Values Survey and European Values Study.

This paper examines whether human values can be measured by observing which pages a user "likes" on Facebook. We analyze a dataset collected by Kosinski et al. (2015) of 111,430 people "liking" 19 million possible Facebook pages, linked with the Schwartz Values Scale questionnaire (n = 12,073). After applying an unsupervised topic model to the Facebook "likes" and traditional factor analysis to the SVS, we describe the reliability and (criterion and face) validity of the learned topics as measures of human values.

Initial results from predicting value factors with gradient boosting machines indicate that Facebook "likes" can indeed validly measure human values, but also that the reliability of this measurement is decidedly low. We discuss these results, further work, and some remaining barriers to the use of Facebook for human values measurement.
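
A minimal sketch of such a pipeline, assuming scikit-learn and simulated data in place of the Kosinski et al. (2015) dataset; the matrix sizes, component counts, and the choice of LDA as the topic model are illustrative assumptions, not the authors' implementation:

```python
# Simulated stand-in for the pipeline in the abstract: topic model on the
# user x page "like" matrix, factor analysis of the SVS items, gradient
# boosting to predict value factors from topic proportions.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import LatentDirichletAllocation, FactorAnalysis
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
likes = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)
likes.data[:] = 1.0                # binary user x page "like" matrix
svs = rng.normal(size=(1000, 57))  # 57 SVS items, simulated here

# 1) Unsupervised topic model on the "likes"; 2) factor analysis of the SVS
# (Schwartz's theory posits 10 basic values).
topics = LatentDirichletAllocation(n_components=50, random_state=0).fit_transform(likes)
factors = FactorAnalysis(n_components=10).fit_transform(svs)

# 3) Criterion validity: predict each value factor from the like topics.
for k in range(factors.shape[1]):
    r2 = cross_val_score(GradientBoostingRegressor(random_state=0),
                         topics, factors[:, k], cv=5, scoring="r2").mean()
    print(f"value factor {k}: cross-validated R^2 = {r2:.3f}")
```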

References

Kosinski, M., Matz, S., Gosling, S., Popov, V., & Stillwell, D. (2015). Facebook as a social science research tool: Opportunities, challenges, ethical considerations and practical guidelines. American Psychologist.


On the Validity of Statistical Inference Using Social Media Data: Two Interpretations of an Existing Study

Miss Martina Patone (University of Southampton) - Presenting Author


The aim of this paper is to consider whether and how social media data can be used for statistical inference in the way other types of data are used. Two approaches to analysis are presented, a one-phase and a two-phase approach, and both are used to interpret an existing study based on social media data.

Social media data are secondary data; they are not constructed or collected for the purpose of statistical analysis. They therefore pose several challenges for conducting traditional statistical analysis that yields meaningful and statistically valid conclusions.

The process that generated the data is unknown and there is no randomized sampling design; therefore, design-based inference cannot be used. A possible alternative is to use a superpopulation model for the analytic variables of interest. A challenging problem in this context is that the sample represents only a part of the target population, namely the users of a social media platform. Thus, generalizing the conclusions to a wider population is not straightforward and requires assumptions. Furthermore, the units of selection and analysis differ: researchers are generally interested in people, but data collection via APIs in most cases selects the messages shared on the platform, not the users.

In this work, we focus on the superpopulation approach to inference and discuss the implications of dealing with two distinct sets of units. A one-phase and a two-phase approach are presented.

The one-phase approach does not deal directly with the problem of two distinct sets of units; it focuses instead on modelling the estimates obtained over time from the two data sources, social media and survey. The aim is to examine whether the social media estimates can be used to improve the efficiency of the survey estimates, by using them as auxiliary information in the estimation process, or even to replace them. Under this approach, the representativeness of the social media data is no longer of interest, since the unit of analysis is now time; this leaves open the question of whether the same model will hold in the future.
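
As a rough sketch of the one-phase idea under strong assumptions, one could regress the survey series on its own lag plus the contemporaneous social media index and ask whether the index adds explanatory power; the simulated series and the AR(1)-plus-covariate form below are illustrative, not the paper's model:

```python
# One-phase sketch: do social media estimates help explain the survey
# estimates over time? Both series are simulated here.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
T = 60
sentiment = rng.normal(size=T).cumsum()                 # social media sentiment index
survey = 0.6 * sentiment + rng.normal(scale=2, size=T)  # survey-based confidence series

X = sm.add_constant(np.column_stack([survey[:-1], sentiment[1:]]))
fit = sm.OLS(survey[1:], X).fit()
print(fit.params)    # coefficient on the auxiliary sentiment series
print(fit.rsquared)  # fit with the auxiliary series included
```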

The two-phase approach involves transforming the units of selection into the units of analysis. Under this approach, a two-phase data life cycle is described to assess the accuracy of the statistics produced. We derive, in a sequential framework, the conditions under which the selection mechanism can be ignored when making analytic inference from social media data. Here, no use of survey data is made, and the statistical validity of the conclusions rests on the social media data generation process alone.
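
A toy illustration of why the two sets of units matter, assuming message-level records with an invented sentiment score: aggregating messages to users (the second phase) changes the estimate whenever posting volume is related to the outcome:

```python
# The API selects messages, but inference is about users.
import pandas as pd

messages = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 3, 3],
    "sentiment": [0.9, 0.8, 0.7, -0.2, 0.1, 0.0],
})

# A naive message-level mean over-weights prolific users ...
print("message-level mean:", messages["sentiment"].mean())

# ... whereas aggregating to users first gives each selected user equal weight.
per_user = messages.groupby("user_id")["sentiment"].mean()
print("user-level mean:   ", per_user.mean())
```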

The two approaches are formulated with reference to the existing case study of the Dutch Confidence Index and the Social Media Sentiment Index. This case study is part of the research conducted by Statistics Netherlands on the use of social media data in official statistics.


Improving the Measurement of Political Behavior by Integrating Survey Data and Digital Trace Data

Dr Sebastian Stier (GESIS)
Dr Johannes Breuer (GESIS)
Dr Pascal Siegers (GESIS) - Presenting Author
Dr Arnim Bleier (GESIS)
Dr Tobias Gummer (GESIS)

Computer-mediated communication has become deeply ingrained in political life. People use digital technologies to get political information and news, directly follow political actors, discuss politics with friends, advocate for political causes, or mobilize offline protests. The measurement and analysis of these activities pose considerable challenges for researchers, since they are distributed across multiple channels and platforms, intertwined, and ephemeral. Political science has traditionally studied these phenomena using survey methods. These, however, suffer from the unreliability of self-reported media use (Prior, 2009). Studies from the emerging field of computational social science, on the other hand, collect digital traces of human behavior in a non-intrusive way. At the same time, these approaches oftentimes do not collect the attributes of research subjects (e.g., sociodemographic or personality characteristics) and/or outcome variables (e.g., voting) that are necessary for answering (causal) questions about the relationship between information exposure and political behavior. Our project synthesizes these two paradigms, which have the potential to compensate for their respective weaknesses when combined in a systematic way.

We use a dataset that links web browsing histories from 2,000 German online users to their responses in a survey. That way, we can objectively measure people’s online behavior while at the same time surveying them for sociodemographic variables and political attitudes that are difficult to measure or can – at best – be approximately inferred from digital traces. Our sample is recruited from a large German online access panel that provides web browsing histories for a subset of their panelists. The respondents were incentivized and consented to be tracked online. In April, our project will start collecting data for 12 months.

We have several research goals that aim to contribute to ongoing debates on item measurement and political behavior. The few related studies that exist (e.g., Guess, 2015; Scharkow, 2016) have predominantly used such an integrated research design to investigate the validity of self-reported media use, but have not focused specifically on questions about political behavior. For our analytical purposes, we will assign the visited websites to politically relevant categories. We will construct this categorization primarily inductively from the data, but will at a minimum differentiate legacy media (only their /politics subdomains), online-only media, and partisan/other political actors, as sketched below.
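
A hypothetical sketch of such a categorization; the domain lists and path rule are invented placeholders, not the project's actual coding scheme:

```python
# Assign visited URLs to politically relevant categories (illustrative only).
from urllib.parse import urlparse

LEGACY_MEDIA = {"spiegel.de", "faz.net"}  # count only their politics sections
ONLINE_ONLY_MEDIA = {"netzpolitik.org"}
POLITICAL_ACTORS = {"cdu.de", "spd.de"}

def categorize(url: str) -> str:
    parsed = urlparse(url)
    domain = parsed.netloc.removeprefix("www.")
    if domain in LEGACY_MEDIA:
        in_politics = parsed.path.startswith(("/politik", "/politics"))
        return "legacy_media_politics" if in_politics else "legacy_media_other"
    if domain in ONLINE_ONLY_MEDIA:
        return "online_only_media"
    if domain in POLITICAL_ACTORS:
        return "political_actor"
    return "non_political"

print(categorize("https://www.spiegel.de/politik/deutschland/"))  # legacy_media_politics
```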

This allows us to investigate at a more fine-grained level than previous studies: (1) How much time do people devote to political purposes online? (2) How can standard survey items such as political interest or party identification be operationalized through web browsing data? (3) How strong are the correlations between measures derived from web browsing and self-reports? (4) Which variables (e.g., sociodemographics, political knowledge) predict the measured deviations between self-reported and tracked behavior?
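
A minimal sketch of how questions (3) and (4) could be examined, using invented column names and toy values:

```python
# Compare tracked political browsing time with self-reports (toy data).
import pandas as pd

df = pd.DataFrame({
    "self_reported_minutes": [30, 10, 60, 5, 20],
    "tracked_minutes":       [12,  8, 41, 2, 25],
    "political_interest":    [ 4,  2,  5, 1,  3],
})

# (3) Convergence of the two measures.
print(df["self_reported_minutes"].corr(df["tracked_minutes"]))

# (4) Does political interest predict the self-report/tracking gap?
df["over_report"] = df["self_reported_minutes"] - df["tracked_minutes"]
print(df["political_interest"].corr(df["over_report"]))
```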


External and Internal Quality of Big Data

Professor Beat Hulliger (FHNW School of Business) - Presenting Author


External quality:
Big Data analyses often seem unconcerned with the problem of the representativity of the data. The reasoning is that the target population is covered by the data available; in other words, the data are exhaustive. However, as in any database, undercoverage, overcoverage and multicoverage (double counting etc.) may occur. The question of how representative a data set is for a particular target population only comes up when sufficient external data from the target population are available. This external data may also be a very large data set and may suffer from similar coverage problems. The question is how to reconcile the two data sets and what can be said about the representativity for a target population. If an individual indicator is present in both data sources, dual system estimation may be used to determine the coverage of the target population. Often, however, units are not identified individually in both data sets; then the comparison must be made on the distributions of common variables, in the style of a typical representativity study for surveys.
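
A minimal sketch of dual system (Lincoln-Petersen) estimation, assuming units can be matched individually across the two sources; all counts are invented:

```python
# Dual system estimation of target population size and coverage.
n1 = 80_000  # units in the big data source
n2 = 50_000  # units in the external source
m = 32_000   # units matched in both via an individual identifier

N_hat = n1 * n2 / m  # estimated size of the target population
print(f"estimated population:     {N_hat:,.0f}")
print(f"big data source coverage: {n1 / N_hat:.1%}")
```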

Internal quality:
Since Big Data is often established for purposes other than the objectives of a particular study, the data on individual units contain missing values and may be erroneous. The data therefore undergo a preparation process to make them amenable to statistical analysis. This may involve merging of data, checks on missing values and ranges, checks on inconsistencies, and outlier detection, as well as subsequent treatment including imputation and the elimination of observations. This data preparation process is similar to that in survey statistics and often involves several stages. To gain information about the data preparation process, including its different stages and its impact on final results, the Quality Team of Eurostat (2014) proposed indicators, and Hulliger and Berdugo (2015) implemented a subset of them. It is not yet clear which indicators are applicable to Big Data preparation processes. A particular issue during data preparation is the investigation of extraneous or extreme cases in Big Data. For surveys, this problem is handled by outlier detection methods; univariate and multivariate robust methods that take missing values and particularities of the survey design into account have been developed (see, e.g., Bill and Hulliger 2016). With Big Data, more pre-processing may introduce more extreme values, and the large size of the data makes it feasible to investigate the tails of distributions rather than single outliers. This opens new possibilities for determining outliers. However, the choice of tuning constants or thresholds for extreme values, and their treatment, becomes even more difficult with Big Data, because the trade-off between bias and variance is typically driven only by the bias.
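
A sketch of tail-based screening with a robust median/MAD rule; the tuning constant c stands for the threshold choice the text flags as difficult, and the data and constants are illustrative:

```python
# Robust screening of the upper tail of a large, contaminated distribution.
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([
    rng.lognormal(mean=3, sigma=0.5, size=1_000_000),  # bulk of the data
    rng.lognormal(mean=6, sigma=0.5, size=1_000),      # contaminated upper tail
])

med = np.median(x)
mad = 1.4826 * np.median(np.abs(x - med))  # consistent scale under normality
c = 5.0                                    # tuning constant: the hard choice
flagged = x[(x - med) / mad > c]
print(f"flagged {flagged.size:,} of {x.size:,} observations ({flagged.size / x.size:.2%})")
```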