BigSurv18 program







Refining Big Data Methods Using Survey Data

Chair: Dr Georgiy Bobashev (RTI International)
Time: Saturday 27th October, 11:00 - 12:30
Room: 40.008

The Effect of Survey Measurement Error on Clustering Algorithms

Ms Paulina Pankowska (Vrije Universiteit Amsterdam) - Presenting Author
Dr Daniel Oberski (Utrecht University)
Dr Dimitris Pavlopoulos (Vrije Universiteit Amsterdam)


Data mining and machine learning often employ a variety of clustering techniques, which aim to separate the data into interesting groups for further analysis or interpretation (Kaufman & Rousseeuw 2005; Aggarwal & Reddy 2013). Examples of well-known algorithms from the data mining literature are K-means, DBSCAN, PAM, Ward, and Gaussian or Binomial mixture models - respectively known as latent profile and latent class analysis in the social science literature. Some of these algorithms (K-means, Ward, mixtures) are commonly applied to surveys, while others (DBSCAN, PAM) may be less familiar to survey researchers but can be equally useful.
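
As a rough illustration of how several of these algorithms could be run side by side on survey-style data, the sketch below uses scikit-learn; the data, variable names, and tuning values are hypothetical stand-ins rather than part of the study.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
    from sklearn.mixture import GaussianMixture

    # Hypothetical respondent-by-item matrix standing in for standardized survey items.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(3, 1, (200, 5))])
    X = StandardScaler().fit_transform(X)

    # K-means and Ward hierarchical clustering with a fixed number of clusters.
    kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    ward_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

    # DBSCAN infers the number of clusters from density (eps is data dependent).
    dbscan_labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(X)

    # A Gaussian mixture model, i.e. latent profile analysis for continuous items.
    # PAM (k-medoids) is available separately, e.g. KMedoids in scikit-learn-extra.
    gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)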

Surveys, however, are well-known to contain measurement errors. Such errors may adversely affect clustering - for instance, by producing spurious clusters, or by obscuring clusters that would have been detectable without errors. To date, however, little work has examined the effect that survey errors may exert on commonly used clustering techniques. Furthermore, while adaptations to a few specific clustering algorithms exist to make them "error-aware" (Aggarwal 2009, Ch. 8; Aggarwal & Reddy 2013, Ch. 18), no generic methods to correct clustering techniques for such errors are available.
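
A minimal numeric sketch of the second point, assuming a simple additive error model and two well-separated groups (illustrative only, not the authors' analysis):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(1)
    true_labels = np.repeat([0, 1], 200)
    X = rng.normal(0.0, 1.0, (400, 5))
    X[true_labels == 1] += 2.0                        # two well-separated clusters

    # Cluster the error-free data, then the same data with added measurement error.
    clean = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
    noisy = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(
        X + rng.normal(0.0, 2.5, X.shape))            # error swamps the group signal

    print(adjusted_rand_score(true_labels, clean))    # close to 1: clusters recovered
    print(adjusted_rand_score(true_labels, noisy))    # substantially lower: clusters obscured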

In this paper, we present a novel method for performing error-aware clustering - that is, clustering with correction for measurement error through multiple imputation (Boeschoten et al. 2018). We investigate how clustering of a large labor force survey differs with and without this correction. Implications for the application of clustering techniques to survey data are discussed.
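
The sketch below conveys the general multiple-imputation idea in a deliberately simplified form - it assumes a normal additive error model with known error variance and plain K-means, not the actual method of Boeschoten et al. (2018):

    import numpy as np
    from sklearn.cluster import KMeans

    def impute_true_scores(X_obs, error_sd, rng):
        """Draw plausible error-free scores under a simple normal
        measurement-error model (classical test theory)."""
        mu = X_obs.mean(axis=0)
        total_var = X_obs.var(axis=0)
        true_var = np.clip(total_var - error_sd ** 2, 1e-6, None)
        rho = true_var / total_var                    # per-item reliability
        post_mean = mu + rho * (X_obs - mu)
        post_sd = np.sqrt(rho) * error_sd
        return post_mean + rng.normal(0.0, 1.0, X_obs.shape) * post_sd

    def error_aware_kmeans(X_obs, error_sd, n_clusters=2, n_imputations=20, seed=0):
        """Cluster several imputed datasets; the variability across solutions
        reflects the uncertainty that measurement error induces."""
        rng = np.random.default_rng(seed)
        return [KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
                    impute_true_scores(X_obs, error_sd, rng))
                for _ in range(n_imputations)]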

References

Aggarwal, C. C. (2009). Managing and Mining Uncertain Data. Advances in Database Systems, vol. 35.

Aggarwal, C. C., & Reddy, C. K. (Eds.). (2013). Data Clustering: Algorithms and Applications. CRC Press.

Boeschoten, L., Oberski, D. L., de Waal, T. A. G., & Vermunt, J. K. (in press). Updating latent class imputations with external auxiliary variables. Structural Equation Modeling.

Kaufman, L., & Rousseeuw, P. J. (2005). Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ: Wiley.


Efficiency of Classification Algorithms as an Alternative to Logistic Regression in Propensity Score Adjustment for Survey Weighting

Mr Ramón Ferri-García (Department of Statistics and Operations Research, University of Granada) - Presenting Author
Professor María del Mar Rueda (Department of Statistics and Operations Research, University of Granada)


Propensity Score Adjustment (PSA) is a widely used technique for removing selection bias in online surveys. The method was originally developed to reduce the bias due to treatment and control assignment in experimental design (Rosenbaum and Rubin, 1983) and has been applied to online surveys for more than a decade (Taylor, 2000; Taylor et al., 2001). In this context, PSA focuses on narrowing the demographic differences between the population with Internet access and the general population, differences which produce a selection bias inherent to the online survey method itself. If a probabilistic reference sample obtained from an adequate sampling frame is available, the propensity of a respondent to participate in the non-probabilistic survey can be estimated through PSA. This estimation accounts for the differences between the two populations through covariates that can be measured in both samples. The efficiency of PSA has been demonstrated (Lee, 2006; Lee and Valliant, 2009; Valliant and Dever, 2011) when the covariates are related to the target variable or to the mechanism behind the differences between the online and offline populations.
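
A minimal sketch of this workflow, assuming two pandas data frames (a volunteer online sample and a probability reference sample) that share the same demographic covariates; the names and the inverse-propensity weighting choice are illustrative:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def psa_weights(online, reference, covariates):
        """Estimate the propensity of belonging to the online sample on the
        stacked data and turn it into weights for the online respondents."""
        combined = pd.concat([online[covariates], reference[covariates]])
        in_online = np.r_[np.ones(len(online)), np.zeros(len(reference))]
        X = pd.get_dummies(combined, drop_first=True)
        model = LogisticRegression(max_iter=1000).fit(X, in_online)
        propensity = model.predict_proba(X)[:len(online), 1]
        # Inverse-propensity weights; propensity-strata weights are another
        # common choice in the PSA literature.
        return 1.0 / propensity

    # weights = psa_weights(online_sample, reference_sample,
    #                       ["age", "sex", "education", "region"])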

Propensity estimation in PSA is done through a logistic regression model fitted to a binary variable indicating whether a respondent belongs to the online sample or the reference sample. This approach can detect and measure linear relationships between the covariates and the target variable. However, in many cases these relationships are non-linear, which results in less reliable propensity estimates. Classification algorithms developed in machine learning can be considered an alternative to logistic regression, given that most of them can capture non-linear and more complex relationships and thus obtain more accurate propensity estimates. Their efficacy in PSA has been demonstrated in the experimental design context (Lee et al., 2010; Westreich et al., 2010).
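
Under the same set-up, any probabilistic classifier can take the place of logistic regression in the propensity step; a hedged sketch with the model families named below (hyperparameters are arbitrary placeholders, not the study's settings):

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import GradientBoostingClassifier

    # Any classifier exposing predict_proba can replace logistic regression
    # in the propensity step.
    candidate_models = {
        "decision tree": DecisionTreeClassifier(max_depth=5),
        "nearest neighbors": KNeighborsClassifier(n_neighbors=25),
        "naive Bayes": GaussianNB(),
        "boosting": GradientBoostingClassifier(),
    }

    def psa_weights_ml(online, reference, covariates, model):
        combined = pd.concat([online[covariates], reference[covariates]])
        in_online = np.r_[np.ones(len(online)), np.zeros(len(reference))]
        X = pd.get_dummies(combined, drop_first=True)
        propensity = model.fit(X, in_online).predict_proba(X)[:len(online), 1]
        return 1.0 / propensity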

In this work, PSA efficiency was measured using logistic regression and several machine learning classification algorithms based on various approaches: decision trees, nearest neighbors, naive Bayes and boosting. The methods were compared by applying PSA to a simulated population from which a probabilistic sample of general population members and a non-probabilistic sample of Internet users were drawn. The demographic variables created for the simulation were used for PSA in different combinations in order to study how relationships between the covariates and participation in the online survey influence the performance of PSA in these situations.

Results showed that PSA with classification algorithms can reduce selection bias to a greater extent than PSA with logistic regression, although the use of such algorithms can also increase the variance of the estimators. In addition, the results suggested that the adequacy of each machine learning approach depends on the type of relationship between the target variable and Internet access, that is, the mechanism influencing participation in the non-probabilistic survey.


Accessing the Opinions of a Billion People: Mobile Surveys in the Age of Big Data

Mr Duncan Stannett (Qriously Ltd.) - Presenting Author

Qriously is a research startup that conducts surveys using ad spaces in mobile apps, a method we call “programmatic sampling”. We compete for ad spaces on advertising exchanges by bidding within a time window of roughly 20 milliseconds. One challenge is that the mobile population in most countries is not representative of the general population. If we treated every user as equally important in the bidding process, the result would be a survey with an incorrect demographic distribution. This is expensive, as we pay for ads we do not need, and it forces us to weight our sample more heavily after the fact.

Making bidding decisions with machine learning has given us cheaper and more representative samples. This talk will focus on how we used embeddings to engineer the data available to us into informative features for machine learning models.

Embeddings have applications in NLP (Google’s word2vec), recommendation engines (item2vec) and computer vision. This talk will introduce the idea of embeddings as a way to condense large, sparse categorical information into continuous vector representations. The goal of this introduction is to give the audience enough information to imagine uses in their own work.
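
As a hedged illustration of the idea, the PyTorch sketch below learns item2vec-style vectors for app IDs from co-occurrence on the same device; the vocabulary size, dimensionality, and training data are made up, and this is not Qriously's production code:

    import torch
    import torch.nn as nn

    NUM_APPS, DIM = 10_000, 32            # vocabulary of app IDs, embedding size

    class App2Vec(nn.Module):
        """Skip-gram-style embeddings: apps seen on the same device are pushed
        close together in the embedding space."""
        def __init__(self, num_apps, dim):
            super().__init__()
            self.target = nn.Embedding(num_apps, dim)
            self.context = nn.Embedding(num_apps, dim)

        def forward(self, target_ids, context_ids, labels):
            # labels: 1 for co-occurring app pairs, 0 for negative samples
            scores = (self.target(target_ids) * self.context(context_ids)).sum(-1)
            return nn.functional.binary_cross_entropy_with_logits(scores, labels)

    model = App2Vec(NUM_APPS, DIM)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # One toy gradient step on made-up (target, context, label) triples.
    t = torch.randint(0, NUM_APPS, (256,))
    c = torch.randint(0, NUM_APPS, (256,))
    y = torch.randint(0, 2, (256,)).float()
    loss = model(t, c, y)
    loss.backward()
    optimizer.step()

    # model.target.weight now holds a dense 32-dimensional vector per app,
    # usable as a feature in a downstream bidding model.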

We will then explain the application we found for embeddings: vector representations of mobile apps used as features in downstream models. The presentation will cover the architecture of the networks used and the process and tools we used to optimise them, before finally reporting the improvements to sampling achieved with this method.


How YouTube Uses Survey Data to Improve Video Recommendations

YouTube Recommendations Team: Berg, Haulk, Marriott, McFadden (YouTube / Google) - Presenting Author

The YouTube recommendation system presents personalized video suggestions to viewers based on their viewing history, with the goal of recommending videos that viewers will enjoy and value watching. Historically, the primary measures of value have been based on product logs that record how viewers engage with videos - for example, which recommendations were watched and for how long. We define engagement metrics based on these facts and assume that they provide a good indication of the value provided to the user. For example, recommendations that are watched are assumed to be better than those that are not, and videos that are watched for a longer period of time are assumed to be enjoyed more.

In this paper, we describe how and why we use survey data to measure and improve the recommendation system. While log data shows what users do, survey data tells us what users say, providing very valuable new information for measuring and improving the viewer experience of recommendations. We outline solutions to problems related to collecting and using survey data for personalized recommendations, including: (1) defining satisfaction metrics based on the survey responses we collect, (2) a transfer-learning approach to integrating sparse survey data into the deep neural network at the heart of the YouTube recommendation system, and (3) a paired-comparison experiment design for measuring changes in survey-derived metrics. We discuss the advantages and disadvantages of surveys delivered in YouTube and surveys delivered through the Google Opinion Rewards app. Finally, we describe the measurable improvements to YouTube recommendations achieved by combining survey methods with large-scale machine learning.
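
A rough sketch of the transfer-learning idea in point (2), assuming a generic engagement network and a small satisfaction head fitted on the sparse survey labels; the architecture and names are illustrative assumptions, not the actual YouTube system:

    import torch
    import torch.nn as nn

    class EngagementModel(nn.Module):
        """Stand-in for a large recommendation network trained on abundant
        engagement logs (e.g. clicks and watch time)."""
        def __init__(self, num_features=128, hidden=256):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Linear(num_features, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
            self.engagement_head = nn.Linear(hidden, 1)

        def forward(self, x):
            return self.engagement_head(self.backbone(x))

    # Transfer step: freeze the shared representation learned from engagement
    # logs and train only a small satisfaction head on sparse survey responses.
    pretrained = EngagementModel()
    for p in pretrained.backbone.parameters():
        p.requires_grad = False

    satisfaction_head = nn.Linear(256, 1)
    optimizer = torch.optim.Adam(satisfaction_head.parameters(), lr=1e-3)

    x_survey = torch.randn(64, 128)               # features of surveyed items
    y_survey = torch.rand(64, 1)                  # e.g. rescaled 5-point ratings
    loss = nn.functional.mse_loss(
        satisfaction_head(pretrained.backbone(x_survey)), y_survey)
    loss.backward()
    optimizer.step()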