BigSurv18 program







Can We Mix It? Big Data Tools, Social Network Analysis, and Causal Inference

Chair Dr Thomas Emery (NIDI)
Time: Saturday 27th October, 16:00 - 17:30
Room: 40.063

Learning on Survey Data to Qualify Big Data in a Web Environment

Mrs Lucie Duprat (Mediametrie) - Presenting Author
Mr Claudio Barros (Mediametrie)
Ms Aurélie Vanheuverzwyn (Mediametrie)


In the online advertising industry, one of the key challenges is to serve the right ad to the right user. In this context, a group of online publishers wants to improve its cookie profiling. Each publisher has one or more websites, and a tag is implemented on each page of those websites to follow cookie navigation. The tag is a piece of code that sends a log to a database whenever a user visits any URL in the perimeter. This log contains a cookie ID, the timestamp of the visit and the URL visited.

The purpose of the project is to assign each cookie a socio-demographic profile consisting of an age bracket and a gender. As new cookies appear every day and already qualified cookies generate new online navigation, this qualification must be updated daily.

To address this issue, we have created a supervised learning model. Mediametrie is the benchmark for audience measurement in France for TV, radio and the Internet. To measure Internet audiences on computers, Mediametrie Net Ratings maintains a panel of 18,000 individuals that is representative of the French population with Internet access. Connections are measured using meter software installed on the panelists' computers that feeds data back to Mediametrie's servers. We therefore have full access to our panelists' Internet usage and we also know their socio-demographic profiles. This is our learning database.

From this database, the first step was to create multiple features from the timestamp and the URL (the only information available from the tag). First we create features related to the day and time slot of the visit, the domain visited, and the presence of certain keywords in the URL. We then use different natural language processing methods to analyse the URLs and create several word clusters and URL clusters. At the end of the feature engineering we have more than 1,000 features.
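As an illustration of this feature engineering step, the sketch below derives day, time-slot, domain and keyword features from raw tag logs with pandas. The column names, keyword list and example URLs are invented for the example and do not reflect Mediametrie's actual schema or vocabulary.

```python
from urllib.parse import urlparse

import pandas as pd

# Hypothetical log extract: one row per tag hit with cookie_id, timestamp and url.
logs = pd.DataFrame({
    "cookie_id": ["a1", "a1", "b2"],
    "timestamp": pd.to_datetime([
        "2018-10-22 08:15:00", "2018-10-22 21:40:00", "2018-10-23 13:05:00",
    ]),
    "url": [
        "https://news.example.fr/sport/football-results",
        "https://recipes.example.fr/dessert/chocolate",
        "https://news.example.fr/politics/election",
    ],
})

# Day-of-week and time-slot features derived from the timestamp.
logs["day_of_week"] = logs["timestamp"].dt.day_name()
logs["time_slot"] = pd.cut(
    logs["timestamp"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,
)

# Domain of the visited page.
logs["domain"] = logs["url"].map(lambda u: urlparse(u).netloc)

# Presence of illustrative keywords in the URL (the keyword list is invented).
for kw in ["sport", "recipes", "politics"]:
    logs[f"kw_{kw}"] = logs["url"].str.contains(kw, case=False).astype(int)

# One-hot encode the categorical features and aggregate to one row per cookie.
features = (
    pd.get_dummies(logs[["cookie_id", "day_of_week", "time_slot", "domain"]],
                   columns=["day_of_week", "time_slot", "domain"])
    .join(logs.filter(like="kw_"))
    .groupby("cookie_id")
    .sum()
)
print(features)
```

In the real pipeline such transformations would be applied to every URL in the perimeter, yielding the 1,000+ features mentioned above.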

The second step was to test different model structures (for example, first predicting gender and then using it to predict age) and different qualification algorithms. We created a workflow in Python that selects the discriminant features with a random forest, compares scikit-learn and XGBoost algorithms for the qualification, tests different combinations of these algorithms, and exports files to supervise the results and tune the parameters.
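A minimal sketch of such a selection-and-comparison workflow is shown below, assuming scikit-learn and xgboost are available. The placeholder data, estimator choices and hyperparameters are illustrative rather than the authors' actual configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# Placeholder stand-ins for the panel data: X would hold the ~1,000 engineered
# features per panelist cookie, y the known label from the panel (here, gender).
rng = np.random.default_rng(0)
X = rng.random((500, 50))
y = rng.integers(0, 2, 500)

# Candidate qualification algorithms: a scikit-learn booster and XGBoost.
candidates = {
    "sklearn_gbm": GradientBoostingClassifier(),
    "xgboost": XGBClassifier(n_estimators=200, eval_metric="logloss"),
}

for name, clf in candidates.items():
    # Discriminant features are selected with a random forest, as in the abstract.
    pipe = make_pipeline(
        SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0)),
        clf,
    )
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```

A real workflow would additionally export the cross-validation results to files so that analysts can supervise the models and tune the parameters, as described above.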

Once the combination and the parameters are fixed, we apply the model to all the logs received from the tag. We first apply some filters to restrict to a perimeter comparable to the panel, then create all the features and apply the model. This processing is done in PySpark due to the large volume of data involved.
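The sketch below illustrates what the large-scale step might look like in PySpark: filtering to a perimeter comparable to the panel and recreating per-cookie features before scoring. The paths, column names and filter rules are assumptions made for the example; the abstract does not specify them.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cookie_qualification").getOrCreate()

# Raw tag logs with cookie_id, timestamp and url (hypothetical path and schema).
logs = spark.read.parquet("hdfs:///tag_logs/")

# Example filter to approximate a perimeter comparable to the panel:
# drop cookies seen only once (the real filters are not described in the abstract).
kept = logs.groupBy("cookie_id").count().filter(F.col("count") > 1).select("cookie_id")
logs = logs.join(kept, "cookie_id")

# Recreate the same kind of features as in the training step.
features = (
    logs
    .withColumn("domain", F.regexp_extract("url", r"https?://([^/]+)/?", 1))
    .withColumn("hour", F.hour("timestamp"))
    .withColumn("is_evening", (F.col("hour") >= 18).cast("int"))
    .groupBy("cookie_id")
    .agg(F.sum("is_evening").alias("evening_hits"),
         F.count("*").alias("total_hits"))
)

# The trained model is then applied to these feature vectors, e.g. by broadcasting
# it to the executors; the output feeds the daily qualification.
features.write.mode("overwrite").parquet("hdfs:///cookie_features/")
```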

In production, the model is updated every month and the cookies are qualified every day. We will launch the first advertising campaign using this qualification in the coming months.


Surveys and Big Data for Estimating Brand Lift

Dr Tim Hesterberg (Google)
Dr Kyra Singh (Google)
Dr Ying Liu (Google) - Presenting Author
Dr Lu Zhang (Google)
Dr Rachel Fan (Google)
Dr Mike Wurm (Google)

Google Brand Lift Surveys estimates the effect of display advertising using surveys. Challenges include imperfect A/B experiments, response and solicitation bias, discrepancies between intended and actual treatment, comparing treatment-group users who took an action with control users who might have acted, and estimation for different slices of the population. We approach these issues using a combination of individual-study analysis and meta-analysis across thousands of studies. This work involves a combination of small and large data - survey responses and log data, respectively.

There are a number of interesting and even surprising methodological twists. We use regression to handle imperfect A/B experiments and response and solicitation biases; we find regression to be more stable than propensity methods. We use a particular form of regularization that combines the advantages of L1 regularization (better predictions) and L2 (smoothness). We use a variety of slicing methods that estimate either incremental or non-incremental effects of covariates such as age and gender, which may be correlated. We bootstrap to obtain standard errors. In contrast to many regression settings, where one may either resample observations or fix X and resample Y, here only resampling observations is appropriate.
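As an illustration of the last point, the sketch below bootstraps by resampling whole observations to obtain a standard error for a regularized regression estimate of a treatment effect. The simulated data, the elastic-net penalty (standing in for the unspecified regularizer that combines L1 and L2 advantages) and all parameter values are assumptions for the example, not Google's actual model.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Simulated data: y is a survey-based lift outcome, `treat` an exposure indicator,
# X a handful of covariates (age, gender and the like would appear here in practice).
rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 5))
treat = rng.integers(0, 2, n)
y = 0.3 * treat + X @ np.array([0.5, -0.2, 0.0, 0.1, 0.0]) + rng.normal(size=n)
design = np.column_stack([treat, X])

def lift_estimate(design, y):
    # Elastic net combines an L1 and an L2 penalty; the abstract's particular
    # form of regularization is not specified, so this is only a stand-in.
    model = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(design, y)
    return model.coef_[0]  # coefficient on the treatment indicator

# Bootstrap by resampling whole observations (rows), the appropriate choice here,
# rather than fixing X and resampling Y.
boot = np.array([
    lift_estimate(design[idx], y[idx])
    for idx in (rng.integers(0, n, n) for _ in range(500))
])
print("estimate:", lift_estimate(design, y), "bootstrap SE:", boot.std(ddof=1))
```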


Finding Friends - A Network Approach to Geo-Locating Twitter Users

Mr Niklas M. Loynes (NYU / University of Manchester) - Presenting Author
Professor Jonathan Nagler (NYU)
Dr Andreu Casas (NYU)
Ms Nicole Baram (NYU)

A rapidly growing body of research in political science uses data from social networking sites (SNS) such as Twitter to study political behavior, including protest mobilization, opinion formation, ideological polarization, and agenda setting. A key critique of this type of research is that the underlying demographics and characteristics of the population of Twitter users are mostly unknown, as are the characteristics of any sample of Twitter users. This makes drawing population inferences, as well as the analysis of particular subgroups, extremely challenging.

In this paper we present a new method for estimating the location of Twitter users. Many users provide location information in their user-supplied metadata; for those who do not, we use information from their "reciprocal follower-friend network" to estimate their country, region, and local municipality. The underlying theoretical assumption of this approach is that Twitter networks are geographically clustered, and that we can identify where a user is located from the known locations of their follower-friend network. As many members of the follower-friend network are likely to provide location information in their Twitter metadata, we argue that it is possible to accurately identify the location of most Twitter users, including those who provide no meaningful location information in their self-supplied metadata.
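To make the intuition concrete, a toy sketch of one possible assignment rule, a simple majority vote over the known locations of a user's reciprocal friends, is shown below. All names, locations and thresholds are invented, and the estimator used in the paper may be considerably more sophisticated.

```python
from collections import Counter

# Hypothetical inputs: the reciprocal follower-friend network as an adjacency list,
# and the subset of users whose self-supplied metadata yields a usable location.
reciprocal_friends = {
    "user_a": ["user_b", "user_c", "user_d"],
    "user_b": ["user_a", "user_c"],
}
known_locations = {"user_b": "New York", "user_c": "New York", "user_d": "London"}

def infer_location(user, min_share=0.5):
    """Return the most common known location among a user's reciprocal friends,
    provided it reaches a minimum share of located friends (threshold is invented)."""
    locations = [known_locations[friend]
                 for friend in reciprocal_friends.get(user, [])
                 if friend in known_locations]
    if not locations:
        return None
    top_location, votes = Counter(locations).most_common(1)[0]
    return top_location if votes / len(locations) >= min_share else None

print(infer_location("user_a"))  # -> "New York"
```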

Using a sample of over 400,000 Twitter users, we demonstrate that we can identify with considerable accuracy the specific US state (as well as the equivalent administrative subdivisions in other countries) in which Twitter users live, and we also demonstrate a method for identifying their congressional district. We showcase the accuracy of the method by comparing our estimated locations to the locations of users who voluntarily reveal their location in their metadata, as well as to users who have reported their home locations in surveys, for respondents from the United States and the United Kingdom.



Analyzing Big and Small Collections of Books With Network Coincidence Analysis

Mr Modesto Escobar (Universidad de Salamanca)
Mr Luis Martínez-Uribe (Universidad de Salamanca & Fundación Juan March) - Presenting Author
Mr Carlos Prieto (Universidad de Salamanca)
Mr David Barrios (Universidad de Salamanca)


Given the emergence of big data generated by massive digitization, as well as the growing access to information from the so-called second digital revolution, social scientists face a number of methodological challenges to better understand social life: data collection, new ways of sampling, automatic coding and statistical analysis of information.

This presentation proposes an approach to the analysis of information based on data binarization. The idea is to build three-dimensional binary matrices formed by 1) temporal or spatial sets, 2) scenarios and 3) events or characteristics, supported by matrices of their attributes. This structure is analysed with the methodology of two- and three-mode networks, combined with statistical tools for the selection and placement of nodes and the representation of edges.
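A minimal Python illustration of the underlying idea, building a binary two-mode incidence matrix and deriving a coincidence matrix from it, is given below. The data are invented, and the naive edge-selection rule shown is only a placeholder for the statistical tools the authors actually use.

```python
import numpy as np
import pandas as pd

# Toy two-mode example: rows are scenarios (e.g. books), columns are events or
# characteristics; an entry is 1 if the characteristic occurs in that scenario.
incidence = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 1, 1],
     [1, 1, 0, 1]],
    index=["book1", "book2", "book3", "book4"],
    columns=["history", "politics", "science", "religion"],
)

# Coincidence matrix: how often each pair of characteristics co-occurs in the
# same scenario (the diagonal gives simple frequencies).
coincidence = incidence.T @ incidence
print(coincidence)

# A naive selection rule for edges: keep pairs that co-occur more often than
# independence would predict (used here purely as illustration).
n = len(incidence)
diag = coincidence.values.diagonal()
expected = np.outer(diag, diag) / n
edges = (coincidence.values > expected) & ~np.eye(len(coincidence), dtype=bool)
print(pd.DataFrame(edges, index=coincidence.index, columns=coincidence.columns))
```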

The presentation will be illustrated with examples and strategies for the extraction and interactive graphical analysis of library data (British Library, Library of Congress, ...), as well as with treatments of smaller bibliographies. An R package will also be introduced that performs these analyses and presents them on web pages in a visual and interactive way.


Climatic Visual Art for Farmers Insights

Mr John Lunalo (Evidence Action, WorldQuant University) - Presenting Author
Mr Elvis Karanja (Kenya Markets Trust)

Introduction

Crop modelling and agricultural experimental design entail a range of activities, among them working with climatic data to inform farmers about better and modernised scientific agricultural practices. This is hard to achieve without a tool for data exploration and analysis. Working with data is becoming increasingly common in many sectors, and data insights drive informed decision making in agriculture. To contribute towards equipping farmers with more insightful tools, we are designing an application that helps farmers visualize and summarise climatic data. We are using R's Shiny package to create an interactive user interface while leveraging R's visualization and data manipulation capabilities to ensure powerful data munging, exploration and analysis.

Methodology

User Interface Design

Having an easy-to-use interface is key for any program. To achieve this, we are drawing on the readily available skills of experienced software developers, agricultural industry trainers and farmers we know.
We are using the Shiny package in R to ensure standard positioning of controls on our graphical user interface (GUI). The GUI is split into two main areas. The visualization area is for creating summary charts, such as boxplots and histograms of rainfall variation, and inventory plots showing the start and end of rainy seasons and daily missing measurements. The data summary and data view area allows farmers to view descriptive statistics and subsets of data across many years and stations. Data filtering and reshaping are also implemented through a dedicated set of controls. The shinythemes package and CSS let us offer multiple themes so that farmers can interact with the application with ease.

Backend Implementation

Data wrangling, visualization and summaries are powered by the tidyverse packages by Hadley Wickham, together with R's built-in flow control, such as error handling, to ensure the stability of the application.

Results

Developing an application that is easy to use yet powerful in data analytics will let farmers derive meaning from data on their own, at no cost and with little support from the technical team. This will go a long way towards transforming agricultural practices, especially in developing countries.

Conclusions

A simple, easy yet powerful visualization system is a valuable tool in the agricultural sector, since most stakeholders in agriculture have little or no background in working with data. Our application is one of many aimed at cutting the cost of turning data into valuable information for farmers, who will be eager to apply scientific approaches in their practices. Scientific approaches to food production are key to realizing the Sustainable Development Goals of Zero Hunger and No Poverty.