BigSurv20 program



Getting the combination right: Exploring the use of surveys with other data sources

Moderator: Barry Schouten ([email protected])
Friday 20th November, 10:00 - 11:30 (ET, GMT-5)
7:00 - 8:30 (PT, GMT-8)
16:00 - 17:30 (CET, GMT+1)

Social networks on smartphones. Congruency of online and offline networks and their effect on labor market outcomes

Dr Sebastian Bähr (Institute for Employment Research (IAB)) - Presenting Author
Mr Georg-Christoph Haas (Institute for Employment Research (IAB))
Professor Florian Keusch (University of Mannheim)
Professor Frauke Kreuter (Institute for Employment Research (IAB) / University of Maryland)
Professor Mark Trappmann (Institute for Employment Research (IAB) / University of Bamberg)

Social networks play an important part in many dimensions of life, be it health or the labor market. Networks enable the flow of information and access to critical gatekeepers and valuable resources, all of which constitute forms of social capital. While social network research has been very creative in using new data sources, nationally representative surveys have not yet been combined with social network measures from smartphones. This might prove to be a blind spot in the digital age, where smartphones are omnipresent and instrumental in the social connectedness of individuals.
While the emergence of the internet stimulated a lot of research about online networks and their impact on social embeddedness and social isolation, there is little knowledge about the congruence of offline and online social networks, even less so for social connectedness via smartphones and social networking site (SNS) apps. Most surveys rely on summary measures of network size or sampling techniques for network structure, but it is unclear whether respondents do consider their online social contacts in these measures. If a difference between the two network parts exists, the measurement error in studies relying on survey data alone could explain heterogeneous findings on the role of social networks in many research domains—at least for the increasingly large group of smartphone users.
We use data from the 2018 IAB-SMART-study, where we invited participants of the panel survey “Labour market and social security” (PASS) to a six-month-long app study. From the 651 participants, we collected daily data on app use, phonebook contacts regarding gender and ethnic composition, and call and text message logs. During the study period, we repeatedly surveyed respondents concerning their social networks and the congruence between online and offline social contacts.
With these data, we can map this so far unobserved part of social networks and determine its intra-individual congruence with the network measures from the PASS survey. The embeddedness of the IAB-SMART study in PASS allows us to account, by using weights, for the selectivity of participation in PASS, smartphone ownership, and participation in the IAB-SMART study. Therefore, we can generalize our findings to German Android smartphone owners and the general population.
Preliminary results for SNS-app usage show that these measures contribute network information that overlaps with PASS survey measures only to a small degree and, for the most part, constitute unique and so far unobserved network information. Individual apps fulfill a multitude of functions, from passive social media consumption to dating and straightforward communication. Combining our survey and observational app use data, we can identify the relevant apps for social embeddedness.
For our presentation at BigSurv20 in November, we will also prepare information on the social networks derived from the participants’ phone and text message logs. Together, app use and call log data provide an encompassing picture of social embeddedness on smartphones that will allow us to challenge existing knowledge and gain new insights into the structure and effects of social networks.

A novel approach to combine survey and bibliometric data for science policy research

Dr Samson Adeshiyan (NSF/NCSES) - Presenting Author
Ms Jodi Basner (Clarivate Analytics)
Dr Wan-Ying Chang (NSF/NCSES)

To advance understanding of sociodemographic contributions to scientific research and to assess outcomes of the nation’s strategic investment in graduate education and postdoctoral research training, a novel approach that combines machine learning and author disambiguation techniques is used to link respondents of the Survey of Doctorate Recipients (SDR) to the publication database Web of Science (WoS). The ability to examine bibliometric data with well-curated survey data rich in demographic, education, and career history information is unique to the SDR-WoS dataset. In this paper, we discuss how our novel data linkage approach is used to investigate demographic differences in publication output and research collaborations of the highly-trained doctoral workforce.
The SDR is conducted every 2 years by the National Center for Science and Engineering Statistics within the U.S. National Science Foundation. This longitudinal survey follows recipients of research doctorates from U.S. institutions until age 76. SDR data are used by employers in the education, industry, and government sectors to understand and predict trends in employment opportunities and salaries for doctorate holders. About 78,000 U.S.-trained PhDs were matched to the authors of publications indexed by the WoS. The matching process took place in two stages. First, gradient boosting machines trained on carefully constructed training data were applied to identify publications that could be matched to SDR respondents with a high degree of confidence. Second, those matches were further matched to authorship clusters to expand the overall recall rate. We explored different methods to improve the efficiency and quality of the training data. Extensive analysis of matching predictions was conducted for important subgroups at each stage to determine a good balance in combining the outcomes of the two stages to reach the desired overall precision and recall rates.
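The trade-off between the two stages can be illustrated with a minimal sketch. It is not the authors' pipeline: the links, cluster names, and helper functions below are hypothetical, and the first-stage classifier is replaced by a fixed set of high-confidence links. It only shows how expanding those links through authorship clusters can raise recall while lowering precision.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) of predicted links against true links."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def expand_with_clusters(links, clusters):
    """Add every publication that shares an authorship cluster with a linked one."""
    expanded = set(links)
    for respondent, publication in links:
        for members in clusters.values():
            if publication in members:
                expanded |= {(respondent, other) for other in members}
    return expanded


# Hypothetical ground truth: (respondent_id, publication_id) pairs.
truth = {(1, "p1"), (1, "p2"), (2, "p3"), (2, "p4"), (3, "p5")}

# Stage 1 (assumed): only links accepted with high confidence.
stage1 = {(1, "p1"), (2, "p3")}

# Stage 2 (assumed): authorship clusters; "p9" is a wrongly clustered paper.
clusters = {"c1": {"p1", "p2"}, "c2": {"p3", "p4", "p9"}}
stage2 = expand_with_clusters(stage1, clusters)

print(precision_recall(stage1, truth))
print(precision_recall(stage2, truth))
```

In this toy example the cluster expansion recovers two additional true links but also admits one false link, which is the kind of balance the subgroup analysis described above would need to assess.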
We analyze demographic differences in publication output and research collaboration to highlight the potential and unique contribution of the SDR-WoS linked data. While diversity in the scientific workforce is essential for a robust scientific system, men and majority populations are still disproportionately represented in science and engineering. In the U.S., women and underrepresented minority groups (blacks or African Americans, Hispanics or Latinos, and American Indians or Alaska Natives) are underrepresented among PhD researchers. Other strong disparities are also observed in publication output, a key indicator of involvement in scientific research.
The most accurate source of gender and race data would be self-reported. Several bibliometric studies have examined disparities by gender and race/ethnicity as determined by predictive algorithms, whose accuracy varies dramatically by country. However, there are few datasets that combine both sociodemographic data and research activity. Some creative solutions have been proposed, such as using biosketches in grant proposals or matching tax data to patenting data as a source of sociodemographic information, but none provides a full analysis of research activity by gender and race. Our analysis overcomes the need for proxies by using self-reported gender, race/ethnicity, U.S. citizenship status, doctoral training and employment outcomes for a large-

LEARN4SDGis–A machine learning based poverty mapping exercise in Austria

Mr Johannes Gussenbauer (Statistics Austria) - Presenting Author
Mr Thomas Glaser (Statistics Austria)
Mrs Ingrid Kaminger (Statistics Austria)
Dr Alexander Kowarik (Statistics Austria)
Mrs Sibylle Saul (Statistics Austria)
Dr Matthias Till (Statistics Austria)
Mrs Alexandra Wegscheider-Pichler (Statistics Austria)

This contribution reports the final results of the LEARN4SDGis project, which was funded by a EUROSTAT grant on merging geospatial information and statistics. The project aimed to extend the use of geographic information systems to sample data for which detailed regional estimates are normally not available. In particular, this refers to important national indicators for the global Sustainable Development Goals (SDGs) on poverty, health and education. The project led to an atlas application with an enhanced geographic resolution compared to direct estimates from the relatively small survey samples in the social statistics domain.

This was achieved by using machine learning algorithms which integrate sample information with new data sources on spatial distributions and registers. The approach proved particularly successful for poverty maps, derived from EU-SILC data.

Subnational estimates from survey data are in high demand but scarcely available due to sample size restrictions. Activities in the field of small area estimation consistently show that precision can be enhanced with several methods as long as reliable auxiliary information is available. The contribution explains how several machine learning algorithms were trained on EU-SILC sample data to predict the poverty status of each individual in the population. The algorithms can rely on extensive register information on work and transfer incomes, which make up 80% of total income in EU-SILC. While the major income components of individuals appear well covered in register data, the household composition represented in such data is not fully coherent with the actual living circumstances of respondents, which implies that regional poverty rates need to be approximated. In this process, information on the spatial distribution of potentially predictive variables (including accessibility) can be considered.

The different algorithms are evaluated using cross-validation, which also gives an indication of the variance of the different approaches. The results suggest that predictions at the unit level work remarkably well, but certain geographical patterns remain too suspicious for dissemination, so suppression rules had to be developed. For instance, some urban regions as well as frontier regions (possibly related to labour mobility across borders) appear to exhibit implausibly high poverty rates. These are likely to be artefacts and may require strengthening the spatial component in the estimation (e.g. by shrinkage).
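As a rough illustration of what shrinkage does to implausible regional rates, the sketch below pulls each region's rate toward the overall rate, with small regions pulled hardest. This is a generic shrinkage estimator under stated assumptions, not the project's method: the function name, the tuning constant k, and the region counts are all hypothetical, and a spatial variant would shrink toward neighbouring regions rather than the overall rate.

```python
def shrink_rates(regions, k=50.0):
    """Shrink regional poverty rates toward the overall rate.

    regions maps a region name to (n_persons, n_predicted_poor).
    k is a hypothetical tuning constant: the larger k is, the more
    strongly small regions are pulled toward the overall rate.
    """
    total_persons = sum(n for n, _ in regions.values())
    total_poor = sum(poor for _, poor in regions.values())
    overall_rate = total_poor / total_persons

    shrunk = {}
    for name, (n, poor) in regions.items():
        direct_rate = poor / n
        weight = n / (n + k)  # weight given to the region's own rate
        shrunk[name] = weight * direct_rate + (1 - weight) * overall_rate
    return shrunk


# Hypothetical counts: region "B" is small and has a very high direct rate.
regions = {"A": (1000, 140), "B": (50, 20), "C": (950, 120)}
for name, rate in shrink_rates(regions).items():
    print(name, round(rate, 3))
```

Here the small region's direct rate of 40% is pulled roughly halfway toward the overall rate of 14%, while the two large regions barely move.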

This contribution presents methods as well as the interactive atlas with its flexible presentation of results between raster, enumeration districts or higher aggregations such as NUTS regions. The discussion shall include challenges with regard to privacy and the political sensitivity in disseminating such “experimental” statistics.