Comparing Inference Methods for Non-probability Samples

Bart Buelens; Joep Burger; Jan A. van den Brakel

doi:10.1111/insr.12253

Bart Buelens^*, Joep Burger, Jan A. van den Brakel

^*Corresponding author for this work

QE Econometrics

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Social and economic scientists are tempted to use emerging data sources like big data to compile information about finite populations as an alternative for traditional survey samples. These data sources generally cover an unknown part of the population of interest. Simply assuming that analyses made on these data are applicable to larger populations is wrong. The mere volume of data provides no guarantee for valid inference. Tackling this problem with methods originally developed for probability sampling is possible but shown here to be limited. A wider range of model-based predictive inference methods proposed in the literature are reviewed and evaluated in a simulation study using real-world data on annual mileages by vehicles. We propose to extend this predictive inference framework with machine learning methods for inference from samples that are generated through mechanisms other than random sampling from a target population. Describing economies and societies using sensor data, internet search data, social media and voluntary opt-in panels is cost-effective and timely compared with traditional surveys but requires an extended inference framework as proposed in this article.

Original language	English
Pages (from-to)	322-343
Number of pages	22
Journal	International Statistical Review
Volume	86
Issue number	2
DOIs	https://doi.org/10.1111/insr.12253
Publication status	Published - 1 Aug 2018

Keywords

Algorithmic inference
big data
predictive modelling
pseudo-design-based estimation
DESIGN-BASED ANALYSIS
BIG DATA
WEB SURVEYS
OFFICIAL STATISTICS
PROPENSITY SCORE
POPULATIONS
ESTIMATORS
REGRESSION
SELECTION

Cite this

@article{59d37a43aa67491684efd2acccb99cd7,

title = "Comparing Inference Methods for Non-probability Samples",

abstract = "Social and economic scientists are tempted to use emerging data sources like big data to compile information about finite populations as an alternative for traditional survey samples. These data sources generally cover an unknown part of the population of interest. Simply assuming that analyses made on these data are applicable to larger populations is wrong. The mere volume of data provides no guarantee for valid inference. Tackling this problem with methods originally developed for probability sampling is possible but shown here to be limited. A wider range of model-based predictive inference methods proposed in the literature are reviewed and evaluated in a simulation study using real-world data on annual mileages by vehicles. We propose to extend this predictive inference framework with machine learning methods for inference from samples that are generated through mechanisms other than random sampling from a target population. Describing economies and societies using sensor data, internet search data, social media and voluntary opt-in panels is cost-effective and timely compared with traditional surveys but requires an extended inference framework as proposed in this article.",

keywords = "Algorithmic inference, big data, predictive modelling, pseudo-design-based estimation, DESIGN-BASED ANALYSIS, BIG DATA, WEB SURVEYS, OFFICIAL STATISTICS, PROPENSITY SCORE, POPULATIONS, ESTIMATORS, REGRESSION, SELECTION",

author = "Buelens, Bart and Burger, Joep and {van den Brakel}, {Jan A.}",

note = "Data source: Online Kilometer Registratie register of all cars registered in the Netherlands (Rijksdienst voor het Wegvervoer).",

year = "2018",

month = aug,

day = "1",

doi = "10.1111/insr.12253",

language = "English",

volume = "86",

pages = "322--343",

journal = "International Statistical Review",

issn = "0306-7734",

publisher = "International Statistical Institute",

number = "2",

}

TY - JOUR

T1 - Comparing Inference Methods for Non-probability Samples

AU - Buelens, Bart

AU - Burger, Joep

AU - van den Brakel, Jan A.

N1 - Data source: Online Kilometer Registratie register of all cars registered in the Netherlands (Rijksdienst voor het Wegvervoer).

PY - 2018/8/1

Y1 - 2018/8/1

N2 - Social and economic scientists are tempted to use emerging data sources like big data to compile information about finite populations as an alternative for traditional survey samples. These data sources generally cover an unknown part of the population of interest. Simply assuming that analyses made on these data are applicable to larger populations is wrong. The mere volume of data provides no guarantee for valid inference. Tackling this problem with methods originally developed for probability sampling is possible but shown here to be limited. A wider range of model-based predictive inference methods proposed in the literature are reviewed and evaluated in a simulation study using real-world data on annual mileages by vehicles. We propose to extend this predictive inference framework with machine learning methods for inference from samples that are generated through mechanisms other than random sampling from a target population. Describing economies and societies using sensor data, internet search data, social media and voluntary opt-in panels is cost-effective and timely compared with traditional surveys but requires an extended inference framework as proposed in this article.

KW - Algorithmic inference

KW - big data

KW - predictive modelling

KW - pseudo-design-based estimation

KW - DESIGN-BASED ANALYSIS

KW - BIG DATA

KW - WEB SURVEYS

KW - OFFICIAL STATISTICS

KW - PROPENSITY SCORE

KW - POPULATIONS

KW - ESTIMATORS

KW - REGRESSION

KW - SELECTION

U2 - 10.1111/insr.12253

DO - 10.1111/insr.12253

M3 - Article

SN - 0306-7734

VL - 86

SP - 322

EP - 343

JO - International Statistical Review

JF - International Statistical Review

IS - 2

ER -