## Abstract

In multilevel populations, there are two types of population mean of an outcome variable, i.e., the average of all individual outcomes ignoring cluster membership and the average of cluster-specific means. To estimate the first mean, individuals can be sampled directly with simple random sampling or with two-stage sampling (TSS), that is, sampling clusters first and then individuals within the sampled clusters. When cluster size varies in the population, three TSS schemes can be considered: (1) sampling clusters with probability proportional to cluster size and then sampling the same number of individuals per cluster; (2) sampling clusters with equal probability and then sampling the same percentage of individuals per cluster; and (3) sampling clusters with equal probability and then sampling the same number of individuals per cluster. Unbiased estimation of the average of all individual outcomes is discussed under each sampling scheme, assuming cluster size to be informative. Furthermore, the three TSS schemes are compared in terms of efficiency with each other and with simple random sampling under the constraint of a fixed total sample size. The relative efficiency of the sampling schemes is shown to vary across different cluster size distributions. However, sampling clusters with probability proportional to size is the most efficient TSS scheme for many cluster size distributions. Model-based and design-based inference are compared and are shown to give similar results. The results are applied to the distribution of high school size in Italy and the distribution of patient list size for general practices in England.
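The distinction between the two population means, and the way the sampling scheme changes the estimator, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the toy population, the function names, and the use of textbook Hansen-Hurwitz and Horvitz-Thompson forms (with the number of clusters and the total population size assumed known). The two toy clusters are internally homogeneous, so any within-cluster sample mean equals the cluster mean and only the first sampling stage matters.

```python
from statistics import mean

# Hypothetical toy population: the outcome level rises with cluster size,
# so cluster size is informative.
population = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0]]


def mean_of_individuals(clusters):
    """Average of all individual outcomes, ignoring cluster membership."""
    return sum(sum(c) for c in clusters) / sum(len(c) for c in clusters)


def mean_of_cluster_means(clusters):
    """Average of the cluster-specific means."""
    return mean(mean(c) for c in clusters)


def estimate_pps(sample_means):
    """Clusters drawn with replacement, probability proportional to size:
    the unweighted mean of the sampled cluster means (Hansen-Hurwitz form)."""
    return mean(sample_means)


def estimate_equal_prob(sample_sizes, sample_means, n_clusters, n_individuals):
    """Clusters drawn with equal probability: size-weighted cluster means,
    scaled by K/k and divided by the known population size
    (Horvitz-Thompson form)."""
    k = len(sample_means)
    total = sum(size * m for size, m in zip(sample_sizes, sample_means))
    return (n_clusters / k) * total / n_individuals


K = len(population)
N_total = sum(len(c) for c in population)
sizes = [len(c) for c in population]
means = [mean(c) for c in population]

# Exact expectation of each estimator over all possible one-cluster samples.
expected_pps = sum(
    (size / N_total) * estimate_pps([m]) for size, m in zip(sizes, means)
)
expected_equal = sum(
    (1 / K) * estimate_equal_prob([size], [m], K, N_total)
    for size, m in zip(sizes, means)
)

print(mean_of_individuals(population), mean_of_cluster_means(population))
print(expected_pps, expected_equal)
```

In this toy population the average of all individual outcomes is 3.0, whereas the average of cluster means is 2.5. Both estimators target the first quantity: under sampling proportional to size the unweighted mean of cluster means suffices, while under equal-probability sampling the cluster means must be weighted by cluster size.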

| Original language | English |
|---|---|
| Pages (from-to) | 1817-1834 |
| Number of pages | 18 |
| Journal | Statistics in Medicine |
| Volume | 38 |
| Issue number | 10 |
| Early online date | 21 Dec 2018 |
| DOIs | |
| Publication status | Published - 10 May 2019 |

## Keywords

- design-based inference
- hierarchical population
- informative cluster size
- model-based inference
- two-stage sampling
- school connectedness
- probabilities
- inference
- model