Logistic generalized estimating equations (logistic-GEE) models are widely used for analyzing clustered binary data. However, assessing the goodness-of-fit and predictability of these models is problematic because no likelihood is available and observations within a cluster can be correlated. In this paper we propose a new measure for estimating the generalization performance of logistic-GEE models, namely ranking accuracy for models based on clustered data (RAMCD). We define RAMCD as the probability that a randomly selected positive observation is ranked higher than a randomly selected negative observation from another cluster. We propose a computationally efficient algorithm for estimating RAMCD. The algorithm can be applied in two cases: (1) when RAMCD is estimated as a goodness-of-fit criterion and (2) when RAMCD is estimated as a predictability criterion. Both cases are demonstrated experimentally on clustered data from a simulation study and a biomarker study.
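To make the definition concrete, the following is a minimal sketch of a brute-force estimate of the measure: the fraction of positive/negative pairs drawn from different clusters in which the positive observation receives the higher score. It is not the computationally efficient algorithm proposed in the paper. The function name `ramcd_naive`, its arguments, and the choice to count tied scores as half a correct ranking are assumptions made for illustration only.

```python
from itertools import product


def ramcd_naive(scores, labels, clusters):
    """Brute-force estimate of the ranking accuracy described above.

    scores   -- model scores (e.g. predicted probabilities), one per observation
    labels   -- binary outcomes, 1 = positive, 0 = negative
    clusters -- cluster identifier of each observation

    Hypothetical helper: compares every positive observation with every
    negative observation from a different cluster.
    """
    pos = [(s, c) for s, y, c in zip(scores, labels, clusters) if y == 1]
    neg = [(s, c) for s, y, c in zip(scores, labels, clusters) if y == 0]

    wins = 0.0
    pairs = 0
    for (s_pos, c_pos), (s_neg, c_neg) in product(pos, neg):
        if c_pos == c_neg:
            continue  # only pairs from different clusters are counted
        pairs += 1
        if s_pos > s_neg:
            wins += 1.0
        elif s_pos == s_neg:
            wins += 0.5  # assumption: a tie counts as half a correct ranking

    if pairs == 0:
        raise ValueError("no positive/negative pairs from different clusters")
    return wins / pairs


# Toy data: two clusters, each with one positive and one negative observation.
example = ramcd_naive(
    scores=[0.3, 0.2, 0.6, 0.4],
    labels=[1, 0, 1, 0],
    clusters=["a", "a", "b", "b"],
)
print(example)
```

In the toy data only two pairs cross cluster boundaries, and the positive observation is ranked higher in one of them. The pairwise loop scales with the product of the numbers of positive and negative observations, which is the cost an efficient algorithm would aim to avoid.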
|Title of host publication||IDA 2016: Advances in Intelligent Data Analysis XV |
|Editors||H Boström, A Knobbe, C Soares, P Papapetrou|
|Publication status||Published - 21 Sep 2016|
|Series||Lecture Notes in Computer Science|