TY - JOUR
T1 - Towards clinical implementation of automated segmentation of vestibular schwannomas
T2 - a reliability study comparing AI and human performance
AU - Cornelissen, Stefan
AU - Schouten, Sammy M.
AU - Langenhuizen, Patrick P. J. H.
AU - Kunst, Henricus P. M.
AU - Verheul, Jeroen B.
AU - De With, Peter H. N.
PY - 2025/4
Y1 - 2025/4
N2 - Purpose: To evaluate the clinimetric reliability of automated vestibular schwannoma (VS) segmentations by a comparison with human inter-observer variability on T1-weighted contrast-enhanced MRI scans. Methods: This retrospective study employed MR images, including follow-up, from 1,015 patients (median age: 59, 511 men), resulting in 1,856 unique scans. Two nnU-Net models were trained using fivefold cross-validation to create a single-center segmentation model, along with a multi-center model using additional publicly available data. Geometric-based segmentation metrics (e.g. the Dice score) were used to evaluate model performance. To quantitatively assess the clinimetric reliability of the models, automated tumor volumes from a separate test set were compared to human inter-observer variability using the limits of agreement with the mean (LOAM) procedure. Additionally, new agreement limits that include automated annotations were calculated. Results: Both models performed comparably to current state-of-the-art VS segmentation models, with median Dice scores of 91.6% and 91.9% for the single-center and multi-center models, respectively. There was a stark difference in clinimetric performance between the two models: automated tumor volumes of the multi-center model fell within human agreement limits in 73% of the cases, compared to 44% for the single-center model. Newly calculated agreement limits including the single-center model resulted in very high and wide limits. For the multi-center model, the new agreement limits were comparable to human inter-observer variability. Conclusion: Models with excellent geometric-based metrics do not necessarily imply high clinimetric reliability, demonstrating the need to clinimetrically evaluate models as part of the clinical implementation process. The multi-center model displayed high reliability, warranting its possible future use in clinical practice. However, caution should be exercised when employing the model for small tumors, as the reliability was found to be volume-dependent.
AB - Purpose: To evaluate the clinimetric reliability of automated vestibular schwannoma (VS) segmentations by a comparison with human inter-observer variability on T1-weighted contrast-enhanced MRI scans. Methods: This retrospective study employed MR images, including follow-up, from 1,015 patients (median age: 59, 511 men), resulting in 1,856 unique scans. Two nnU-Net models were trained using fivefold cross-validation to create a single-center segmentation model, along with a multi-center model using additional publicly available data. Geometric-based segmentation metrics (e.g. the Dice score) were used to evaluate model performance. To quantitatively assess the clinimetric reliability of the models, automated tumor volumes from a separate test set were compared to human inter-observer variability using the limits of agreement with the mean (LOAM) procedure. Additionally, new agreement limits that include automated annotations were calculated. Results: Both models performed comparably to current state-of-the-art VS segmentation models, with median Dice scores of 91.6% and 91.9% for the single-center and multi-center models, respectively. There was a stark difference in clinimetric performance between the two models: automated tumor volumes of the multi-center model fell within human agreement limits in 73% of the cases, compared to 44% for the single-center model. Newly calculated agreement limits including the single-center model resulted in very high and wide limits. For the multi-center model, the new agreement limits were comparable to human inter-observer variability. Conclusion: Models with excellent geometric-based metrics do not necessarily imply high clinimetric reliability, demonstrating the need to clinimetrically evaluate models as part of the clinical implementation process. The multi-center model displayed high reliability, warranting its possible future use in clinical practice. However, caution should be exercised when employing the model for small tumors, as the reliability was found to be volume-dependent.
KW - Vestibular schwannoma
KW - Automatic segmentation
KW - Clinimetric reliability
KW - Inter-observer variability
KW - MRI
KW - MANAGEMENT
KW - COHORT
U2 - 10.1007/s00234-025-03611-3
DO - 10.1007/s00234-025-03611-3
M3 - Article
SN - 0028-3940
VL - 67
SP - 1049
EP - 1059
JO - Neuroradiology
JF - Neuroradiology
IS - 4
ER -