Comparative evaluation of autocontouring in clinical practice: A practical method using the Turing test

Mark J. Gooding; Annamarie J. Smith; Maira Tariq; Paul Aljabar; Devis Peressutti; Judith van der Stoep; Bart Reymen; Daisy Emans; Djoya Hattu; Judith van Loon; Maud de Rooy; Rinus Wanders; Stephanie Peeters; Tim Lustberg; Johan van Soest; Andre Dekker; Wouter van Elmpt

doi:10.1002/mp.13200

Mark J. Gooding^*, Annamarie J. Smith, Maira Tariq, Paul Aljabar, Devis Peressutti, Judith van der Stoep, Bart Reymen, Daisy Emans, Djoya Hattu, Judith van Loon, Maud de Rooy, Rinus Wanders, Stephanie Peeters, Tim Lustberg, Johan van Soest, Andre Dekker, Wouter van Elmpt

^*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Purpose: Automated techniques for estimating the contours of organs and structures in medical images have become more widespread, and a variety of measures are available for assessing their quality. Quantitative measures of geometric agreement, for example, overlap with a gold-standard delineation, are popular but may not predict the level of clinical acceptance for the contouring method. Therefore, surrogate measures that relate more directly to the clinical judgment of contours, and to the way they are used in routine workflows, need to be developed. The purpose of this study is to propose a method (inspired by the Turing test) for providing contour quality measures that directly draw upon practitioners' assessments of manual and automatic contours. This approach assumes that an inability to distinguish automatically produced contours from those of clinical experts would indicate that the contours are of sufficient quality for clinical use. In turn, it is anticipated that such contours would receive less manual editing prior to being accepted for clinical use. In this study, an initial assessment of this approach is performed with radiation oncologists and therapists.

Methods: Eight clinical observers were presented with thoracic organ-at-risk contours through a web interface and were asked to determine whether they were automatically generated or manually delineated. The accuracy of the visual determination was assessed, and the proportion of contours for which the source was misclassified was recorded. Contours of six different organs in a clinical workflow were assessed for 20 patient cases. The time required to edit autocontours to a clinically acceptable standard was also measured, as a gold standard of clinical utility. Established quantitative measures of autocontouring performance, such as the Dice similarity coefficient with respect to the original clinical contour, and the misclassification rate assessed with the proposed framework were evaluated as surrogates of the measured editing time.

Results: The misclassification rates for each organ were: esophagus 30.0%, heart 22.9%, left lung 51.2%, right lung 58.5%, mediastinum envelope 43.9%, and spinal cord 46.8%. The time savings resulting from editing the autocontours compared to the standard clinical workflow were 12%, 25%, 43%, 77%, 46%, and 50%, respectively, for these organs. The median Dice similarity coefficients between the clinical contours and the autocontours were 0.46, 0.90, 0.98, 0.98, 0.94, and 0.86, respectively, for these organs.

Conclusions: A better correspondence with time saving was observed for the misclassification rate than for the quantitative contour measures explored. From this, we conclude that the inability to accurately judge the source of a contour indicates a reduced need for editing and therefore a greater time saving overall. Hence, task-based assessments of contouring performance may be considered as an additional way of evaluating the clinical utility of autosegmentation methods.
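
The comparison reported in the abstract can be sketched numerically. The following Python fragment is a minimal illustration, not the authors' analysis: it defines the Dice similarity coefficient on sets of voxel indices and then computes a Spearman rank correlation between the per-organ values quoted in the abstract. The choice of Spearman correlation, the function names, and the set-based voxel representation are all assumptions made for this sketch; the abstract does not state which statistic the study used.

```python
from math import sqrt


def dice(a, b):
    """Dice similarity coefficient 2|A∩B| / (|A|+|B|) for two sets of voxel indices."""
    if not a and not b:
        return 1.0  # assumed convention: two empty contours agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    var_x = sum((p - mx) ** 2 for p in x)
    var_y = sum((q - my) ** 2 for q in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Per-organ values as quoted in the abstract, in the order: esophagus, heart,
# left lung, right lung, mediastinum envelope, spinal cord.
misclassification_pct = [30.0, 22.9, 51.2, 58.5, 43.9, 46.8]
time_saving_pct = [12, 25, 43, 77, 46, 50]
median_dice = [0.46, 0.90, 0.98, 0.98, 0.94, 0.86]

rho_misclassification = spearman(misclassification_pct, time_saving_pct)
rho_dice = spearman(median_dice, time_saving_pct)

print(f"misclassification rate vs time saving: {rho_misclassification:.2f}")
print(f"median Dice vs time saving:            {rho_dice:.2f}")
```

With only six organs the coefficients are indicative at best, but on these figures the misclassification rate orders the organs more like the time saving does (about 0.77) than the median Dice coefficient does (about 0.52), which is in line with the stated conclusion.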

Original language	English
Pages (from-to)	5105-5115
Number of pages	11
Journal	Medical Physics
Volume	45
Issue number	11
DOIs	https://doi.org/10.1002/mp.13200
Publication status	Published - Nov 2018

Keywords

assessment
autocontouring
editing time
organs-at-risk
Turing test
SEGMENTATION SOFTWARE
ATLAS

Cite this

Gooding, M. J., Smith, A. J., Tariq, M., Aljabar, P., Peressutti, D., van der Stoep, J., Reymen, B., Emans, D., Hattu, D., van Loon, J., de Rooy, M., Wanders, R., Peeters, S., Lustberg, T., van Soest, J., Dekker, A., & van Elmpt, W. (2018). Comparative evaluation of autocontouring in clinical practice: A practical method using the Turing test. Medical Physics, 45(11), 5105-5115. https://doi.org/10.1002/mp.13200

@article{c4729983ab7c492eb13f49dcd4b65a5a,
title = "Comparative evaluation of autocontouring in clinical practice: A practical method using the Turing test",
keywords = "assessment, autocontouring, editing time, organs-at-risk, Turing test, SEGMENTATION SOFTWARE, ATLAS",
author = "Gooding, {Mark J.} and Smith, {Annamarie J.} and Maira Tariq and Paul Aljabar and Devis Peressutti and {van der Stoep}, Judith and Bart Reymen and Daisy Emans and Djoya Hattu and {van Loon}, Judith and {de Rooy}, Maud and Rinus Wanders and Stephanie Peeters and Tim Lustberg and {van Soest}, Johan and Andre Dekker and {van Elmpt}, Wouter",
year = "2018",
month = nov,
doi = "10.1002/mp.13200",
language = "English",
volume = "45",
pages = "5105--5115",
journal = "Medical Physics",
issn = "0094-2405",
publisher = "Wiley",
number = "11",
}
