How subjective CT image quality assessment becomes surprisingly reliable: pairwise comparisons instead of Likert scale

Eva J.I. Hoeijmakers; Bibi Martens; Babs M.F. Hendriks; Casper Mihl; Razvan L. Miclea; Walter H. Backes; Joachim E. Wildberger; Frank M. Zijta; Hester A. Gietema; Patricia J. Nelemans; Cécile R.L.P.N. Jeukens

doi:10.1007/s00330-023-10493-7

Corresponding author for this work: Eva J.I. Hoeijmakers

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Objectives: This study aimed to improve the reliability of subjective image quality (IQ) assessment in abdominal CT scans by using a pairwise comparison (PC) method instead of a Likert scale.

Methods: Abdominal CT scans (single-center) were retrospectively selected between September 2019 and February 2020 in a prior study. Sample variance in IQ was obtained by adding artificial noise using dedicated reconstruction software, including reconstructions with filtered backprojection and varying iterative reconstruction strengths. Two datasets (each n = 50) were composed with either higher or lower IQ variation; the 25 original scans were part of both datasets. Using in-house developed software, six observers (five radiologists, one resident) rated both datasets with both the PC method (in which observers are forced to choose the preferred scan from each pair, resulting in a ranking) and a 5-point Likert scale. The PC method was optimized using a sorting algorithm to minimize the number of necessary comparisons. Inter- and intraobserver agreement was assessed for both methods with the intraclass correlation coefficient (ICC).

Results: Twenty-five patients (mean age 61 years ± 15.5; 56% men) were evaluated. For the high-variation dataset, the ICC for interobserver agreement increased from 0.665 (95% CI 0.396–0.814) with the Likert scale to 0.785 (95% CI 0.676–0.867) with the PC method. For the low-variation dataset, the ICC increased from 0.276 (95% CI 0.034–0.500) to 0.562 (95% CI 0.337–0.729). Intraobserver agreement increased for four out of six observers.

Conclusion: The PC method is more reliable for subjective IQ assessment, as indicated by improved inter- and intraobserver agreement.

Clinical relevance statement: This study shows that the pairwise comparison method is a more reliable method for subjective image quality assessment. Improved reliability is of key importance for optimization studies, validation of automatic image quality assessment algorithms, and training of AI algorithms.

Key Points:
• Subjective assessment of diagnostic image quality via Likert scale has limited reliability.
• A pairwise comparison method improves the inter- and intraobserver agreement.
• The pairwise comparison method is more reliable for CT optimization studies.
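The abstract states that the PC method used a sorting algorithm to minimize the number of necessary comparisons, but it does not say which algorithm. The sketch below is only an illustration of the general idea, not the authors' in-house software: it assumes Python's built-in sort as the sorting algorithm, and it replaces the human observer with a hypothetical `simulated_observer` that consults a made-up hidden quality score per scan. The names `rank_by_pairwise_choice`, `hidden_quality`, and the scan labels are all assumptions for this example.

```python
from functools import cmp_to_key
import random


def rank_by_pairwise_choice(items, prefer):
    """Rank items from least to most preferred using only pairwise choices.

    prefer(a, b) must return True when the observer prefers a over b.
    Returns the ranking and the number of pairs that were presented.
    """
    asked = 0

    def compare(a, b):
        nonlocal asked
        asked += 1  # every comparator call is one pair shown to the observer
        return 1 if prefer(a, b) else -1

    # The sort decides which pairs to present, so far fewer than all
    # possible pairs need to be judged.
    ranking = sorted(items, key=cmp_to_key(compare))
    return ranking, asked


# Hypothetical stand-in for a human observer: each scan gets a made-up
# hidden quality score, and the "observer" always prefers the higher one.
rng = random.Random(0)
hidden_quality = {f"scan_{i:02d}": rng.random() for i in range(50)}


def simulated_observer(a, b):
    return hidden_quality[a] > hidden_quality[b]


scans = list(hidden_quality)
rng.shuffle(scans)

ranking, asked = rank_by_pairwise_choice(scans, simulated_observer)
exhaustive = len(scans) * (len(scans) - 1) // 2  # all pairs for n = 50

print(f"pairs presented: {asked}, all possible pairs: {exhaustive}")
```

A real observer is not perfectly consistent, so a sort-driven ranking can depend on the order in which pairs happen to be presented. How the study handled such inconsistencies is not described in the abstract.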

Original language	English
Number of pages	10
Journal	European Radiology
DOIs	https://doi.org/10.1007/s00330-023-10493-7
Publication status	E-pub ahead of print - 1 Jan 2024

Keywords

Computed tomography (X-ray)
Interobserver variability
Intraobserver variability

Access to Document

10.1007/s00330-023-10493-7 (Licence: CC BY)

Cite this

Hoeijmakers, E. J. I., Martens, B., Hendriks, B. M. F., Mihl, C., Miclea, R. L., Backes, W. H., Wildberger, J. E., Zijta, F. M., Gietema, H. A., Nelemans, P. J., & Jeukens, C. R. L. P. N. (2024). How subjective CT image quality assessment becomes surprisingly reliable: pairwise comparisons instead of Likert scale. European Radiology. Advance online publication. https://doi.org/10.1007/s00330-023-10493-7

@article{767317a49fd44bb1b5405c41113d3182,

title = "How subjective CT image quality assessment becomes surprisingly reliable: pairwise comparisons instead of Likert scale",

abstract = "Objectives: The aim of this study is to improve the reliability of subjective IQ assessment using a pairwise comparison (PC) method instead of a Likert scale method in abdominal CT scans. Methods: Abdominal CT scans (single-center) were retrospectively selected between September 2019 and February 2020 in a prior study. Sample variance in IQ was obtained by adding artificial noise using dedicated reconstruction software, including reconstructions with filtered backprojection and varying iterative reconstruction strengths. Two datasets (each n = 50) were composed with either higher or lower IQ variation with the 25 original scans being part of both datasets. Using in-house developed software, six observers (five radiologists, one resident) rated both datasets via both the PC method (forcing observers to choose preferred scans out of pairs of scans resulting in a ranking) and a 5-point Likert scale. The PC method was optimized using a sorting algorithm to minimize necessary comparisons. The inter- and intraobserver agreements were assessed for both methods with the intraclass correlation coefficient (ICC). Results: Twenty-five patients (mean age 61 years ± 15.5; 56% men) were evaluated. The ICC for interobserver agreement for the high-variation dataset increased from 0.665 (95%CI 0.396–0.814) to 0.785 (95%CI 0.676–0.867) when the PC method was used instead of a Likert scale. For the low-variation dataset, the ICC increased from 0.276 (95%CI 0.034–0.500) to 0.562 (95%CI 0.337–0.729). Intraobserver agreement increased for four out of six observers. Conclusion: The PC method is more reliable for subjective IQ assessment indicated by improved inter- and intraobserver agreement. Clinical relevance statement: This study shows that the pairwise comparison method is a more reliable method for subjective image quality assessment. 
Improved reliability is of key importance for optimization studies, validation of automatic image quality assessment algorithms, and training of AI algorithms. Key Points: • Subjective assessment of diagnostic image quality via Likert scale has limited reliability. • A pairwise comparison method improves the inter- and intraobserver agreement. • The pairwise comparison method is more reliable for CT optimization studies.",

keywords = "Computed tomography (X-ray), Interobserver variability, Intraobserver variability",

author = "Hoeijmakers, {Eva J.I.} and Bibi Martens and Hendriks, {Babs M.F.} and Casper Mihl and Miclea, {Razvan L.} and Backes, {Walter H.} and Wildberger, {Joachim E.} and Zijta, {Frank M.} and Gietema, {Hester A.} and Nelemans, {Patricia J.} and Jeukens, {C{\'e}cile R.L.P.N.}",

note = "Publisher Copyright: {\textcopyright} 2023, The Author(s).",

year = "2024",

month = jan,

day = "1",

doi = "10.1007/s00330-023-10493-7",

language = "English",

journal = "European Radiology",

issn = "0938-7994",

publisher = "Springer, Cham",

}

TY - JOUR

T1 - How subjective CT image quality assessment becomes surprisingly reliable

T2 - pairwise comparisons instead of Likert scale

AU - Hoeijmakers, Eva J.I.

AU - Martens, Bibi

AU - Hendriks, Babs M.F.

AU - Mihl, Casper

AU - Miclea, Razvan L.

AU - Backes, Walter H.

AU - Wildberger, Joachim E.

AU - Zijta, Frank M.

AU - Gietema, Hester A.

AU - Nelemans, Patricia J.

AU - Jeukens, Cécile R.L.P.N.

PY - 2024/1/1

Y1 - 2024/1/1

AB - Objectives: The aim of this study is to improve the reliability of subjective IQ assessment using a pairwise comparison (PC) method instead of a Likert scale method in abdominal CT scans. Methods: Abdominal CT scans (single-center) were retrospectively selected between September 2019 and February 2020 in a prior study. Sample variance in IQ was obtained by adding artificial noise using dedicated reconstruction software, including reconstructions with filtered backprojection and varying iterative reconstruction strengths. Two datasets (each n = 50) were composed with either higher or lower IQ variation with the 25 original scans being part of both datasets. Using in-house developed software, six observers (five radiologists, one resident) rated both datasets via both the PC method (forcing observers to choose preferred scans out of pairs of scans resulting in a ranking) and a 5-point Likert scale. The PC method was optimized using a sorting algorithm to minimize necessary comparisons. The inter- and intraobserver agreements were assessed for both methods with the intraclass correlation coefficient (ICC). Results: Twenty-five patients (mean age 61 years ± 15.5; 56% men) were evaluated. The ICC for interobserver agreement for the high-variation dataset increased from 0.665 (95%CI 0.396–0.814) to 0.785 (95%CI 0.676–0.867) when the PC method was used instead of a Likert scale. For the low-variation dataset, the ICC increased from 0.276 (95%CI 0.034–0.500) to 0.562 (95%CI 0.337–0.729). Intraobserver agreement increased for four out of six observers. Conclusion: The PC method is more reliable for subjective IQ assessment indicated by improved inter- and intraobserver agreement. Clinical relevance statement: This study shows that the pairwise comparison method is a more reliable method for subjective image quality assessment. Improved reliability is of key importance for optimization studies, validation of automatic image quality assessment algorithms, and training of AI algorithms. Key Points: • Subjective assessment of diagnostic image quality via Likert scale has limited reliability. • A pairwise comparison method improves the inter- and intraobserver agreement. • The pairwise comparison method is more reliable for CT optimization studies.

KW - Computed tomography (X-ray)

KW - Interobserver variability

KW - Intraobserver variability

U2 - 10.1007/s00330-023-10493-7

DO - 10.1007/s00330-023-10493-7

M3 - Article

SN - 0938-7994

JO - European Radiology

JF - European Radiology

ER -