Automatic coronavirus disease 2019 diagnosis based on chest radiography and deep learning - Success story or dataset bias?

J. Dhont, C. Wolfs, F. Verhaegen*

*Corresponding author for this work

Research output: Contribution to journal, Article, Academic, peer-review

Abstract

Purpose: Over the last 2 years, the artificial intelligence (AI) community has presented several automatic screening tools for coronavirus disease 2019 (COVID-19) based on chest radiography (CXR), with reported accuracies often well over 90%. However, it has been noted that many of these studies have likely suffered from dataset bias, leading to overly optimistic results. The purpose of this study was to investigate to what extent biases have influenced the performance of a range of previously proposed and promising convolutional neural networks (CNNs), and to determine what performance can be expected with current CNNs on a realistic and unbiased dataset.

Methods: Five CNNs for COVID-19 positive/negative classification were implemented for evaluation, namely VGG19, ResNet50, InceptionV3, DenseNet201, and COVID-Net. To perform both internal and cross-dataset evaluations, four datasets were created. The first dataset, Valencian Region Medical Image Bank (BIMCV), followed strict reverse transcriptase-polymerase chain reaction (RT-PCR) test criteria and was created from a single reliable open access databank, while the second dataset (COVIDxB8) was created through a combination of six online CXR repositories. The third and fourth datasets were created by combining the opposing classes from the BIMCV and COVIDxB8 datasets. To decrease inter-dataset variability, a pre-processing workflow of resizing, normalization, and histogram equalization was applied to all datasets. Classification performance was evaluated on unseen test sets using precision and recall. A qualitative sanity check was performed by evaluating saliency maps displaying the top 5%, 10%, and 20% most salient segments in the input CXRs, to assess whether the CNNs were using relevant information for decision making. In an additional experiment, to further investigate the origin of potential dataset bias, all pixel values outside the lungs were set to zero through automatic lung segmentation before training and testing.

Results: When trained and evaluated on the single online source dataset (BIMCV), the performance of all CNNs was relatively low (precision: 0.65-0.72, recall: 0.59-0.71), but remained relatively consistent during external evaluation (precision: 0.58-0.82, recall: 0.57-0.72). In contrast, when trained and internally evaluated on the combinatory datasets, all CNNs performed well across all metrics (precision: 0.94-1.00, recall: 0.77-1.00). However, when subsequently evaluated cross-dataset, results dropped substantially (precision: 0.10-0.61, recall: 0.04-0.80). For all datasets, saliency maps revealed that the CNNs rarely focused on areas inside the lungs for their decision making. However, even when all pixel values outside the lungs were set to zero, classification performance did not change and dataset bias remained.

Conclusions: Results in this study confirm that when trained on a combinatory dataset, CNNs tend to learn the origin of the CXRs rather than the presence or absence of disease, a behavior known as shortcut learning. The bias is shown to originate from differences in overall pixel values rather than embedded text or symbols, despite consistent image pre-processing. When trained on a reliable and realistic single-source dataset in which non-lung pixels have been masked, CNNs currently show limited sensitivity (<70%) for COVID-19 infection in CXR, which calls into question their use as a reliable automatic screening tool.
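As an illustration of the evaluation steps named in the abstract, the following is a minimal sketch, not the authors' implementation. It shows how precision and recall are computed for a binary COVID-19 classifier, how pixels outside a lung mask can be set to zero, and how one might quantify the share of the most salient pixels that fall inside the lungs. All function names are hypothetical, images are represented as plain nested lists, and ranking individual pixels is an assumed simplification of the "most salient segments" described above.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def mask_outside_lungs(image: Image, lung_mask: Image) -> Image:
    """Set every pixel outside the lung mask to zero (mask: nonzero = lung)."""
    return [
        [px if m else 0 for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, lung_mask)
    ]


def top_salient_fraction_in_lungs(saliency: Image, lung_mask: Image, fraction: float) -> float:
    """Share of the top `fraction` most salient pixels that lie inside the lungs.

    Assumption: pixels are ranked individually; the study ranks image segments.
    """
    flat = [
        (s, bool(m))
        for s_row, m_row in zip(saliency, lung_mask)
        for s, m in zip(s_row, m_row)
    ]
    k = max(1, round(fraction * len(flat)))
    top = sorted(flat, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(1 for _, inside in top if inside) / k


if __name__ == "__main__":
    # Toy labels: 1 = COVID-19 positive, 0 = negative.
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0]
    print(precision_recall(y_true, y_pred))

    image = [[5, 6], [7, 8]]
    mask = [[1, 0], [0, 1]]
    print(mask_outside_lungs(image, mask))
```

A low value from `top_salient_fraction_in_lungs` at the 5%, 10%, or 20% level would correspond to the reported observation that the networks rarely attend to regions inside the lungs.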
Original language: English
Pages (from-to): 978-987
Number of pages: 10
Journal: Medical Physics
Volume: 49
Issue number: 2
Early online date: 12 Jan 2022
Publication status: Published - Feb 2022

Keywords

  • artificial intelligence
  • COVID-19
  • dataset bias
  • X-ray imaging
  • CLASSIFIER
  • RADIOLOGY
