Benchmarking weakly-supervised deep learning pipelines for whole slide classification in computational pathology

Narmin Ghaffari Laleh; Hannah Sophie Muti; Chiara Maria Lavinia Loeffler; Amelie Echle; Oliver Lester Saldanha; Faisal Mahmood; Ming Y Lu; Christian Trautwein; Rupert Langer; Bastian Dislich; Roman D Buelow; Heike Irmgard Grabsch; Hermann Brenner; Jenny Chang-Claude; Elizabeth Alwers; Titus J Brinker; Firas Khader; Daniel Truhn; Nadine T Gaisa; Peter Boor; Michael Hoffmeister; Volkmar Schulz; Jakob Nikolas Kather

doi:10.1016/j.media.2022.102474

Corresponding author for this work: Jakob Nikolas Kather

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Artificial intelligence (AI) can extract visual information from histopathological slides and yield biological insight and clinical biomarkers. Whole slide images are cut into thousands of tiles, and classification problems are often weakly-supervised: the ground truth is known only for the slide, not for every single tile. In classical weakly-supervised analysis pipelines, all tiles inherit the slide label, while in multiple-instance learning (MIL), only bags of tiles inherit the label. However, it is still unclear how these widely used but markedly different approaches perform relative to each other. We implemented and systematically compared six methods in six clinically relevant end-to-end prediction tasks using data from N=2980 patients for training with rigorous external validation. We tested three classical weakly-supervised approaches with convolutional neural networks and vision transformers (ViT) and three MIL-based approaches with and without an additional attention module. Our results empirically demonstrate that histological tumor subtyping of renal cell carcinoma is an easy task in which all approaches achieve an area under the receiver operating characteristic curve (AUROC) above 0.9. In contrast, we report significant performance differences for clinically relevant tasks of mutation prediction in colorectal, gastric, and bladder cancer. In these mutation prediction tasks, classical weakly-supervised workflows outperformed MIL-based weakly-supervised methods, which is surprising given their simplicity. This shows that new end-to-end image analysis pipelines in computational pathology should be compared to classical weakly-supervised methods. These findings also motivate the development of new methods that combine the elegant assumptions of MIL with the empirically observed higher performance of classical weakly-supervised approaches. We make all source code publicly available at https://github.com/KatherLab/HIA, allowing easy application of all methods to any similar task.
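
The distinction the abstract draws between the two labeling schemes can be illustrated with a minimal Python sketch. This is not the code from the KatherLab/HIA repository; every function name here is hypothetical. Two simplifications are assumptions made only for illustration: averaging tile scores to obtain a slide-level score in the classical setting, and applying attention weights to scalar tile scores rather than to learned feature vectors. The last function shows how an AUROC of the kind reported above can be computed from slide-level scores.

```python
import math
import random


def tile_level_dataset(slides):
    """Classical weak supervision: every tile inherits its slide's label."""
    return [(tile, label) for tiles, label in slides for tile in tiles]


def bag_level_dataset(slides, bag_size, rng):
    """MIL: a bag of tiles, not each single tile, carries the slide label."""
    bags = []
    for tiles, label in slides:
        k = min(bag_size, len(tiles))
        bags.append((rng.sample(tiles, k), label))
    return bags


def mean_pool(tile_scores):
    """Assumed classical aggregation: average per-tile scores per slide."""
    return sum(tile_scores) / len(tile_scores)


def attention_pool(tile_scores, attention_logits):
    """Simplified attention pooling: softmax weights, then a weighted sum."""
    m = max(attention_logits)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in attention_logits]
    total = sum(exps)
    return sum((e / total) * s for e, s in zip(exps, tile_scores))


def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy slides: (list of tile identifiers, slide-level label).
    slides = [(["a1", "a2", "a3"], 1), (["b1", "b2"], 0)]
    print(tile_level_dataset(slides))
    print(bag_level_dataset(slides, bag_size=2, rng=random.Random(0)))
    print(auroc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))
```

In the tile-level scheme the training set grows with the number of tiles, and every tile is treated as if it showed the slide-level finding. In the bag-level scheme one training example corresponds to one bag, so only the aggregated prediction is compared with the label.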

Original language	English
Article number	102474
Number of pages	15
Journal	Medical Image Analysis
Volume	79
Early online date	4 May 2022
DOIs	https://doi.org/10.1016/j.media.2022.102474
Publication status	Published - Jul 2022

Keywords

Artificial intelligence
BIOPSIES
COLONOSCOPY
COLORECTAL-CANCER
Computational pathology
Convolutional neural networks
MICROSATELLITE INSTABILITY
Multiple-Instance Learning
NEURAL-NETWORK
PREDICTION
PROSTATE-CANCER
Vision transformers
Weakly-supervised deep learning

Access to Document

10.1016/j.media.2022.102474

1 Erratum / corrigendum / retractions

Erratum to 'Benchmarking weakly-supervised deep learning pipelines for whole slide classification in computational pathology' Medical Image Analysis, Volume 79, July 2022, 102474
Ghaffari Laleh, N., Muti, H. S., Loeffler, C. M. L., Echle, A., Saldanha, O. L., Mahmood, F., Lu, M. Y., Trautwein, C., Langer, R., Dislich, B., Buelow, R. D., Grabsch, H. I., Brenner, H., Chang-Claude, J., Alwers, E., Brinker, T. J., Khader, F., Truhn, D., Gaisa, N. T., Boor, P., Hoffmeister, M., Schulz, V. & Kather, J. N., Nov 2022, In: Medical Image Analysis. 82, 1 p., 102622.
Research output: Contribution to journal › Erratum / corrigendum / retractions › Academic

Open Access

Cite this

Ghaffari Laleh, N., Muti, H. S., Loeffler, C. M. L., Echle, A., Saldanha, O. L., Mahmood, F., Lu, M. Y., Trautwein, C., Langer, R., Dislich, B., Buelow, R. D., Grabsch, H. I., Brenner, H., Chang-Claude, J., Alwers, E., Brinker, T. J., Khader, F., Truhn, D., Gaisa, N. T., ... Kather, J. N. (2022). Benchmarking weakly-supervised deep learning pipelines for whole slide classification in computational pathology. Medical Image Analysis, 79, Article 102474. https://doi.org/10.1016/j.media.2022.102474

@article{894ceb7bda10405eb8c38459ec0486d5,

title = "Benchmarking weakly-supervised deep learning pipelines for whole slide classification in computational pathology",

abstract = "Artificial intelligence (AI) can extract visual information from histopathological slides and yield biological insight and clinical biomarkers. Whole slide images are cut into thousands of tiles, and classification problems are often weakly-supervised: the ground truth is known only for the slide, not for every single tile. In classical weakly-supervised analysis pipelines, all tiles inherit the slide label, while in multiple-instance learning (MIL), only bags of tiles inherit the label. However, it is still unclear how these widely used but markedly different approaches perform relative to each other. We implemented and systematically compared six methods in six clinically relevant end-to-end prediction tasks using data from N=2980 patients for training with rigorous external validation. We tested three classical weakly-supervised approaches with convolutional neural networks and vision transformers (ViT) and three MIL-based approaches with and without an additional attention module. Our results empirically demonstrate that histological tumor subtyping of renal cell carcinoma is an easy task in which all approaches achieve an area under the receiver operating characteristic curve (AUROC) above 0.9. In contrast, we report significant performance differences for clinically relevant tasks of mutation prediction in colorectal, gastric, and bladder cancer. In these mutation prediction tasks, classical weakly-supervised workflows outperformed MIL-based weakly-supervised methods, which is surprising given their simplicity. This shows that new end-to-end image analysis pipelines in computational pathology should be compared to classical weakly-supervised methods. These findings also motivate the development of new methods that combine the elegant assumptions of MIL with the empirically observed higher performance of classical weakly-supervised approaches. We make all source code publicly available at https://github.com/KatherLab/HIA, allowing easy application of all methods to any similar task.",

keywords = "Artificial intelligence, BIOPSIES, COLONOSCOPY, COLORECTAL-CANCER, Computational pathology, Convolutional neural networks, MICROSATELLITE INSTABILITY, Multiple-Instance Learning, NEURAL-NETWORK, PREDICTION, PROSTATE-CANCER, Vision transformers, Weakly-supervised deep learning",

author = "{Ghaffari Laleh}, Narmin and Muti, {Hannah Sophie} and Loeffler, {Chiara Maria Lavinia} and Amelie Echle and Saldanha, {Oliver Lester} and Faisal Mahmood and Lu, {Ming Y} and Christian Trautwein and Rupert Langer and Bastian Dislich and Buelow, {Roman D} and Grabsch, {Heike Irmgard} and Hermann Brenner and Jenny Chang-Claude and Elizabeth Alwers and Brinker, {Titus J} and Firas Khader and Daniel Truhn and Gaisa, {Nadine T} and Peter Boor and Michael Hoffmeister and Volkmar Schulz and Kather, {Jakob Nikolas}",

note = "Copyright {\textcopyright} 2022. Published by Elsevier B.V.",

year = "2022",

month = jul,

doi = "10.1016/j.media.2022.102474",

language = "English",

volume = "79",

journal = "Medical Image Analysis",

issn = "1361-8415",

publisher = "Elsevier",

}

Ghaffari Laleh, N, Muti, HS, Loeffler, CML, Echle, A, Saldanha, OL, Mahmood, F, Lu, MY, Trautwein, C, Langer, R, Dislich, B, Buelow, RD, Grabsch, HI, Brenner, H, Chang-Claude, J, Alwers, E, Brinker, TJ, Khader, F, Truhn, D, Gaisa, NT, Boor, P, Hoffmeister, M, Schulz, V & Kather, JN 2022, 'Benchmarking weakly-supervised deep learning pipelines for whole slide classification in computational pathology', Medical Image Analysis, vol. 79, 102474. https://doi.org/10.1016/j.media.2022.102474

TY - JOUR

T1 - Benchmarking weakly-supervised deep learning pipelines for whole slide classification in computational pathology

AU - Ghaffari Laleh, Narmin

AU - Muti, Hannah Sophie

AU - Loeffler, Chiara Maria Lavinia

AU - Echle, Amelie

AU - Saldanha, Oliver Lester

AU - Mahmood, Faisal

AU - Lu, Ming Y

AU - Trautwein, Christian

AU - Langer, Rupert

AU - Dislich, Bastian

AU - Buelow, Roman D

AU - Grabsch, Heike Irmgard

AU - Brenner, Hermann

AU - Chang-Claude, Jenny

AU - Alwers, Elizabeth

AU - Brinker, Titus J

AU - Khader, Firas

AU - Truhn, Daniel

AU - Gaisa, Nadine T

AU - Boor, Peter

AU - Hoffmeister, Michael

AU - Schulz, Volkmar

AU - Kather, Jakob Nikolas

PY - 2022/7

Y1 - 2022/7

AB - Artificial intelligence (AI) can extract visual information from histopathological slides and yield biological insight and clinical biomarkers. Whole slide images are cut into thousands of tiles, and classification problems are often weakly-supervised: the ground truth is known only for the slide, not for every single tile. In classical weakly-supervised analysis pipelines, all tiles inherit the slide label, while in multiple-instance learning (MIL), only bags of tiles inherit the label. However, it is still unclear how these widely used but markedly different approaches perform relative to each other. We implemented and systematically compared six methods in six clinically relevant end-to-end prediction tasks using data from N=2980 patients for training with rigorous external validation. We tested three classical weakly-supervised approaches with convolutional neural networks and vision transformers (ViT) and three MIL-based approaches with and without an additional attention module. Our results empirically demonstrate that histological tumor subtyping of renal cell carcinoma is an easy task in which all approaches achieve an area under the receiver operating characteristic curve (AUROC) above 0.9. In contrast, we report significant performance differences for clinically relevant tasks of mutation prediction in colorectal, gastric, and bladder cancer. In these mutation prediction tasks, classical weakly-supervised workflows outperformed MIL-based weakly-supervised methods, which is surprising given their simplicity. This shows that new end-to-end image analysis pipelines in computational pathology should be compared to classical weakly-supervised methods. These findings also motivate the development of new methods that combine the elegant assumptions of MIL with the empirically observed higher performance of classical weakly-supervised approaches. We make all source code publicly available at https://github.com/KatherLab/HIA, allowing easy application of all methods to any similar task.

KW - Artificial intelligence

KW - BIOPSIES

KW - COLONOSCOPY

KW - COLORECTAL-CANCER

KW - Computational pathology

KW - Convolutional neural networks

KW - MICROSATELLITE INSTABILITY

KW - Multiple-Instance Learning

KW - NEURAL-NETWORK

KW - PREDICTION

KW - PROSTATE-CANCER

KW - Vision transformers

KW - Weakly-supervised deep learning

U2 - 10.1016/j.media.2022.102474

DO - 10.1016/j.media.2022.102474

M3 - Article

C2 - 35588568

SN - 1361-8415

VL - 79

JO - Medical Image Analysis

JF - Medical Image Analysis

M1 - 102474

ER -
