Privacy preserving distributed learning classifiers-Sequential learning with small sets of data

F. Zerka; V. Urovi; F. Bottari; R.T.H. Leijenaar; S. Walsh; H. Gabrani-Juma; M. Gueuning; A. Vaidyanathan; W. Vos; M. Occhipinti; H.C. Woodruff; M. Dumontier; P. Lambin

doi:10.1016/j.compbiomed.2021.104716

Corresponding author: F. Zerka

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Background: Artificial intelligence (AI) typically requires a significant amount of high-quality data to build reliable models, where gathering enough data within a single institution can be particularly challenging. In this study we investigated the impact of using sequential learning to exploit very small, siloed sets of clinical and imaging data to train AI models. Furthermore, we evaluated the capacity of such models to achieve equivalent performance when compared to models trained with the same data over a single centralized database. Methods: We propose a privacy-preserving distributed learning framework that learns sequentially from each dataset. The framework is applied to three machine learning algorithms: Logistic Regression, Support Vector Machines (SVM), and Perceptron. The models were evaluated using four open-source datasets (Breast cancer, Indian liver, NSCLC-Radiomics dataset, and Stage III NSCLC). Findings: The proposed framework ensured a comparable predictive performance against a centralized learning approach. Pairwise DeLong tests showed no significant difference between the compared pairs for each dataset. Interpretation: Distributed learning contributes to preserving medical data privacy. We foresee this technology will increase the number of collaborative opportunities to develop robust AI, becoming the default solution in scenarios where collecting enough data from a single reliable source is logistically impossible. Distributed sequential learning provides a privacy-preserving means for institutions with small but clinically valuable datasets to collaboratively train predictive AI while preserving the privacy of their patients. Such models perform similarly to models that are built on a larger central dataset.
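
The sequential scheme described in the Methods can be illustrated with a minimal sketch. It assumes that only the model parameters travel from one site to the next while each site's records stay local. The function names, the toy data, and the training schedule (`epochs`, `rounds`, `lr`) are hypothetical illustrations, not taken from the paper, and the paper's actual framework may differ.

```python
from typing import List, Sequence, Tuple

# One labelled record: (feature vector, label in {-1, +1}).
Sample = Tuple[Sequence[float], int]


def perceptron_update(weights: List[float], bias: float,
                      site_data: List[Sample],
                      epochs: int = 5, lr: float = 1.0) -> Tuple[List[float], float]:
    """Continue perceptron training on one site's local data.

    Only the returned weights and bias leave the site (assumption of this sketch).
    """
    w = list(weights)
    for _ in range(epochs):
        for x, y in site_data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + bias
            if y * activation <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                bias += lr * y
    return w, bias


def sequential_train(sites: List[List[Sample]], n_features: int,
                     rounds: int = 3) -> Tuple[List[float], float]:
    """Pass the model from site to site, for a fixed number of rounds."""
    w, b = [0.0] * n_features, 0.0
    for _ in range(rounds):
        for site_data in sites:
            w, b = perceptron_update(w, b, site_data)
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 for a single feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Three hypothetical sites, each holding only two records.
sites = [
    [((2.0, 1.0), 1), ((-1.0, -2.0), -1)],
    [((1.5, 2.0), 1), ((-2.0, -0.5), -1)],
    [((3.0, 0.5), 1), ((-0.5, -3.0), -1)],
]
weights, bias = sequential_train(sites, n_features=2)
```

In this sketch the comparison with a centralized model would amount to calling `perceptron_update` once on the concatenation of all site lists and comparing predictions.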

Original language	English
Article number	104716
Number of pages	9
Journal	Computers in Biology and Medicine
Volume	136
DOIs	https://doi.org/10.1016/j.compbiomed.2021.104716
Publication status	Published - 1 Sept 2021

Keywords

Distributed learning
Sequential learning
Rare disease
Medical data privacy
SURVIVAL PREDICTION
CANCER-PATIENTS
HEALTH-CARE
MODEL
BLOCKCHAIN

Access to Document

10.1016/j.compbiomed.2021.104716
Licence: CC BY

Cite this

Zerka, F., Urovi, V., Bottari, F., Leijenaar, R. T. H., Walsh, S., Gabrani-Juma, H., Gueuning, M., Vaidyanathan, A., Vos, W., Occhipinti, M., Woodruff, H. C., Dumontier, M., & Lambin, P. (2021). Privacy preserving distributed learning classifiers-Sequential learning with small sets of data. Computers in Biology and Medicine, 136, Article 104716. https://doi.org/10.1016/j.compbiomed.2021.104716

@article{9e2b4ca4a26f476e896a057436db6a7d,

title = "Privacy preserving distributed learning classifiers-Sequential learning with small sets of data",

abstract = "Background: Artificial intelligence (AI) typically requires a significant amount of high-quality data to build reliable models, where gathering enough data within a single institution can be particularly challenging. In this study we investigated the impact of using sequential learning to exploit very small, siloed sets of clinical and imaging data to train AI models. Furthermore, we evaluated the capacity of such models to achieve equivalent performance when compared to models trained with the same data over a single centralized database. Methods: We propose a privacy-preserving distributed learning framework that learns sequentially from each dataset. The framework is applied to three machine learning algorithms: Logistic Regression, Support Vector Machines (SVM), and Perceptron. The models were evaluated using four open-source datasets (Breast cancer, Indian liver, NSCLC-Radiomics dataset, and Stage III NSCLC). Findings: The proposed framework ensured a comparable predictive performance against a centralized learning approach. Pairwise DeLong tests showed no significant difference between the compared pairs for each dataset. Interpretation: Distributed learning contributes to preserving medical data privacy. We foresee this technology will increase the number of collaborative opportunities to develop robust AI, becoming the default solution in scenarios where collecting enough data from a single reliable source is logistically impossible. Distributed sequential learning provides a privacy-preserving means for institutions with small but clinically valuable datasets to collaboratively train predictive AI while preserving the privacy of their patients. Such models perform similarly to models that are built on a larger central dataset.",

keywords = "Distributed learning, Sequential learning, Rare disease, Medical data privacy, SURVIVAL PREDICTION, CANCER-PATIENTS, HEALTH-CARE, MODEL, BLOCKCHAIN",

author = "F. Zerka and V. Urovi and F. Bottari and R.T.H. Leijenaar and S. Walsh and H. Gabrani-Juma and M. Gueuning and A. Vaidyanathan and W. Vos and M. Occhipinti and H.C. Woodruff and M. Dumontier and P. Lambin",

note = "Funding Information: Authors acknowledge financial support from ERC advanced grant (ERC-ADG-2015, n° 694812 - Hypoximmuno). This research is also supported by the Dutch Technology Foundation STW (grant n° P14-19 Radiomics STRaTegy), which is the applied science division of NWO, Aspasia NWO (grant n° 91716421) and the Technology Program of the Ministry of Economic Affairs. Authors also acknowledge financial support from SME Phase 2 (RAIL - n° 673780), EUROSTARS (DART - n° E10116, DECIDE - n° E11541), the European Program PREDICT - ITN - n° 766276, TRANSCAN Joint Transnational Call 2016 (JTC2016 “CLEARLY” - n° UM 2017–8295), Interreg V-A Euregio Meuse-Rhine (“Euradiomics” - n° EMR4), DRAGON (Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement n° 101005122), EuCanImage (European Union Horizon 2020 research and innovation program under grant agreement n° 952103), and DEEP-MAM (Eurostar grant n° E12931). The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Fadila Zerka, Akshayaa Vaidyanathan, Fabio Bottari, Martin Gueuning, Hanif Gabrani-Juma, and Mariaelena Occhipinti are salaried employees of/receive remuneration from Radiomics (Oncoradiomics SA). Dr Philippe Lambin reports, within and outside the submitted work, grants/sponsored research agreements from Varian medical, Radiomics (Oncoradiomics SA), ptTheragnostic/DNAmito, and Health Innovation Ventures. He received an advisor/presenter fee and/or reimbursement of travel costs/external grant writing fee and/or in-kind manpower contribution from Radiomics (Oncoradiomics SA), BHV, Merck, Varian, Elekta, ptTheragnostic and Convert pharmaceuticals. Dr Lambin has shares in the company Radiomics (Oncoradiomics SA), Convert pharmaceuticals SA and The Medical Cloud Company SPRL and is co-inventor of two issued patents with royalties on radiomics (PCT/NL2014/050248, PCT/NL2014/050728) licensed to Radiomics (Oncoradiomics SA) and one issued patent on mtDNA (PCT/EP2014/059089) licensed to ptTheragnostic/DNAmito, three non-patented inventions (software) licensed to ptTheragnostic/DNAmito, Radiomics (Oncoradiomics SA) and Health Innovation Ventures, and three non-issued, non-licensed patents on Deep Learning-Radiomics and LSRT (N2024482, N2024889, N2024889). Ralph T.H. Leijenaar has shares in the company Radiomics (Oncoradiomics SA) and is co-inventor of an issued patent with royalties on radiomics (PCT/NL2014/050728) licensed to Radiomics (Oncoradiomics SA). Sean Walsh and Wim Vos have shares in the company Radiomics (Oncoradiomics SA). Michel Dumontier has shares in The Medical Cloud Company SPRL. The rest of the co-authors have no known competing financial interests or personal relationships to declare. Publisher Copyright: {\textcopyright} 2021 The Author(s)",

year = "2021",

month = sep,

day = "1",

doi = "10.1016/j.compbiomed.2021.104716",

language = "English",

volume = "136",

journal = "Computers in Biology and Medicine",

issn = "0010-4825",

publisher = "Elsevier Science",

}

Zerka, F, Urovi, V, Bottari, F, Leijenaar, RTH, Walsh, S, Gabrani-Juma, H, Gueuning, M, Vaidyanathan, A, Vos, W, Occhipinti, M, Woodruff, HC, Dumontier, M & Lambin, P 2021, 'Privacy preserving distributed learning classifiers-Sequential learning with small sets of data', Computers in Biology and Medicine, vol. 136, 104716. https://doi.org/10.1016/j.compbiomed.2021.104716

TY - JOUR

T1 - Privacy preserving distributed learning classifiers-Sequential learning with small sets of data

AU - Zerka, F.

AU - Urovi, V.

AU - Bottari, F.

AU - Leijenaar, R.T.H.

AU - Walsh, S.

AU - Gabrani-Juma, H.

AU - Gueuning, M.

AU - Vaidyanathan, A.

AU - Vos, W.

AU - Occhipinti, M.

AU - Woodruff, H.C.

AU - Dumontier, M.

AU - Lambin, P.

N1 - Funding Information: Authors acknowledge financial support from ERC advanced grant (ERC-ADG-2015, n° 694812 - Hypoximmuno). This research is also supported by the Dutch Technology Foundation STW (grant n° P14-19 Radiomics STRaTegy), which is the applied science division of NWO, Aspasia NWO (grant n° 91716421) and the Technology Program of the Ministry of Economic Affairs. Authors also acknowledge financial support from SME Phase 2 (RAIL - n° 673780), EUROSTARS (DART - n° E10116, DECIDE - n° E11541), the European Program PREDICT - ITN - n° 766276, TRANSCAN Joint Transnational Call 2016 (JTC2016 “CLEARLY” - n° UM 2017–8295), Interreg V-A Euregio Meuse-Rhine (“Euradiomics” - n° EMR4), DRAGON (Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement n° 101005122), EuCanImage (European Union Horizon 2020 research and innovation program under grant agreement n° 952103), and DEEP-MAM (Eurostar grant n° E12931). The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Fadila Zerka, Akshayaa Vaidyanathan, Fabio Bottari, Martin Gueuning, Hanif Gabrani-Juma, and Mariaelena Occhipinti are salaried employees of/receive remuneration from Radiomics (Oncoradiomics SA). Dr Philippe Lambin reports, within and outside the submitted work, grants/sponsored research agreements from Varian medical, Radiomics (Oncoradiomics SA), ptTheragnostic/DNAmito, and Health Innovation Ventures. He received an advisor/presenter fee and/or reimbursement of travel costs/external grant writing fee and/or in-kind manpower contribution from Radiomics (Oncoradiomics SA), BHV, Merck, Varian, Elekta, ptTheragnostic and Convert pharmaceuticals. Dr Lambin has shares in the company Radiomics (Oncoradiomics SA), Convert pharmaceuticals SA and The Medical Cloud Company SPRL and is co-inventor of two issued patents with royalties on radiomics (PCT/NL2014/050248, PCT/NL2014/050728) licensed to Radiomics (Oncoradiomics SA) and one issued patent on mtDNA (PCT/EP2014/059089) licensed to ptTheragnostic/DNAmito, three non-patented inventions (software) licensed to ptTheragnostic/DNAmito, Radiomics (Oncoradiomics SA) and Health Innovation Ventures, and three non-issued, non-licensed patents on Deep Learning-Radiomics and LSRT (N2024482, N2024889, N2024889). Ralph T.H. Leijenaar has shares in the company Radiomics (Oncoradiomics SA) and is co-inventor of an issued patent with royalties on radiomics (PCT/NL2014/050728) licensed to Radiomics (Oncoradiomics SA). Sean Walsh and Wim Vos have shares in the company Radiomics (Oncoradiomics SA). Michel Dumontier has shares in The Medical Cloud Company SPRL. The rest of the co-authors have no known competing financial interests or personal relationships to declare. Publisher Copyright: © 2021 The Author(s)

PY - 2021/9/1

Y1 - 2021/9/1

AB - Background: Artificial intelligence (AI) typically requires a significant amount of high-quality data to build reliable models, where gathering enough data within a single institution can be particularly challenging. In this study we investigated the impact of using sequential learning to exploit very small, siloed sets of clinical and imaging data to train AI models. Furthermore, we evaluated the capacity of such models to achieve equivalent performance when compared to models trained with the same data over a single centralized database. Methods: We propose a privacy-preserving distributed learning framework that learns sequentially from each dataset. The framework is applied to three machine learning algorithms: Logistic Regression, Support Vector Machines (SVM), and Perceptron. The models were evaluated using four open-source datasets (Breast cancer, Indian liver, NSCLC-Radiomics dataset, and Stage III NSCLC). Findings: The proposed framework ensured a comparable predictive performance against a centralized learning approach. Pairwise DeLong tests showed no significant difference between the compared pairs for each dataset. Interpretation: Distributed learning contributes to preserving medical data privacy. We foresee this technology will increase the number of collaborative opportunities to develop robust AI, becoming the default solution in scenarios where collecting enough data from a single reliable source is logistically impossible. Distributed sequential learning provides a privacy-preserving means for institutions with small but clinically valuable datasets to collaboratively train predictive AI while preserving the privacy of their patients. Such models perform similarly to models that are built on a larger central dataset.

KW - Distributed learning

KW - Sequential learning

KW - Rare disease

KW - Medical data privacy

KW - SURVIVAL PREDICTION

KW - CANCER-PATIENTS

KW - HEALTH-CARE

KW - MODEL

KW - BLOCKCHAIN

U2 - 10.1016/j.compbiomed.2021.104716

DO - 10.1016/j.compbiomed.2021.104716

M3 - Article

C2 - 34364262

SN - 0010-4825

VL - 136

JO - Computers in Biology and Medicine

JF - Computers in Biology and Medicine

M1 - 104716

ER -