Robust Estimation of Breast Cancer Incidence Risk in Presence of Incomplete or Inaccurate Information

Siva Teja Kakileti; Geetha Manjunath; Andre Dekker; Leonard Wee

doi:10.31557/APJCP.2020.21.8.2307

Siva Teja Kakileti^*, Geetha Manjunath, Andre Dekker, Leonard Wee

^*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

PURPOSE: To evaluate the robustness of multiple machine learning classifiers for breast cancer risk estimation in the presence of incomplete or inaccurate information.

DATA AND METHODS: Open data for this study were obtained from the BCSC Data Resource (http://breastscreening.cancer.gov/). We conducted two ablation-type experiments to compare the robustness of different classifiers: in one experiment we randomly switched known information to missing with probability pm, and in the other we randomly corrupted the existing information with probability pc. We considered three prominent machine-learning classifiers, namely Logistic Regression (LR), Random Forests (RF) and a custom Neural Network (NN) architecture, and compared the degradation of their discrimination performance as a function of increasing probability of missing or inaccurate data.

RESULTS: LR, RF and the custom NN achieved an Area Under the Curve (AUC) of 0.645, 0.643 and 0.649, respectively, on a test set with 500,000 total observations. When we manipulated the data by varying the probabilities pm and pc from 0 to 1, the NN gave a better AUC than RF and LR as long as less than half the data was missing or inaccurate (that is, for pm < 0.5 and pc < 0.5). However, for missing (pm) or corruption (pc) probabilities above 0.5, LR performed similarly to the custom NN. RF performed worse overall when the data had additional missing or incorrect entries.

CONCLUSION: In cases where the input information is missing or inaccurate, our experiments show that the proposed custom NN provides reliable risk estimates in medical datasets like BCSC. These results are particularly important in health care applications where not every attribute of the individual participant might be available.
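The two ablation experiments described under DATA AND METHODS can be illustrated with a minimal Python sketch. The abstract does not say how missing values were encoded or how corrupted values were chosen, so everything below is an assumption made for illustration: features are integer-coded categories in 0..k-1, the sentinel `MISSING = -1` marks an unknown value, a corrupted entry is replaced by a different valid category drawn uniformly, and the function names `ablate_missing` and `ablate_corrupt` are hypothetical.

```python
import numpy as np

MISSING = -1  # hypothetical sentinel for "unknown"; not taken from the paper


def ablate_missing(X, p_m, rng):
    """Set each entry of X to MISSING independently with probability p_m."""
    mask = rng.random(X.shape) < p_m
    X_out = X.copy()
    X_out[mask] = MISSING
    return X_out


def ablate_corrupt(X, p_c, n_levels, rng):
    """With probability p_c, replace an entry by a different valid category.

    Assumes column j holds integer codes in 0..n_levels[j]-1 with
    n_levels[j] >= 2, so a different category always exists.
    """
    mask = rng.random(X.shape) < p_c
    X_out = X.copy()
    for j, k in enumerate(n_levels):
        rows = np.flatnonzero(mask[:, j])
        # A shift of 1..k-1 modulo k never maps a value onto itself.
        shift = rng.integers(1, k, size=rows.size)
        X_out[rows, j] = (X[rows, j] + shift) % k
    return X_out


# Toy data standing in for integer-coded risk factors (not BCSC data).
rng = np.random.default_rng(0)
n_levels = [3, 4, 2]
X = np.column_stack([rng.integers(0, k, size=1000) for k in n_levels])

X_missing = ablate_missing(X, p_m=0.3, rng=rng)
X_corrupt = ablate_corrupt(X, p_c=0.3, n_levels=n_levels, rng=rng)
```

Sweeping `p_m` or `p_c` from 0 to 1 and scoring a fitted classifier on each ablated copy of the test set would give the kind of degradation curve the abstract compares across LR, RF and the NN.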

Original language	English
Pages (from-to)	2307-2313
Number of pages	7
Journal	Asian Pacific Journal of Cancer Prevention
Volume	21
Issue number	8
DOIs	https://doi.org/10.31557/APJCP.2020.21.8.2307
Publication status	Published - 1 Aug 2020

Access to Document

10.31557/APJCP.2020.21.8.2307 (Licence: CC BY)

Cite this

@article{70419e6054994a608db41264ee860ad7,

title = "Robust Estimation of Breast Cancer Incidence Risk in Presence of Incomplete or Inaccurate Information",

abstract = "PURPOSE: To evaluate the robustness of multiple machine learning classifiers for breast cancer risk estimation in the presence of incomplete or inaccurate information. DATA AND METHODS: Open data for this study was obtained from the BCSC Data Resource (http://breastscreening.cancer.gov/). We conducted two ablation-type experiments to compare the robustness of different classifiers where we randomly switched known information to missing with a missing probability of pm in one experiment, and randomly corrupted the existing information with a probability of pc in another experiment. We considered three prominent machine-learning classifiers such as Logistic regression (LR), Random Forests (RF) and a custom Neural Network (NN) architecture and compared their degradation of discrimination performance as a function of increasing probability of missing or inaccurate data. RESULTS: LR, RF and custom NN resulted in an Area Under Curve (AUC) of 0.645, 0.643 and 0.649, respectively, on a test set with 500,000 total observations. When we manipulated the data by varying probabilities pm and pc from 0 to 1, NN resulted in better performance in terms of AUC compared to RF and LR as long as less than half the data was missing/inaccurate (that is, for values of pm < 0.5 and pc < 0.5). However, for missing (pm) or corruption (pc) probabilities above 0.5, LR gave similar performance as the custom NN. RF resulted in overall poorer performance when the data had additional missing or incorrect entries. CONCLUSION: In cases where the input information is missing or inaccurate, our experiments show that the proposed custom NN provides reliable risk estimates in medical datasets like BCSC. These results are particularly important in health care applications where not every attribute of the individual participant might be available.",

author = "Kakileti, {Siva Teja} and Geetha Manjunath and Andre Dekker and Leonard Wee",

year = "2020",

month = aug,

day = "1",

doi = "10.31557/APJCP.2020.21.8.2307",

language = "English",

volume = "21",

pages = "2307--2313",

journal = "Asian Pacific Journal of Cancer Prevention",

issn = "1513-7368",

publisher = "Asian Pacific Organization for Cancer Prevention",

number = "8",

}

TY - JOUR

T1 - Robust Estimation of Breast Cancer Incidence Risk in Presence of Incomplete or Inaccurate Information

AU - Kakileti, Siva Teja

AU - Manjunath, Geetha

AU - Dekker, Andre

AU - Wee, Leonard

PY - 2020/8/1

Y1 - 2020/8/1

AB - PURPOSE: To evaluate the robustness of multiple machine learning classifiers for breast cancer risk estimation in the presence of incomplete or inaccurate information. DATA AND METHODS: Open data for this study was obtained from the BCSC Data Resource (http://breastscreening.cancer.gov/). We conducted two ablation-type experiments to compare the robustness of different classifiers where we randomly switched known information to missing with a missing probability of pm in one experiment, and randomly corrupted the existing information with a probability of pc in another experiment. We considered three prominent machine-learning classifiers such as Logistic regression (LR), Random Forests (RF) and a custom Neural Network (NN) architecture and compared their degradation of discrimination performance as a function of increasing probability of missing or inaccurate data. RESULTS: LR, RF and custom NN resulted in an Area Under Curve (AUC) of 0.645, 0.643 and 0.649, respectively, on a test set with 500,000 total observations. When we manipulated the data by varying probabilities pm and pc from 0 to 1, NN resulted in better performance in terms of AUC compared to RF and LR as long as less than half the data was missing/inaccurate (that is, for values of pm < 0.5 and pc < 0.5). However, for missing (pm) or corruption (pc) probabilities above 0.5, LR gave similar performance as the custom NN. RF resulted in overall poorer performance when the data had additional missing or incorrect entries. CONCLUSION: In cases where the input information is missing or inaccurate, our experiments show that the proposed custom NN provides reliable risk estimates in medical datasets like BCSC. These results are particularly important in health care applications where not every attribute of the individual participant might be available.

U2 - 10.31557/APJCP.2020.21.8.2307

DO - 10.31557/APJCP.2020.21.8.2307

M3 - Article

C2 - 32856859

SN - 1513-7368

VL - 21

SP - 2307

EP - 2313

JO - Asian Pacific Journal of Cancer Prevention

JF - Asian Pacific Journal of Cancer Prevention

IS - 8

ER -