De-Identification of Protected Health Information PHI from Free Text in Medical Records.

Geetha Mahadevaiah^*, M.S. Dinesh, Sana Moin, Andre Dekker

doi:10.5121/ijsptm.2019.8201

^*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Medical health records often contain the results of clinical investigations and critical information about patient health conditions. Alongside this health information, the records can contain patient Protected Health Information (PHI) such as names, locations and dates. Under the Health Insurance Portability and Accountability Act (HIPAA), all types of PHI must be de-identified before medical records are shared with researchers and others. Manual de-identification by human annotators is laborious and error-prone, so a reliable automated de-identification system is needed.
In this work, various state-of-the-art techniques for de-identifying patient notes in electronic health records were analyzed for their performance. Based on the performance quoted in the literature, NeuroNER was selected to de-identify Indian radiology reports. NeuroNER is a named-entity recognition tool for text de-identification developed at the Massachusetts Institute of Technology (MIT). The tool is based on artificial neural networks, is written in Python, uses the TensorFlow machine-learning framework, and comes with five pre-trained models. To test the NeuroNER models on Indian context data such as person and place names, 3300 medical records were simulated by extracting clinical findings and remarks from the MIMIC-III data set. To collect the relevant Indian data, various websites were scraped for Indian names, Indian locations (all towns and cities), and Indian hospital and unit names. During testing of the NeuroNER system, we observed that some of the Indian data, such as names and locations, were not de-identified satisfactorily. To improve the performance of NeuroNER on Indian context data, a new pre-trained model was added alongside the existing NeuroNER pre-trained model to handle Indian medical reports. A medical dictionary lookup was used to reduce the number of misclassifications. Results from all four pre-trained models and the model trained on Indian simulated data were concatenated, and a final PHI token list was generated to anonymize the medical records and obtain de-identified records. Using this approach, we improved the applicability of the NeuroNER system to Indian data and improved its efficiency and reliability. Of the simulated reports, 2000 were used as the training set for transfer learning, 1000 as the test set, and 300 as the validation (unseen) set.
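
The merging step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not call NeuroNER: the function names `merge_phi_tokens` and `anonymize`, the `[PHI]` mask, and the sample tokens are all hypothetical. It assumes each model returns a plain list of token strings flagged as PHI, and it uses a set union in place of the concatenation mentioned above, so duplicate tokens collapse.

```python
from typing import Iterable, Set


def merge_phi_tokens(model_outputs: Iterable[Iterable[str]],
                     medical_terms: Set[str]) -> Set[str]:
    """Combine PHI tokens from several models and drop medical terms.

    model_outputs: one iterable of flagged tokens per model (hypothetical format).
    medical_terms: lowercase dictionary entries that must never be masked.
    """
    merged: Set[str] = set()
    for tokens in model_outputs:
        merged.update(token.lower() for token in tokens)
    # Dictionary lookup: a token that is a known medical term is treated
    # as a misclassification and removed from the PHI list.
    return {token for token in merged if token not in medical_terms}


def anonymize(text: str, phi_tokens: Set[str], mask: str = "[PHI]") -> str:
    """Replace every whitespace-separated word found in phi_tokens with mask."""
    masked_words = []
    for word in text.split():
        core = word.strip(".,;:()")  # ignore surrounding punctuation
        if core and core.lower() in phi_tokens:
            masked_words.append(word.replace(core, mask))
        else:
            masked_words.append(word)
    return " ".join(masked_words)


# Hypothetical outputs from two models; the second wrongly flags "Edema".
outputs = [["Ramesh", "Mysore"], ["Ramesh", "Edema"]]
phi = merge_phi_tokens(outputs, medical_terms={"edema"})
print(anonymize("Patient Ramesh from Mysore, mild edema.", phi))
```

Matching on lowercase whole words keeps the sketch short; a real system would work on the token offsets reported by each model rather than on string matches.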

Original language	English
Pages (from-to)	1-11
Number of pages	11
Journal	International Journal of Security, Privacy and Trust Management
Volume	8
Issue number	1/2
DOIs	https://doi.org/10.5121/ijsptm.2019.8201
Publication status	Published - May 2019

Access to Document

10.5121/ijsptm.2019.8201

Cite this

@article{40ce80eadc624858a34e5c6245c558ab,

title = "De-Identification of Protected Health Information PHI from Free Text in Medical Records.",

abstract = "Medical health records often contain clinical investigations results and critical information regarding patient health conditions. In these medical records, along with patient health information, patient Protected Health Information (PHI) such as names, locations and date information can co-exist. As per Health Insurance Portability and Accountability Act (HIPAA), before sharing the medical records with researchers and others, all types of PHI information needs to be de-identified. Manual de-identification through human annotators is laborious and error prone, hence, a reliable automated de-identification system is need of the hour. In this work, various state of the art techniques for de-identification of patient notes in electronic health records were analyzed for their performance, based on the performance quoted in the literature, NeuroNER was selected to de-identify Indian Radiology reports. NeuroNER is a named-entity recognition text de-identification tool developed by Massachusetts Institute of Technology (MIT). This tool is based on the Artificial Neural Networks written in Python and uses Tensorflow machine-learning framework and it comes with five pre-trained models. To test the NeuroNER models on Indian context data such as name of the person and place, 3300 medical records were simulated. Medical records were simulated by extracting clinical findings, remarks from MIMIC-III data set. For collection of all the relevant Indian data, various websites were scraped to include Indian names, Indian locations (all towns and cities), and Indian Hospital and unit names. During the testing of NeuroNER system, we observed that some of the Indian data such as name, location, etc. were not de-identified satisfactorily. To improve the performance of NeuroNER on Indian context data, along with the existing NeuroNER pre-trained model, a new pre-trained model was added to handle Indian medical reports. Medical dictionary lookup was used to reduce number of misclassifications. Results from all four pre-trained models and the model trained on Indian simulated data were concatenated and final PHI token list was generated to anonymize the medical records to obtain de-identified records. Using this approach, we improved the applicability of the NeuroNER system to Indian data and improved its efficiency and reliability. 2000 simulated reports were used for transfer learning as training set, 1000 reports were used for test set and 300 reports were used for validation (unseen) set.",

author = "Geetha Mahadevaiah and M.S. Dinesh and Sana Moin and Andre Dekker",

year = "2019",

month = may,

doi = "10.5121/ijsptm.2019.8201",

language = "English",

volume = "8",

pages = "1--11",

journal = "International Journal of Security, Privacy and Trust Management",

issn = "2319-4103",

number = "1/2",

}

TY - JOUR

T1 - De-Identification of Protected Health Information PHI from Free Text in Medical Records.

AU - Mahadevaiah, Geetha

AU - Dinesh, M.S.

AU - Moin, Sana

AU - Dekker, Andre

PY - 2019/5

Y1 - 2019/5

AB - Medical health records often contain clinical investigations results and critical information regarding patient health conditions. In these medical records, along with patient health information, patient Protected Health Information (PHI) such as names, locations and date information can co-exist. As per Health Insurance Portability and Accountability Act (HIPAA), before sharing the medical records with researchers and others, all types of PHI information needs to be de-identified. Manual de-identification through human annotators is laborious and error prone, hence, a reliable automated de-identification system is need of the hour. In this work, various state of the art techniques for de-identification of patient notes in electronic health records were analyzed for their performance, based on the performance quoted in the literature, NeuroNER was selected to de-identify Indian Radiology reports. NeuroNER is a named-entity recognition text de-identification tool developed by Massachusetts Institute of Technology (MIT). This tool is based on the Artificial Neural Networks written in Python and uses Tensorflow machine-learning framework and it comes with five pre-trained models. To test the NeuroNER models on Indian context data such as name of the person and place, 3300 medical records were simulated. Medical records were simulated by extracting clinical findings, remarks from MIMIC-III data set. For collection of all the relevant Indian data, various websites were scraped to include Indian names, Indian locations (all towns and cities), and Indian Hospital and unit names. During the testing of NeuroNER system, we observed that some of the Indian data such as name, location, etc. were not de-identified satisfactorily. To improve the performance of NeuroNER on Indian context data, along with the existing NeuroNER pre-trained model, a new pre-trained model was added to handle Indian medical reports. Medical dictionary lookup was used to reduce number of misclassifications. Results from all four pre-trained models and the model trained on Indian simulated data were concatenated and final PHI token list was generated to anonymize the medical records to obtain de-identified records. Using this approach, we improved the applicability of the NeuroNER system to Indian data and improved its efficiency and reliability. 2000 simulated reports were used for transfer learning as training set, 1000 reports were used for test set and 300 reports were used for validation (unseen) set.

UR - https://aircconline.com/ijsptm/V8N2/8219ijsptm01.pdf

U2 - 10.5121/ijsptm.2019.8201

DO - 10.5121/ijsptm.2019.8201

M3 - Article

SN - 2319-4103

VL - 8

SP - 1

EP - 11

JO - International Journal of Security, Privacy and Trust Management

JF - International Journal of Security, Privacy and Trust Management

IS - 1/2

ER -
