3 top tools to De-Identify PHI in Healthcare Datasets

Under US law, Protected Health Information (PHI) refers to any information about an individual's health status, health care, or payment for health care. PHI is usually created or collected by healthcare service providers (clinics and hospitals) or payers (insurance companies).

The U.S. Health Insurance Portability and Accountability Act (HIPAA) lists the following 18 identifiers, which must be kept confidential:

  1. Names
  2. All geographical identifiers smaller than the name of a state
  3. Dates (other than year) directly related to an individual
  4. Phone Numbers
  5. Fax numbers
  6. Email addresses
  7. Social Security numbers
  8. Medical record numbers
  9. Health insurance beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers and serial numbers, including license plate numbers
  13. Device identifiers and serial numbers
  14. Web Uniform Resource Locators (URLs)
  15. Internet Protocol (IP) address numbers
  16. Biometric identifiers, including finger, retinal, and voiceprints
  17. Full face photographic images and any comparable images
  18. Any other unique identifying number, character, or code except the unique code assigned by the investigator to code the data

The need for PHI De-identification

Safeguarding PHI and ePHI is essential to mitigating privacy risks. De-identifying personal information reduces the privacy risk to individuals and also reduces the organization’s exposure to breach risk (e.g., reputational damage and remediation costs). Further, personal information should be retained only as long as necessary to fulfill the stated purposes or as required by law or regulations.

Any organization considering the de-identification of personal information should start with the HIPAA Privacy Rule’s standard for the de-identification of protected health information, found in Section 164.514(a) of the rule. Under this standard, health information is not individually identifiable if it does not identify an individual and there is no reasonable basis to believe it can be used to identify an individual.

EHR and EMR datasets usually contain PHI. Healthcare organizations and their business associates that want to share protected health information must do so in accordance with the HIPAA Privacy Rule, which limits the possible uses and disclosures of PHI. Once protected health information has been de-identified, however, the HIPAA Privacy Rule restrictions no longer apply. De-identified datasets can then be shared with data scientists like us for analysis, to unlock insights and trends.

At Ideas2IT, we work with healthcare and health-tech clients like Roche, Netsmart, uLab Systems, Mayo Clinic, and Grapefruit Health, and we have come to rely on a few tools to de-identify PHI in healthcare datasets. Let’s take a brief look at them in this blog.

Methods of De-Identification

No method of de-identifying PHI can guarantee that every risk of re-identification is removed. Most methods try to reduce this risk as far as possible or to within an acceptable range. HIPAA-compliant de-identification of protected health information is possible using two methods:

  1. Safe Harbor
  2. Expert Determination

Safe Harbor

The first HIPAA-compliant way to de-identify protected health information is to remove specific identifiers from the data set. The identifiable data that must be removed are:

  • Names
  • Geographic subdivisions smaller than a state
  • All elements of dates (except year) related to an individual, including admission and discharge dates, birthdate, and date of death, as well as all ages over 89 and any elements of dates (including year) that are indicative of such an age
  • Telephone, cellphone, and fax numbers
  • Email addresses
  • IP addresses
  • Social Security numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Device identifiers and serial numbers
  • Certificate/license numbers
  • Account numbers
  • Vehicle identifiers and serial numbers including license plates
  • Website URLs
  • Full face photos and comparable images
  • Biometric identifiers (including finger and voice prints)
  • Any unique identifying numbers, characteristics or code
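As a rough illustration of the Safe Harbor idea, the sketch below masks a handful of the identifiers listed above using regular expressions and reduces dates to the year. The patterns, the `scrub` function, and the placeholder labels are assumptions made for this example. They cover only a few identifier types and simple US-style formats, and they do not handle names, addresses, or the rule about ages over 89.

```python
import re

# Illustrative patterns only (assumed formats); they cover a small
# subset of the Safe Harbor identifiers.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

# Dates written as MM/DD/YYYY are reduced to the year, which Safe Harbor
# allows to remain in the data.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/(\d{4})\b")


def scrub(text: str) -> str:
    """Replace matched identifiers with [LABEL] and keep only the year of dates."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return DATE.sub(r"\1", text)


note = ("Seen 03/14/2019. Contact jane.doe@example.com or 555-123-4567. "
        "SSN 123-45-6789, IP 10.0.0.7.")
print(scrub(note))
```

Pattern matching of this kind works for identifiers with a fixed shape. Free-text identifiers such as names and locations generally need a model-based approach.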

Expert Determination

This method of de-identification of protected health information requires a HIPAA-covered entity or business associate to obtain an opinion from a qualified statistical expert that the risk of re-identifying an individual from the data set is very small. Expert Determination methodologies exist so that critical data can be used while still protecting patient privacy. In such cases, the methods used to make that determination and justification of the expert’s opinion must be documented and retained by the covered entity or business associate and made available to regulators in the event of an audit or investigation. HIPAA does not define the level of risk of re-identification other than to say it should be ‘very small’. The expert should define ‘very small’ in relation to the context of the data set.
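HIPAA does not prescribe how an expert should measure this risk. One commonly used starting point, shown here purely as an assumption for illustration, is to group records by their quasi-identifiers and look at the smallest group: if a combination of values is shared by only k records, a record in that group can be singled out with probability of roughly 1/k. The function name and the field names below are hypothetical.

```python
from collections import Counter


def max_reidentification_risk(records, quasi_identifiers):
    """Return (k, 1/k), where k is the size of the smallest group of
    records sharing the same quasi-identifier values."""
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    k = min(groups.values())
    return k, 1 / k


# Hypothetical de-identified rows: 3-digit ZIP, birth year, sex.
rows = [
    {"zip3": "021", "birth_year": 1980, "sex": "F"},
    {"zip3": "021", "birth_year": 1980, "sex": "F"},
    {"zip3": "021", "birth_year": 1975, "sex": "M"},
    {"zip3": "021", "birth_year": 1975, "sex": "M"},
    {"zip3": "021", "birth_year": 1975, "sex": "M"},
]
k, risk = max_reidentification_risk(rows, ["zip3", "birth_year", "sex"])
print(k, risk)
```

Whether a given value of k counts as "very small" risk remains the expert's judgment for the specific data set and its context.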

While there is not currently one standard method for de-identification, there are four major organizations that have adopted the Expert Determination standard. They are:

  1. The Institute of Medicine (IOM)
  2. The Health Information Trust Alliance (HITRUST)
  3. The Pharmaceutical Users Software Exchange (PhUSE)
  4. The Council of Canadian Academies

These standards help guide organizations through accessing, storing, and exchanging personal information. These frameworks are a major step in clarifying current methodologies.

Tools we recommend for PHI De-identification

Google Healthcare API

De-identification in Google Healthcare API works at the following levels:

  • At the Dataset Level: De-identification occurs on all data in DICOM stores and FHIR stores of the dataset. If a dataset contains both DICOM instances and FHIR resources, you can de-identify all of the instances and resources at the same time.
  • At the FHIR Store Level: De-identification occurs on all data in a specific FHIR store in a dataset.
  • At the DICOM Store Level: De-identification occurs on all data in a specific DICOM store in a dataset.

We suggest checking Google's documentation for how the API is called at the dataset, FHIR store, and DICOM store levels.

De-identification doesn’t impact the original dataset, FHIR store, DICOM store, or the original data. Depending on how you configure the de-identification, the operation behaves as follows:

  • If you are de-identifying data at the dataset level, de-identified copies of the original data are written to a new dataset called the destination dataset.
  • If you are de-identifying data at the DICOM or FHIR store level, de-identified copies of the original data are written to an existing DICOM or FHIR store in an existing dataset. The output DICOM store and FHIR store are called the destination DICOM store and destination FHIR store, respectively.

The source dataset, FHIR store, or DICOM store and the destination dataset, FHIR store, or DICOM store must reside in the same Google Cloud location. De-identifying data across multiple Google Cloud locations is not supported.
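The behavior described above can be sketched as a small helper that prepares a de-identify request and enforces the same-location rule before anything is sent. This is a minimal sketch, not a complete client: the helper name is hypothetical, authentication and the actual HTTP call are left out, and the `config` object is left empty as a placeholder because its contents depend on the data and should be taken from Google's documentation.

```python
import re
from typing import Optional

API_ROOT = "https://healthcare.googleapis.com/v1"

# Captures the location and, if present, the kind of store inside the dataset.
_RESOURCE = re.compile(
    r"^projects/[^/]+/locations/(?P<location>[^/]+)/datasets/[^/]+"
    r"(?:/(?P<kind>fhirStores|dicomStores)/[^/]+)?$"
)


def build_deidentify_request(source: str, destination: str,
                             config: Optional[dict] = None):
    """Return (url, body) for a de-identify call on a dataset or store."""
    src, dst = _RESOURCE.match(source), _RESOURCE.match(destination)
    if src is None or dst is None:
        raise ValueError("expected a dataset, FHIR store, or DICOM store name")
    if src["kind"] != dst["kind"]:
        raise ValueError("source and destination must be the same kind of resource")
    if src["location"] != dst["location"]:
        raise ValueError("source and destination must be in the same location")
    # Dataset-level calls write to a new dataset; store-level calls write
    # to an existing store.
    key = "destinationStore" if src["kind"] else "destinationDataset"
    body = {key: destination, "config": config or {}}
    return f"{API_ROOT}/{source}:deidentify", body


url, body = build_deidentify_request(
    "projects/my-project/locations/us-central1/datasets/raw",
    "projects/my-project/locations/us-central1/datasets/deidentified",
)
print(url)
print(body)
```

The returned URL and body would be sent as an authenticated POST request. The project, location, and dataset names here are placeholders.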

BERT-based Clinical Deidentification

Several approaches can be used to identify PHI in biomedical corpora for Safe Harbor de-identification. Here we discuss one specific model. BERT models are based on Transformers, a deep learning architecture in which every output element is connected to every input element and the weightings between them are calculated dynamically based on their connection. BERT is bidirectional, i.e., it uses the context on both the left and the right of each token.

BioBERT is a BERT-based model pre-trained on medical corpora such as journals, medical articles, and research publications. As a result, its representations are well suited to biomedical jargon. The BioBERT model has the following features:

  1. Simple architecture based on bidirectional transformers
  2. Single output layer based on the representations from its last layer to compute only token level BIO2 probabilities
  3. BioBERT directly learns WordPiece embeddings during pre-training and fine-tuning.

Here, let’s take a BioBERT model, which is pre-trained with context-aware word embeddings, and use it to classify PHI categories in a named entity recognition (NER) task. The model is fine-tuned on the i2b2 2014 dataset, a fully tagged dataset used in medical research.

The PHI entities identified by the model can then be masked or removed from the text.
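The masking step can be sketched independently of the model. The example below assumes the NER model has already produced one BIO2 tag per word (with WordPiece sub-tokens merged back into words) and replaces each tagged span with a single placeholder. The function name, the sample sentence, and the tag labels are assumptions for illustration, not the output of the fine-tuned model.

```python
def mask_phi(tokens, tags):
    """Replace each BIO2-tagged entity span with one [CATEGORY] placeholder."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    output = []
    previous = "O"
    for token, tag in zip(tokens, tags):
        if tag == "O":
            output.append(token)
        else:
            prefix, _, category = tag.partition("-")
            # An I- tag continues a span only if it follows a tag of the
            # same category; otherwise a new placeholder is started.
            continues = prefix == "I" and previous in (f"B-{category}", f"I-{category}")
            if not continues:
                output.append(f"[{category}]")
        previous = tag
    return " ".join(output)


tokens = ["Mr.", "John", "Smith", "was", "admitted", "on",
          "March", "3", "to", "Mercy", "Hospital"]
tags = ["O", "B-PATIENT", "I-PATIENT", "O", "O", "O",
        "B-DATE", "I-DATE", "O", "B-HOSPITAL", "I-HOSPITAL"]
print(mask_phi(tokens, tags))
```

Keeping the category in the placeholder preserves some context for downstream analysis. Dropping the span entirely is the stricter alternative.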

We computed the following results ourselves:

PHI - ACCURACY

  • AGE: 99%
  • CITY: 82%
  • COUNTRY: 66%
  • DATE: 98%
  • DEVICE: 0%
  • DOCTOR: 93%
  • EMAIL: 0%
  • FAX: 0%
  • HOSPITAL: 79%
  • ID NUMBER: 85%
  • MEDICAL RECORD NUMBER: 99%
  • ORGANIZATION: 40%
  • PATIENT NAME: 89%
  • PHONE: 96%
  • PROFESSION: 79%
  • STATE: 84%
  • STREET: 98%
  • USERNAME: 96%
  • ZIPCODE: 99%
  • UNKNOWN: 97%

The categories with 0% accuracy (DEVICE, EMAIL, and FAX) imply that the model has no representation of those PHI types. Most other categories are detected fairly well, although ORGANIZATION (40%) and COUNTRY (66%) lag behind. Beyond textual information such as patient names and addresses, the model does a good job of detecting specific dates, ages, and numeric data and classifying them as PHI. Requirements that mandate the removal of specific information can therefore be addressed in the same process.

AWS Comprehend Medical

Amazon Comprehend Medical detects and returns useful information in unstructured clinical text such as physicians’ notes, discharge summaries, test results, and case notes. It uses natural language processing (NLP) models to detect entities, which are textual references to medical information such as medical conditions, medications, or Protected Health Information (PHI).

Use the DetectPHI operation to detect Protected Health Information (PHI) data in the clinical text being examined.

The following PHIs are detected through DetectPHI in AWS Comprehend:

[Figure: PHIs detected in AWS Comprehend]

The Protected Health Information Data Extraction and Identification (PHId) API of Amazon Comprehend Medical is priced at $0.0014 per 100 characters of text in a request.
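A minimal sketch of how DetectPHI output might be used to redact text is shown below. The `detect_phi` call and the `Entities`, `BeginOffset`, `EndOffset`, `Score`, and `Type` fields come from the boto3 `comprehendmedical` client. The helper names, the region, the score threshold, and the sample response are assumptions for illustration. The cost estimate assumes that partial 100-character units are rounded up, which should be checked against current AWS pricing.

```python
import math


def detect_phi(text, region_name="us-east-1"):
    """Call DetectPHI. Requires boto3 and AWS credentials."""
    import boto3  # imported here so redact() can be used without AWS access
    client = boto3.client("comprehendmedical", region_name=region_name)
    return client.detect_phi(Text=text)


def redact(text, response, min_score=0.5):
    """Replace each detected entity with [TYPE], skipping low-confidence hits."""
    entities = [e for e in response["Entities"] if e["Score"] >= min_score]
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        text = text[:start] + f"[{entity['Type']}]" + text[end:]
    return text


def estimate_cost(num_chars, price_per_unit=0.0014, unit_chars=100):
    """Estimate the request cost, assuming partial units are rounded up."""
    return math.ceil(num_chars / unit_chars) * price_per_unit


# Hand-built response in the shape DetectPHI returns (not real API output).
sample = "Patient John Smith, 45, seen on 2021-03-04."
sample_response = {"Entities": [
    {"BeginOffset": 8, "EndOffset": 18, "Score": 0.99, "Type": "NAME"},
    {"BeginOffset": 20, "EndOffset": 22, "Score": 0.98, "Type": "AGE"},
    {"BeginOffset": 32, "EndOffset": 42, "Score": 0.20, "Type": "DATE"},
]}
print(redact(sample, sample_response))
```

In real use, `redact(text, detect_phi(text))` would combine the two steps. The threshold trades missed PHI against over-redaction, so it should be tuned on representative notes.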

Conclusion

Removing specific data elements from personal data sets helps ensure that the information retained cannot be used to identify an individual. In short, de-identification of personal information is a very important component of protecting PII and mitigating privacy risks.

To learn more about advances in the de-identification of PHI in healthcare datasets, we invite you to connect with one of our specialists today.

We’re a product engineering firm. Our work is cutting-edge, be it in AI-ML, Cloud, DevOps, or IIoT, for an enviable set of clients. Visit www.ideas2it.com.
