De-identifying (Anonymizing) Data for Development (using Python)
When working with Personally Identifiable Information (PII) or Protected Health Information (PHI), it’s often necessary to “de-identify” that information, particularly if you are displaying it in any way (for example, in reports or published research results). You want to ensure that the data you’re showing can’t be used to identify any individual. In my case, I also wanted to de-identify the data being used within my organization, even though that data wouldn’t be displayed publicly.
Background
To give some background on me and the issue at hand: I work for a company called EduSource, which has partnered with another company called Translating Data (info on that partnership here). In brief, we work with Translating Data to help businesses and small community-based organizations make sense of their data and make business decisions based on our analysis.
My Problem
Because of the work we do, we encounter quite a bit of PII and PHI. Even though we weren’t publishing anything publicly, I still wanted to de-identify any data we had, because the other developers and I were using it to develop our ELT pipelines. We wanted as few eyes as possible on that data in its pure form, PII/PHI and all. This follows the security principle of least privilege, which dictates that any given process or user should have only as much privilege as they need to perform their intended function or work. Since our developers don’t need the data in its raw form to develop our pipelines, we didn’t want them to have access to it.
In my particular case, the data we were receiving came in the form of .DBF files, which were converted into CSVs via an intermediary process. The CSVs were what I needed to de-identify, as those were what would be handled by the developers.
My Solution
Faker
Being an avid Python lover, I gravitated toward a Python solution. I quickly found a really great library called Faker. This library came stocked with loads of providers that generated random fake data (names, addresses, dates, etc.). This gave me an idea.
My Method
This diagram outlines the general way my solution would work. For every field in every row of the data, do the following:
- Combine it with some secret key.
- Input the combined value into a hashing function.
- Take the result and use it to seed Faker’s random generator.
- Generate a fake result.
This would accomplish a few things:
- It would let us control Faker’s output, and
- it would help maintain referential integrity.
The latter point is the more important, in my view. If we just let Faker loose on the data with its default seeding, we would lose referential integrity. The value 12345 might get changed to 54321 in File A, but in File B it might get changed to 98654, and we’d have no way of knowing the two were originally the same. With the method outlined above, every occurrence of a given value is replaced with the same fake value, both within the same file and across files, as long as the same secret key is used.
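The steps above can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual code: HMAC-SHA256 is one assumed way to "combine with a secret key and hash", the key and the name list are made up, and random.Random stands in for Faker's generator. With Faker installed, the derived seed could instead be passed to fake.seed_instance(seed) before calling a provider such as fake.first_name().

```python
import hashlib
import hmac
import random

# Stand-in for a Faker provider (hypothetical list, for illustration only).
FIRST_NAMES = ["Alice", "Bruno", "Chen", "Dalia", "Emeka", "Farah", "Goran", "Hana"]


def derive_seed(value: str, secret_key: bytes) -> int:
    """Combine the value with the secret key and hash it into an integer seed."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def fake_first_name(value: str, secret_key: bytes) -> str:
    """Return a replacement that depends only on (value, secret_key)."""
    rng = random.Random(derive_seed(value, secret_key))
    return rng.choice(FIRST_NAMES)


key = b"example-secret-key"  # hypothetical; a real key would be read from a key file
file_a = ["Jeffrey", "James", "Jeffrey"]
file_b = ["James", "Jeffrey"]

anon_a = [fake_first_name(v, key) for v in file_a]
anon_b = [fake_first_name(v, key) for v in file_b]

# The same original value yields the same replacement within and across files.
print(anon_a[0] == anon_a[2] == anon_b[1])  # True
print(anon_a[1] == anon_b[0])               # True
```

Because the seed depends on the secret key, anyone without the key cannot reproduce the mapping, while every run with the same key produces the same output.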
Data-Anonymizer
The command-line tool I created to solve this problem I dubbed data-anonymizer.
The way it works is fairly simple. It accepts a YAML configuration, a key file, and the CSV file you wish to de-identify/anonymize. The YAML configuration tells the tool about the structure of the data and how each field should be changed.
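To make the flow concrete, here is a rough sketch of how a tool like this might walk a CSV and rewrite only the configured columns. Everything in it is hypothetical: the function names, the sample data, and the placeholder replacement (a truncated keyed hash) are illustrative assumptions, and the configuration is passed as a plain list rather than parsed from YAML.

```python
import csv
import hashlib
import hmac
import io


def pseudonym(value: str, key: bytes) -> str:
    """Placeholder replacement: a short keyed hash of the original value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:10]


def anonymize_csv(text: str, columns: list, key: bytes, delimiter: str = ",") -> str:
    """Rewrite the listed columns of a CSV string and leave the rest untouched."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        for col in columns:
            row[col] = pseudonym(row[col], key)
        writer.writerow(row)
    return out.getvalue()


sample = "visit_id,first_name\n1,Jeffrey\n2,James\n3,Jeffrey\n"
result = anonymize_csv(sample, ["first_name"], b"example-secret-key")
print(result)
```

The header row and the visit_id column pass through unchanged, and both "Jeffrey" rows receive the same replacement.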
Example
Say you had a CSV with this data:
first_name | last_name | ssn | dob |
---|---|---|---|
Jeffrey | Smith | 123-45-6789 | 1990/01/23 |
James | Cilantro | 098-76-5432 | 1978/04/24 |
A configuration file for a CSV like that could be:
delimiter: ','
columns_to_anonymize:
  first_name:
    type: first_name
  last_name:
    type: last_name
  ssn:
    type: custom
    format: '###-##-####'
  dob:
    type: datetime
    format: '%Y/%m/%d'
    preserve_year: true
    safe_harbor: true
Pretty readable, I think. Each key under columns_to_anonymize corresponds to the CSV header above each column (if the data doesn’t have headers, indices work as well). Each type value corresponds to a field type that the tool is capable of handling. The custom field type is a special catch-all that takes in a format and generates data based on that format. In the above example, # symbols are replaced with randomly generated digits between 0 and 9, so a possible generated value for ssn could be 504-37-6487.
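As an illustration of the custom format idea, the sketch below replaces each # with a random digit and copies every other character through. The function name fill_format and the fixed seed are assumptions made for this example. Faker ships a similar helper, numerify, though whether data-anonymizer relies on it is not stated here.

```python
import random


def fill_format(fmt: str, rng: random.Random) -> str:
    """Replace each '#' in fmt with a random digit 0-9; keep other characters."""
    return "".join(str(rng.randint(0, 9)) if ch == "#" else ch for ch in fmt)


rng = random.Random(2024)  # fixed seed so repeated runs give the same value
ssn = fill_format("###-##-####", rng)
print(ssn)  # an 11-character string shaped like NNN-NN-NNNN
```

In the real pipeline the generator would be seeded from the hashed original value rather than a constant, so the same input always maps to the same output.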
Regarding dob, the preserve_year option tells the tool to randomize only the month and day of the value, while preserving the year from the original data. The safe_harbor flag refers to the “Safe Harbor” method of de-identifying data, outlined by HIPAA. Specifically, this flag refers to this clause:
The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed: … All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
The safe_harbor flag tells the tool that if the date is 90 years in the past or more, it should set the year of the result to 150 years ago. This way, any data referring to an individual of 90 years or older is aggregated into a single group of 150-year-old records.
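The date rules described above could look roughly like the following. This is a sketch under stated assumptions, not the tool's implementation: the function name anonymize_dob is hypothetical, "90 years in the past or more" is interpreted as 90 or more full years elapsed, and today is passed in explicitly so the result is repeatable.

```python
import random
from datetime import date, datetime, timedelta


def anonymize_dob(value: str, fmt: str, rng: random.Random, today: date,
                  safe_harbor: bool = True) -> str:
    """Randomize month and day, keep the year, and apply the 90-year rule."""
    original = datetime.strptime(value, fmt).date()
    year = original.year

    # Full years elapsed between the original date and today.
    years_ago = today.year - original.year - (
        (today.month, today.day) < (original.month, original.day)
    )
    if safe_harbor and years_ago >= 90:
        # Collapse everyone aged 90 or older into a single bucket.
        year = today.year - 150

    # Pick a random day within the chosen year (handles leap years).
    first_day = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - first_day).days
    fake = first_day + timedelta(days=rng.randrange(days_in_year))
    return fake.strftime(fmt)


today = date(2024, 6, 1)
rng = random.Random(42)  # in practice, seeded from the hashed original value
recent = anonymize_dob("1990/01/23", "%Y/%m/%d", rng, today)
elderly = anonymize_dob("1930/03/15", "%Y/%m/%d", rng, today)
print(recent[:4])   # 1990 (year preserved)
print(elderly[:4])  # 1874 (2024 minus 150)
```

Only the year is guaranteed here; the month and day depend on the seed.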
Conclusion
Even when you’re not showing sensitive data publicly in any way, it’s still important to keep in mind the principle of least privilege, restricting users/developers from being able to see or do anything beyond what is strictly necessary.
Also, let me know if you find my tool useful!