The Liability of Testing with PII
How do you create sample data to test your data pipeline in a privacy-preserving way? Do you have sensitive production data lingering in other environments? Do you have employees walking around with real personally identifiable information (PII) on their laptops? Can you imagine what a liability this would be under any company’s cybersecurity and data privacy policy? Or what would happen in the event of a hack?
Building the Tool In-House Was Our Solution
After researching existing third-party tools for scrubbing PII from data, the Engineering team at Spokeo found it necessary to build an in-house solution. The Software Development/Design Engineer in Test (SDET) team at Spokeo built a data scrubber tool that takes input data containing real values from the production environment and creates anonymized output data in the original format. The input can be a text, CSV, or Parquet file, or even a JSON object with multiple levels of nested structure. The data scrubber has a configuration file where the user specifies which fields contain sensitive data. Those fields are scrubbed and replaced with synthetic values that look real, for instance firstName, lastName, phoneNumber, dateOfBirth, mailingAddress, and numerous other data fields that might be sensitive to the consumer or your business.
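As an illustration of config-driven scrubbing over nested JSON, here is a minimal Python sketch. It is not Spokeo's implementation: the SCRUB_CONFIG mapping, the scrub function, and the fixed replacement values are all assumptions made for this example, and a real tool would generate varied, realistic values.

```python
from typing import Any, Callable, Dict

# Hypothetical configuration: field name -> function producing a synthetic value.
# The field names come from the examples above; the replacement values are placeholders.
SCRUB_CONFIG: Dict[str, Callable[[], Any]] = {
    "firstName": lambda: "Alex",
    "lastName": lambda: "Doe",
    "phoneNumber": lambda: "555-0100",
}


def scrub(node: Any, config: Dict[str, Callable[[], Any]]) -> Any:
    """Return a copy of `node` with configured fields replaced at any nesting depth."""
    if isinstance(node, dict):
        return {
            key: config[key]() if key in config else scrub(value, config)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [scrub(item, config) for item in node]
    return node  # scalars outside a configured key pass through unchanged


record = {
    "id": 42,
    "person": {
        "firstName": "Real",
        "lastName": "Name",
        "contacts": [{"phoneNumber": "310-555-1234", "type": "cell"}],
    },
}
scrubbed = scrub(record, SCRUB_CONFIG)
```

The output keeps the shape of the input, so downstream code that expects the original structure still works. Only the configured keys change, wherever they appear in the nesting.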
This project started out as a much simpler script. The use cases grew very quickly, calling for a well-engineered tool. We created a factory that can produce a value for any given “column_name” or “key” in JSON. The user can fine-tune the synthetic data generated by the factory by specifying parameters for each field. We wanted to make sure that data integrity stays intact even as we scrub the data.
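One way such a factory could look is a registry keyed by column name, with per-field parameters passed through as keyword arguments. The sketch below is an assumption about the general pattern, not the actual tool: GENERATORS, register, make_value, and the parameters choices and area_code are hypothetical names.

```python
import random
from typing import Any, Callable, Dict

# Registry mapping a column name (or JSON key) to a generator function.
GENERATORS: Dict[str, Callable[..., Any]] = {}


def register(column_name: str):
    """Decorator that registers a generator for one column name."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        GENERATORS[column_name] = func
        return func
    return decorator


@register("first_name")
def first_name(rng: random.Random, choices=("Alex", "Sam", "Jordan")) -> str:
    return rng.choice(choices)


@register("phone_number")
def phone_number(rng: random.Random, area_code: str = "555") -> str:
    # The 555-01xx line numbers are conventionally used for fictional phone numbers.
    return f"({area_code}) 555-01{rng.randint(0, 99):02d}"


def make_value(column_name: str, rng: random.Random, **params: Any) -> Any:
    """Produce a synthetic value for `column_name`, tuned by optional parameters."""
    try:
        generator = GENERATORS[column_name]
    except KeyError:
        raise KeyError(f"no generator registered for {column_name!r}") from None
    return generator(rng, **params)


rng = random.Random(0)  # seeded so that test data is reproducible
name = make_value("first_name", rng)
phone = make_value("phone_number", rng, area_code="310")
```

Passing the random generator in explicitly keeps runs reproducible, which helps when a failing test needs to be replayed against the same synthetic data.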
Key Challenges
We ran into several challenges as we completed the proof of concept (POC) with limited features and then used its design as the basis for implementing the remaining features. A few of the challenges that the team resolved quickly are below:
- Related columns: Each record must maintain related column values. For example, the values in first_name, last_name, and full_name must make sense together. We couldn’t simply create a random value for each column, because some columns hold partial data that has to agree with other columns.
- Dependent columns: Certain columns should only have a value if the value of another column is a specific string. For example, “date_of_marriage” should only have a date if the value for the “married” column is “Yes”.
- Logic within the values: Several columns needed to be logically correct. An expiration date cannot fall before the issue date, and updated_date cannot fall before created_date. The date_of_birth of an adult must put their age above 18 years old and below a reasonable upper limit, unless the person is deceased.
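The three kinds of constraints above can be sketched in one record generator. This is a simplified illustration under stated assumptions, not the team's code: synth_record, the name lists, and the 18 to 90 age range are hypothetical, and the deceased case is left out.

```python
import random
from datetime import date, timedelta
from typing import Any, Dict


def synth_record(rng: random.Random, today: date) -> Dict[str, Any]:
    """Generate one synthetic record whose columns agree with each other."""
    # Related columns: full_name is derived from the parts, never generated alone.
    first = rng.choice(["Alex", "Sam", "Jordan"])
    last = rng.choice(["Doe", "Roe", "Poe"])
    record: Dict[str, Any] = {
        "first_name": first,
        "last_name": last,
        "full_name": f"{first} {last}",
    }

    # Logic within the values: an adult between 18 and 90 years old.
    # 18 * 366 days is always more than 18 calendar years; 90 * 365 is less than 90.
    age_days = rng.randint(18 * 366, 90 * 365)
    record["date_of_birth"] = today - timedelta(days=age_days)

    # Dependent columns: date_of_marriage exists only when married == "Yes",
    # and falls between the day the person passed 18 and today.
    record["married"] = rng.choice(["Yes", "No"])
    if record["married"] == "Yes":
        days_ago = rng.randint(0, age_days - 18 * 366)
        record["date_of_marriage"] = today - timedelta(days=days_ago)
    else:
        record["date_of_marriage"] = None

    # Logic within the values: updated_date never precedes created_date.
    created = today - timedelta(days=rng.randint(0, 365))
    record["created_date"] = created
    record["updated_date"] = created + timedelta(
        days=rng.randint(0, (today - created).days)
    )
    return record


example = synth_record(random.Random(0), date(2024, 1, 1))
```

The key design choice is to generate the independent value first and derive or bound the others from it, instead of generating each column separately and checking afterwards.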
Future Uses for the Tool
After tackling the problem of scrubbing PII from test data, we started generalizing the tool to fit other teams’ needs. Some of the teams at Spokeo work with large data files where values across different records are related, for instance a group of family members sharing the same address. If we scrub one person’s address, we must stay consistent by giving every family member the same replacement address. Otherwise, we might end up with data integrity issues within the test/sample data. The tool is still in its early stages, but once it matures into a project that other teams can benefit from, we’d love to explore the opportunity to open-source it.
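A common way to keep values consistent across records is to remember which synthetic value was assigned to each real value and reuse it. The sketch below assumes that approach; ConsistentScrubber and the placeholder address format are hypothetical and not taken from the tool.

```python
from typing import Dict, List


class ConsistentScrubber:
    """Maps each distinct real value to one synthetic value, reused across records."""

    def __init__(self) -> None:
        self._mapping: Dict[str, str] = {}

    def scrub_address(self, real_address: str) -> str:
        if real_address not in self._mapping:
            # Placeholder synthetic address; the counter keeps distinct inputs distinct.
            self._mapping[real_address] = f"{100 + len(self._mapping)} Example St"
        return self._mapping[real_address]


records: List[Dict[str, str]] = [
    {"name": "A", "address": "1 Real Rd"},
    {"name": "B", "address": "1 Real Rd"},   # same household as A
    {"name": "C", "address": "9 Other Ave"},
]
scrubber = ConsistentScrubber()
scrubbed = [{**r, "address": scrubber.scrub_address(r["address"])} for r in records]
```

Note that the mapping itself holds real values while the scrubber runs, so in this sketch it would need to stay in memory only and be discarded once the output is written.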
Join the Spokeo Engineering Team
Spokeo is tackling various big-data challenges across Engineering, QA, Product, and other disciplines to help redefine digital identity. If you are interested in solving meaningful problems about digital identity, please check out our job opportunities: https://www.spokeo.com/careers