Extract and obfuscate PII – Security and DevSecOps with Python

A person’s sensitive personal information is one of the most valuable things they have: financially, socially, and intimately. Compromising the security of this information can result in great harm to the person in all three of these areas. The privacy of people is of utmost importance, and we must take all possible measures in order to preserve it, especially when they entrust some bit of that information to the services that we provide for them.

So, what would be the best approach to start with this? Well, there are a lot of pre-built services, such as Amazon Macie or Google Cloud’s DLP (short for Data Loss Prevention). These services are handy, but you may not understand their inner workings, since they are built on machine learning models that require very little input (if any) from their users to redact and obfuscate a wide range of personal information.

But let’s say there is a piece of information that isn’t covered by these services, or you cannot use them because of compliance reasons. Then, you would have to begin creating your solution from scratch. Here again, Python is your friend. You can use Python to read files and find locations that contain sensitive information (based on certain criteria) and change that information in a way that hides or obfuscates it. This technique is the same as that for mining data by finding important patterns within the data, only in this case, instead of extracting the data, we are making sure that if some malicious actors tried to extract it, they would not be able to recover any vital information.
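As a rough illustration of that read, find, and replace workflow, here is a minimal sketch. The `EMP-12345` style identifier, the `redact_file` helper, and the `<employee_id>` placeholder are all hypothetical, standing in for whatever custom piece of information a managed service would not cover.

```python
import re
from pathlib import Path

# Hypothetical identifier format, used only for illustration:
# the literal prefix "EMP-" followed by exactly five digits.
EMPLOYEE_ID = re.compile(r"\bEMP-\d{5}\b")


def redact_file(source: Path, destination: Path,
                placeholder: str = "<employee_id>") -> int:
    """Write a redacted copy of source to destination.

    Returns the number of identifiers that were replaced.
    """
    text = source.read_text(encoding="utf-8")
    # subn() returns both the new text and the number of substitutions made.
    redacted, count = EMPLOYEE_ID.subn(placeholder, text)
    destination.write_text(redacted, encoding="utf-8")
    return count
```

Writing to a separate destination file, rather than overwriting the source, is a deliberate choice in this sketch: it lets you check the redacted copy before anything is discarded.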

To demonstrate this, we are going to use a very simple regular expression (regex) pattern for phone numbers to find them within a text and replace them with some form of redaction. We could try something a bit more complex, but the concept would be the same, and if you are new to regex, I would suggest that you start slowly and discover the magic of regex. Truly, you will feel like a wizard.

Enough posturing for now; let’s get down to business. First, we need a regex that captures the pattern of a phone number. Don’t try to write a regex yourself unless you really want to dive deep into it. For most use cases, you can find a suitable pre-made regex on the internet. In most cases, you can use it as it is, and in some very specific cases, you might have to make a couple of adjustments. So, a regex that covers one common phone number pattern (three digits, three digits, and four digits, separated by dashes and without a country code) can be written like this: \d{3}-\d{3}-\d{4}

That probably looks like a bunch of gibberish to you, but it works, and you should trust that (someone probably lost their mind trying to get it just right). Each \d{n} matches exactly n digits, so the pattern matches numbers written with dashes and without country codes (though you can make a regex that works both with and without them). Now, let’s implement the regex on a small passage that contains phone numbers:

    import re

    # Initial text containing two phone numbers
    text = "The first number is 901-895-7906.\nThe second number is: 081-548-3262"

    # Pattern to search for
    search_pattern = r"\d{3}-\d{3}-\d{4}"

    # Replacement for each match of the pattern
    replacement_text = "<phone_number>"

    # Text replacement
    new_text = re.sub(search_pattern, replacement_text, text)

    # Output given:
    # The first number is <phone_number>.
    # The second number is: <phone_number>
    print(new_text)

Well, there you have it; this code finds phone numbers using a regex pattern and then obfuscates them by replacing each one with a placeholder.
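If you also need to catch numbers written with a country code, one possible adjustment is to make a prefix optional. The sketch below assumes the country code is written as a plus sign, one to three digits, and then a space or dash; that format is an assumption for illustration, not a universal standard. The `redact_phone_numbers` helper is likewise a hypothetical name.

```python
import re

# Assumed format: an optional "+<1-3 digits>" prefix followed by a space or
# dash, then the same ddd-ddd-dddd pattern as before. The lookarounds stop
# the pattern from matching inside a longer run of digits.
PHONE_PATTERN = re.compile(r"(?<!\d)(?:\+\d{1,3}[ -])?\d{3}-\d{3}-\d{4}(?!\d)")


def redact_phone_numbers(text: str) -> str:
    """Replace every phone number in text with a placeholder."""
    return PHONE_PATTERN.sub("<phone_number>", text)


print(redact_phone_numbers("Call +1 901-895-7906 or 081-548-3262."))
# Output: Call <phone_number> or <phone_number>.
```

Real-world phone formats vary far more than this (parentheses, dots, extensions), so treat the pattern as a starting point to adjust rather than a complete solution.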

This is one of the simplest and most accessible ways to get started with regex, but you can get far more complex with it. You can use it to obfuscate social security numbers, passport numbers, and pretty much anything that matches a pre-defined pattern.
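To give a feel for how this extends to more than one kind of identifier, here is a small sketch that applies several patterns in turn. The `PATTERNS` table and the `redact` helper are hypothetical names, and the social security number pattern assumes the US format written with dashes (ddd-dd-dddd). Passport numbers are left out because their format differs from country to country.

```python
import re

# Illustrative patterns only; each label becomes the placeholder text.
PATTERNS = {
    "phone_number": re.compile(r"(?<!\d)\d{3}-\d{3}-\d{4}(?!\d)"),
    # Assumes a US social security number written with dashes.
    "ssn": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)"),
}


def redact(text: str) -> str:
    """Apply every pattern in PATTERNS, replacing matches with <label>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(redact("SSN 123-45-6789, phone 901-895-7906."))
# Output: SSN <ssn>, phone <phone_number>.
```

Keeping the patterns in a single table means that covering a new identifier is a one-line change, which is handy once the list starts to grow.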

So far, we have had security on the simplest, textual level. Now, we need to look towards security for our infrastructure. For this, we can look at container images since they are so prevalent, and validating them is so important. Let’s see how we can validate these images.
