GitHub Repository: Azure/Azure-Sentinel-Notebooks
Path: blob/master/tutorials-and-examples/feature-tutorials/DataObfuscation.ipynb
Kernel: Python 3

Data Obfuscation Library

Sharing data, creating documents and doing public demonstrations often require that data containing PII or other sensitive material be obfuscated.

MSTICPy contains a simple library to obfuscate data using hashing and random mapping of values. You can use these functions on single data items or on entire DataFrames.


Import the module

import pandas as pd
from msticpy.common.utility import md
from msticpy.data import data_obfus

Read in some data for the examples

import ast

netflow_df = pd.read_csv("data/az_net_flows.csv")

# Lists are stored as strings in the CSV - convert them back to lists.
# ast.literal_eval parses Python literals only, so it is safer than eval.
def str_to_list(val):
    if isinstance(val, str):
        return ast.literal_eval(val)
    return None  # non-string values (e.g. NaN) become None, as before

netflow_df["PublicIPs"] = netflow_df["PublicIPs"].apply(str_to_list)

# Define subset of output columns
out_cols = [
    'TenantId', 'TimeGenerated', 'FlowStartTime',
    'ResourceGroup', 'VMName', 'VMIPAddress', 'PublicIPs',
    'SrcIP', 'DestIP', 'L4Protocol', 'AllExtIPs'
]
netflow_df = netflow_df[out_cols]

Individual Obfuscation Functions

Here we're importing individual functions but you can access them with the single import statement above as:

data_obfus.hash_string(...)

etc.

Note: In the next cell we're using a function to output documentation and examples.
You can ignore this. The usage of each function is shown in the output of
the subsequent cells.

from msticpy.data.data_obfus import (
    hash_dict,
    hash_ip,
    hash_item,
    hash_list,
    hash_sid,
    hash_string,
    replace_guid
)

# Function to automate/format the examples below. You can ignore this
def show_func(func, examples):
    func_name = func.__name__
    if func.__name__.startswith("_"):
        func_name = func_name[1:]
    md(func_name, "bold")
    print(func.__doc__)
    md("Examples", "bold")
    for example in examples:
        if isinstance(example, tuple):
            arg, delim = example
            print(
                f"{func_name}('{arg}', delim='{delim}') =>",
                func(*example)
            )
        else:
            print(
                f"{func_name}('{example}') =>",
                func(example)
            )
    md("<br><hr><br>")
md("hash_string", "large, bold")
md("hash_string does a simple hash of the input. If the input is a numeric string it will output a numeric")
show_func(hash_string, ["sensitive data", "42424"])

Hash a simple string.

    Parameters
    ----------
    input_str : str
        The input string

    Returns
    -------
    str
        The obfuscated output string

hash_string('sensitive data') => jdiqcnrqmlidkd
hash_string('42424') => 98478
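
The behaviour shown above (letters for text input, digits for numeric input) can be illustrated with a small standard-library sketch. This is not MSTICPy's implementation: `sketch_hash_string` is a hypothetical name, and the use of SHA-256, the lowercase alphabet and the length-preserving output are assumptions made for illustration.

```python
import hashlib
import string


def sketch_hash_string(input_str: str) -> str:
    """Return a deterministic stand-in with the same length as the input."""
    digest = hashlib.sha256(input_str.encode("utf-8")).digest()
    # Numeric input maps to digits, anything else maps to lowercase letters.
    alphabet = string.digits if input_str.isdigit() else string.ascii_lowercase
    # Cycle through the digest bytes so inputs longer than 32 characters work.
    return "".join(
        alphabet[digest[i % len(digest)] % len(alphabet)]
        for i in range(len(input_str))
    )


print(sketch_hash_string("sensitive data"))
print(sketch_hash_string("42424"))
```

Because the sketch is a pure function of its input, the same value always produces the same replacement, which keeps joins and group-bys on the obfuscated column meaningful.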
md("hash_item", "large, bold")
md("hash_item allows specification of delimiters. Useful for preserving the look of domains, emails, etc.")
show_func(hash_item, [("sensitive data", " "), ("most-sensitive-data/here", " /-")])

Hash a simple string.

    Parameters
    ----------
    input_item : str
        The input string
    delim: str, optional
        A string of delimiters to use to split the input string
        prior to hashing.

    Returns
    -------
    str
        The obfuscated output string

hash_item('sensitive data', delim=' ') => kdneqoiia laoe
hash_item('most-sensitive-data/here', delim=' /-') => kmea-kdneqoiia-laoe/fcec
md("hash_ip", "large, bold")
md("hash_ip will output random mappings of input IP V4 and V6 addresses.")
md("Within a Python session the mapping will remain constant.")
show_func(hash_ip, [
    "192.168.3.1",
    "2001:0db8:85a3:0000:0000:8a2e:0370:7334",
    ["192.168.3.1", "192.168.5.2", "192.168.10.2"],
])

Hash IP address or list of IP addresses.

    Parameters
    ----------
    input_item : Union[List[str], str]
        List of IP addresses or single IP address.

    Returns
    -------
    Union[List[str], str]
        List of hashed addresses or single address.
        (depending on input)

hash_ip('192.168.3.1') => 192.168.84.105
hash_ip('2001:0db8:85a3:0000:0000:8a2e:0370:7334') => 85d6:7819:9cce:9af1:9af1:24ad:d338:7d03
hash_ip('['192.168.3.1', '192.168.5.2', '192.168.10.2']') => ['192.168.84.105', '192.168.172.202', '192.168.232.202']
md("hash_sid", "large, bold")
md("hash_sid will randomize the domain-specific parts of a SID. It preserves built-in SIDs and well known RIDs (e.g. Admins -500)")
show_func(hash_sid, ["S-1-5-21-1180699209-877415012-3182924384-1004", "S-1-5-18"])

Hash a SID preserving well-known SIDs and the RID.

    Parameters
    ----------
    sid : str
        SID string

    Returns
    -------
    str
        Hashed SID

hash_sid('S-1-5-21-1180699209-877415012-3182924384-1004') => S-1-5-21-3321821741-636458740-4143214142-1004
hash_sid('S-1-5-18') => S-1-5-18
md("hash_list", "large, bold")
md("hash_list will randomize a list of items preserving the list structure.")
show_func(hash_list, [["S-1-5-21-1180699209-877415012-3182924384-1004", "S-1-5-18"]])

Hash list of strings.

    Parameters
    ----------
    item_list : List[str]
        Input list

    Returns
    -------
    List[str]
        Hashed list

hash_list('['S-1-5-21-1180699209-877415012-3182924384-1004', 'S-1-5-18']') => ['elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'nrllmpbd']
md("hash_dict", "large, bold")
md("hash_dict will randomize a dict of items preserving the structure and the dict keys.")
show_func(hash_dict, [{"SID1": "S-1-5-21-1180699209-877415012-3182924384-1004", "SID2": "S-1-5-18"}])

Hash dictionary values.

    Parameters
    ----------
    item_dict : Dict[str, Union[Dict[str, Any], List[Any], str]]
        Input item can be a Dict of strings, lists or other
        dictionaries.

    Returns
    -------
    Dict[str, Any]
        Dictionary with hashed values.

hash_dict('{'SID1': 'S-1-5-21-1180699209-877415012-3182924384-1004', 'SID2': 'S-1-5-18'}') => {'SID1': 'elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'SID2': 'nrllmpbd'}
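
One way to picture "preserving the structure and the dict keys" is a recursive walk that rewrites only the string leaves. The sketch below is an assumption about the general technique, not the library's code: `sketch_hash_nested` and `_hash_leaf` are hypothetical names, and the truncated SHA-256 hex digest stands in for whatever string hash is really used.

```python
import hashlib
from typing import Any


def _hash_leaf(value: str) -> str:
    """Hypothetical stand-in for a string-hashing function."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def sketch_hash_nested(item: Any) -> Any:
    """Hash string leaves while keeping dict keys and list/dict structure."""
    if isinstance(item, dict):
        return {key: sketch_hash_nested(val) for key, val in item.items()}
    if isinstance(item, list):
        return [sketch_hash_nested(val) for val in item]
    if isinstance(item, str):
        return _hash_leaf(item)
    # Non-string leaves (numbers, None) are left unchanged in this sketch.
    return item


sample = {"SID1": "S-1-5-18", "nested": {"ips": ["10.0.0.1", "10.0.0.2"]}}
print(sketch_hash_nested(sample))
```

Note that in the output above the SIDs inside the list and dict are hashed as plain strings, so even the well-known `S-1-5-18` is replaced, unlike with `hash_sid`.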
md("replace_guid", "large, bold")
md("replace_guid will output a random UUID mapped to the input.")
md("An input GUID will be mapped to the same newly-generated output UUID")
md("You can see that UUID #4 is the same as #1 and mapped to the same output UUID.")
show_func(replace_guid, [
    "cf1b0b29-08ae-4528-839a-5f66eca2cce9",
    "ed63d29e-6288-4d66-b10d-8847096fc586",
    "ac561203-99b2-4067-a525-60d45ea0d7ff",
    "cf1b0b29-08ae-4528-839a-5f66eca2cce9",
])

Replace GUID/UUID with mapped random UUID.

    Parameters
    ----------
    guid : str
        Input UUID.

    Returns
    -------
    str
        Mapped UUID

replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9') => 01ae8633-22e5-480f-b884-fc48588c25d9
replace_guid('ed63d29e-6288-4d66-b10d-8847096fc586') => 52cd2814-b5e4-48bd-80f2-51b503e50467
replace_guid('ac561203-99b2-4067-a525-60d45ea0d7ff') => ef059dc7-2d6e-4506-8619-05b346a6bc6b
replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9') => 01ae8633-22e5-480f-b884-fc48588c25d9
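
The "same input, same random output" behaviour shown above is typical of a cached random mapping. A minimal sketch follows, assuming a module-level dictionary as the cache; `sketch_replace_guid` and `_GUID_MAP` are hypothetical names, and lowercasing the key is an added assumption rather than documented library behaviour.

```python
import uuid
from typing import Dict

# Hypothetical cache: it lives only as long as the Python session.
_GUID_MAP: Dict[str, str] = {}


def sketch_replace_guid(guid: str) -> str:
    """Return a random UUID for guid, reusing the result for repeated inputs."""
    key = guid.lower()
    if key not in _GUID_MAP:
        _GUID_MAP[key] = str(uuid.uuid4())
    return _GUID_MAP[key]


first = sketch_replace_guid("cf1b0b29-08ae-4528-839a-5f66eca2cce9")
other = sketch_replace_guid("ed63d29e-6288-4d66-b10d-8847096fc586")
again = sketch_replace_guid("cf1b0b29-08ae-4528-839a-5f66eca2cce9")
print(first == again, first == other)
```

With this design the mapping cannot be reversed from the output alone, but it is also not reproducible across sessions: a new kernel produces new replacement values.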

Obfuscating DataFrames

We can use the msticpy pandas extension to obfuscate an entire DataFrame.

The obfuscation library contains a mapping for a number of common field names. You can view this list by displaying the attribute:

data_obfus.OBFUS_COL_MAP

In the first example, the TenantId, ResourceGroup and VMName columns have been obfuscated.

display(netflow_df.head(3))
netflow_df.head(3).mp_mask.mask()

Adding custom column mappings

Note in the previous example that the VMIPAddress, PublicIPs and AllExtIPs columns were unchanged.

We can add these columns to a custom mapping dictionary and re-run the obfuscation. See the later section on Creating Custom Mappings.

col_map = {
    "VMName": ".",
    "VMIPAddress": "ip",
    "PublicIPs": "ip",
    "AllExtIPs": "ip"
}

# Pass the custom mapping so the extra columns are obfuscated
netflow_df.head(3).mp_mask.mask(column_map=col_map)

obfuscate_df function

You can also call the standard function obfuscate_df to perform the same operation on the dataframe passed as the data parameter.

data_obfus.obfuscate_df(data=netflow_df.head(3), column_map=col_map)

Creating custom mappings

A custom mapping dictionary has entries in the following form:

"ColumnName": "operation"

The operation defines the type of obfuscation method used for that column. Both the column and the operation code must be quoted.

operation code    obfuscation function
--------------    --------------------
"uuid"            replace_guid
"ip"              hash_ip
"str"             hash_string
"dict"            hash_dict
"list"            hash_list
"sid"             hash_sid
"null"            "null"*
None              hash_string*
delims_str        hash_item*

*The last three items require some explanation:

  • null - the null operation code means set the value to empty - i.e. delete the value in the output frame.

  • None (i.e. the dictionary value is None) defaults to hash_string.

  • delims_str - any string other than those named above is assumed to be a string of delimiters. See next section for a discussion of use of delimiters.


NOTE: If you want to use only custom mappings and ignore the built-in
mapping table, specify use_default=False as a parameter to either
mp_mask.mask() or obfuscate_df.
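
The way a mapping dictionary selects an operation per column can be sketched as a small dispatch routine over a single row. Everything here is illustrative: the `fake_*` functions, `OPERATIONS` and `sketch_apply_map` are hypothetical stand-ins, the row is a plain dict rather than a DataFrame, and only two named operation codes are wired up.

```python
from typing import Callable, Dict, Optional


def fake_ip(value: str) -> str:
    """Hypothetical stand-in for an IP-hashing function."""
    return "x.x.x.x"


def fake_str(value: str) -> str:
    """Hypothetical stand-in for a string-hashing function."""
    return "*" * len(value)


def fake_item(value: str, delims: str) -> str:
    """Hypothetical stand-in that masks everything except delimiters."""
    return "".join(ch if ch in delims else "*" for ch in value)


# Named operation codes; only two are wired up in this sketch.
OPERATIONS: Dict[str, Callable[[str], str]] = {"ip": fake_ip, "str": fake_str}


def sketch_apply_map(
    row: Dict[str, str], column_map: Dict[str, Optional[str]]
) -> Dict[str, str]:
    """Return a copy of row with each mapped column obfuscated."""
    result = dict(row)
    for column, operation in column_map.items():
        if column not in result:
            continue  # mapping entries for absent columns are skipped
        value = result[column]
        if operation == "null":
            result[column] = ""  # "null" empties the value
        elif operation is None:
            result[column] = fake_str(value)  # None falls back to string hashing
        elif operation in OPERATIONS:
            result[column] = OPERATIONS[operation](value)
        else:
            result[column] = fake_item(value, operation)  # delimiter string
    return result


row = {"VMName": "host.example", "VMIPAddress": "10.0.3.5", "L4Protocol": "T"}
print(sketch_apply_map(row, {"VMName": ".", "VMIPAddress": "ip"}))
```

The order of the branches mirrors the table: named codes are checked first, and only a string that matches none of them is treated as a set of delimiters.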


Using hash_item with delimiters to preserve the structure/look of the hashed input

Using hash_item with a delimiters string lets you create output that somewhat resembles the input type. The delimiters string is specified as a simple string of delimiter characters, e.g. "@\,-"

The input string is broken into substrings using each of the delimiters in the delims_str. The substrings are individually hashed and the resulting substrings joined together using the original delimiters. The string is split in the order of the characters in the delims string.

This allows you to create hashed values that bear some resemblance to the original structure of the string. This might be useful for email addresses, qualified domain names and other structured text.

For example : [email protected]

Using the simple hash_string function the output bears no resemblance to an email address

hash_string("[email protected]")
'prqocjmdpbodrafn'

Using hash_item and specifying the expected delimiters we get something like an email address in the output.

hash_item("[email protected]", "@.")

You use hash_item in your Custom Mapping dictionary by specifying a delimiters string as the operation.
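
The split, hash and rejoin procedure described in this section can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: `sketch_hash_item` and `_hash_part` are hypothetical names, the substring hash is a SHA-256 based stand-in, and the sketch splits on all delimiters in one pass instead of one delimiter at a time in the order given.

```python
import hashlib
import re
import string


def _hash_part(part: str) -> str:
    """Hypothetical helper: map a substring to lowercase letters of equal length."""
    digest = hashlib.sha256(part.encode("utf-8")).digest()
    letters = string.ascii_lowercase
    return "".join(
        letters[digest[i % len(digest)] % len(letters)] for i in range(len(part))
    )


def sketch_hash_item(input_item: str, delim: str) -> str:
    """Hash each delimiter-separated substring and keep the delimiters in place."""
    if not delim:
        return _hash_part(input_item)
    # The capturing group makes re.split return the delimiters too:
    # substrings land at even indices, delimiters at odd indices.
    pattern = "([" + re.escape(delim) + "])"
    pieces = re.split(pattern, input_item)
    return "".join(
        piece if idx % 2 else _hash_part(piece)
        for idx, piece in enumerate(pieces)
    )


print(sketch_hash_item("most-sensitive-data/here", " /-"))
```

The printed value keeps the `-` and `/` characters at their original positions while every other character is replaced, which is the property that makes the result still look like a path, domain or address.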

Checking Your Obfuscation

You should check that you have correctly masked all of the columns needed. There is a function check_obfuscation to do this.

Use silent=False to print out the results. If you use silent=True (the default), it will return two lists: the unchanged columns and the obfuscated columns.

data_obfus.check_obfuscation(
    data: pandas.core.frame.DataFrame,
    orig_data: pandas.core.frame.DataFrame,
    index: int = 0,
    silent=True,
) -> Union[Tuple[List[str], List[str]], NoneType]

Check the obfuscation results for a row.

Parameters
----------
data : pd.DataFrame
    Obfuscated DataFrame
orig_data : pd.DataFrame
    Original DataFrame
index : int, optional
    The row to check, by default 0
silent: bool
    If False the function returns no output and
    returns lists of changed and unchanged columns.
    By default, True

Returns
-------
Optional[Tuple[List[str], List[str]]] :
    If silent is True returns a tuple of unchanged, changed items.
    If False, returns None.

Note: By default this will check only the first row of the data. You can check other rows using the index parameter.

Warning The two DataFrames should have a matching index and ordering because the check works by comparing the values in each column, judging that column values that do not match have been obfuscated.

We first test the partially-obfuscated DataFrame from earlier.

partly_obfus_df = netflow_df.head(3).mp_mask.mask()
fully_obfus_df = netflow_df.head(3).mp_mask.mask(column_map=col_map)

data_obfus.check_obfuscation(partly_obfus_df, netflow_df.head(3), silent=False)

===== Start Check ====
Unchanged columns:
------------------
AllExtIPs: 65.55.44.109
FlowStartTime: 2019-02-12 13:00:07.000
L4Protocol: T
PublicIPs: ['65.55.44.109']
TimeGenerated: 2019-02-12 14:22:40.697
VMIPAddress: 10.0.3.5

Obfuscated columns:
--------------------
DestIP: nan ----> nan
ResourceGroup: asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa
SrcIP: nan ----> nan
TenantId: 52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> f9ef3428-3ccb-4ecd-8466-dbedc7044293
VMName: msticalertswin1 ----> fmlmbnlpdcbnbnn
====== End Check =====

Checking the fully-obfuscated data set

data_obfus.check_obfuscation(fully_obfus_df, netflow_df.head(3), silent=False)

===== Start Check ====
Unchanged columns:
------------------
FlowStartTime: 2019-02-12 13:00:07.000
L4Protocol: T
TimeGenerated: 2019-02-12 14:22:40.697

Obfuscated columns:
--------------------
AllExtIPs: 65.55.44.109 ----> 100.11.187.82
DestIP: nan ----> nan
PublicIPs: ['65.55.44.109'] ----> ['100.11.187.82']
ResourceGroup: asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa
SrcIP: nan ----> nan
TenantId: 52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> f9ef3428-3ccb-4ecd-8466-dbedc7044293
VMIPAddress: 10.0.3.5 ----> 10.112.51.93
VMName: msticalertswin1 ----> fmlmbnlpdcbnbnn
====== End Check =====
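
The value-by-value comparison described in the warning above can be sketched for a single row held as plain dicts. `sketch_check_row` is a hypothetical name and this is not the library's implementation; it only illustrates the idea that a column counts as obfuscated when its masked value differs from the original.

```python
from typing import Dict, List, Tuple


def sketch_check_row(
    masked: Dict[str, object], original: Dict[str, object]
) -> Tuple[List[str], List[str]]:
    """Split column names into (unchanged, changed) for one row."""
    unchanged: List[str] = []
    changed: List[str] = []
    for column in sorted(original):
        if masked.get(column) == original[column]:
            unchanged.append(column)
        else:
            changed.append(column)
    return unchanged, changed


orig_row = {"VMName": "msticalertswin1", "L4Protocol": "T", "DestIP": float("nan")}
masked_row = {"VMName": "fmlmbnlpdcbnbnn", "L4Protocol": "T", "DestIP": float("nan")}
print(sketch_check_row(masked_row, orig_row))
```

A plain equality test treats NaN as different from itself, which would plausibly explain why DestIP and SrcIP appear as `nan ----> nan` under "Obfuscated columns" in the outputs above. It is worth reading such entries as "empty", not as "masked".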

Appendix

import tabulate

print(tabulate.tabulate(netflow_df.head(3), tablefmt="rst", showindex=False, headers="keys"))