Data Obfuscation Library
Sharing data, creating documents and doing public demonstrations often require that data containing PII or other sensitive material be obfuscated.
MSTICPy contains a simple library to obfuscate data using hashing and random mapping of values. You can use these functions on single data items or on entire DataFrames.
Contents
Import the module
Read in some data for the examples
Individual Obfuscation Functions
Here we're importing individual functions but you can access them with the single import statement above as:
etc.
Note In the next cell we're using a function to output documentation and examples.
You can ignore this. The usage of each function is shown in the output of
the subsequent cells.
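The two approaches named in the introduction, hashing and random mapping, can be illustrated with a minimal standard-library sketch. This is not MSTICPy's implementation: `hash_value`, `map_guid` and the sample inputs are hypothetical names chosen for illustration, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import uuid


def hash_value(value: str) -> str:
    """Return a deterministic hex digest that stands in for the input string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Remembers which random replacement was issued for each original GUID.
_GUID_MAP = {}


def map_guid(value: str) -> str:
    """Replace a GUID with a random one, reusing the replacement for repeats."""
    if value not in _GUID_MAP:
        _GUID_MAP[value] = str(uuid.uuid4())
    return _GUID_MAP[value]


# Hypothetical sample values.
hashed = hash_value("sensitive-host-01")
mapped = map_guid("40dcc8bf-0478-4f3b-b275-ed0a94f2c013")
print(hashed)
print(mapped)
```

Both approaches return the same output for a repeated input, so relationships between rows survive obfuscation. The hash is reproducible across sessions, while the random mapping in this sketch is only stable for as long as the mapping table is kept in memory.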
Obfuscating DataFrames
We can use the msticpy pandas extension to obfuscate an entire DataFrame.
The obfuscation library contains a mapping for a number of common field names. You can view this list by displaying the attribute:
In the first example, the TenantId, ResourceGroup and VMName columns have been obfuscated.
Adding custom column mappings
Note in the previous example that the VMIPAddress, PublicIPs and AllExtIPs columns were unchanged.
We can add these columns to a custom mapping dictionary and re-run the obfuscation. See the later section on Creating Custom Mappings.
obfuscate_df function
You can also call the standard function obfuscate_df to perform the same operation on the dataframe passed as the data parameter.
Creating custom mappings
A custom mapping dictionary has entries in the following form:
The operation defines the type of obfuscation method used for that column. Both the column and the operation code must be quoted.
| operation code | obfuscation function |
|---|---|
| "uuid" | replace_guid |
| "ip" | hash_ip |
| "str" | hash_string |
| "dict" | hash_dict |
| "list" | hash_list |
| "sid" | hash_sid |
| "null" | "null"* |
| None | hash_string* |
| delims_str | hash_item* |
*The last three items require some explanation:
null - the null operation code means set the value to empty, i.e. delete the value in the output frame.
None (i.e. the dictionary value is None) - defaults to hash_string.
delims_str - any string other than those named above is assumed to be a string of delimiters. See the next section for a discussion of the use of delimiters.
NOTE If you want to only use custom mappings and ignore the builtin
mapping table, specify use_default=False as a parameter to either
mp_obf.obfuscate() or obfuscate_df.
Using hash_item with delimiters to preserve the structure/look of the hashed input
Using hash_item with a delimiters string lets you create output that somewhat resembles the input type. The delimiters string is specified as a simple string of delimiter characters, e.g. "@\,-"
The input string is broken into substrings using each of the delimiters in the delims_str. The substrings are individually hashed and the resulting substrings joined together using the original delimiters. The string is split in the order of the characters in the delims string.
This allows you to create hashed values that bear some resemblance to the original structure of the string. This might be useful for email addresses, qualified domain names and other structured text.
For example : [email protected]
Using the simple hash_string function the output bears no resemblance to an email address
Using hash_item and specifying the expected delimiters we get something like an email address in the output.
You use hash_item in your Custom Mapping dictionary by specifying a delimiters string as the operation.
Checking Your Obfuscation
You should check that you have correctly masked all of the columns needed. There is a function check_obfuscation to do this.
Use silent=False to print out the results. If you use silent=True (the default), it will return 2 lists: the unchanged columns and the obfuscated columns.
Note by default this will check only the first row of the data. You can check other rows using the index parameter.
Warning The two DataFrames should have a matching index and ordering because the check works by comparing the values in each column, judging that column values that do not match have been obfuscated.
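The comparison logic behind this kind of check can be sketched as follows. To stay dependency-free, the sketch uses plain dictionaries of column lists in place of DataFrames; `check_row` and the sample data are hypothetical and are not the check_obfuscation implementation.

```python
def check_row(original: dict, obfuscated: dict, index: int = 0):
    """Compare one row of two column-oriented tables.

    Returns (unchanged_columns, obfuscated_columns).
    """
    unchanged, changed = [], []
    for column, values in original.items():
        # Equal values at the same position are judged "not obfuscated".
        if obfuscated[column][index] == values[index]:
            unchanged.append(column)
        else:
            changed.append(column)
    return unchanged, changed


# Hypothetical data: VMIPAddress was left out of the column mapping.
original = {
    "TenantId": ["t1", "t2"],
    "VMName": ["vm1", "vm2"],
    "VMIPAddress": ["10.0.0.4", "10.0.0.5"],
}
obfuscated = {
    "TenantId": ["a1f3", "9c2e"],
    "VMName": ["77b0", "e41d"],
    "VMIPAddress": ["10.0.0.4", "10.0.0.5"],
}
unchanged, changed = check_row(original, obfuscated)
print("Unchanged:", unchanged)
print("Obfuscated:", changed)
```

Because values are compared position by position, a reordered table would make every column look obfuscated, which is the reason for the matching-index requirement. Checking a single row can also miss a column whose value at that row happens to be empty, so it is worth trying more than one index.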
We first test the partially-obfuscated DataFrame from earlier.
Checking the fully-obfuscated data set