GitHub Repository: Azure/Azure-Sentinel-Notebooks
Path: blob/master/tutorials-and-examples/feature-tutorials/DataObfuscation.ipynb

Kernel: Python 3

Data Obfuscation Library

Sharing data, creating documents and doing public demonstrations often require that data containing PII or other sensitive material be obfuscated.

MSTICPy contains a simple library to obfuscate data using hashing and random mapping of values. You can use these functions on single data items or on entire DataFrames.

Import the module

In [1]:

import pandas as pd
from msticpy.common.utility import md
from msticpy.data import data_obfus

Read in some data for the examples

In [2]:


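import ast

netflow_df = pd.read_csv("data/az_net_flows.csv")
# list is imported as string from csv - convert back to list
# (ast.literal_eval only parses Python literals, so it is safer than eval)
def str_to_list(val):
    if isinstance(val, str):
        return ast.literal_eval(val)
    return val  # leave non-string values (e.g. NaN) unchanged
netflow_df["PublicIPs"] = netflow_df["PublicIPs"].apply(str_to_list)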

# Define subset of output columns
out_cols = [
    'TenantId', 'TimeGenerated', 'FlowStartTime',
    'ResourceGroup', 'VMName', 'VMIPAddress', 'PublicIPs',
    'SrcIP', 'DestIP', 'L4Protocol', 'AllExtIPs'
]
netflow_df = netflow_df[out_cols]

Individual Obfuscation Functions

Here we're importing individual functions, but you can also access them through the module imported above, for example:

data_obfus.hash_string(...)

etc.

Note: In the next cell we're using a function to output documentation and examples.
You can ignore this. The usage of each function is shown in the output of
the subsequent cells.

In [3]:

from msticpy.data.data_obfus import (
    hash_dict,
    hash_ip,
    hash_item,
    hash_list,
    hash_sid,
    hash_string,
    replace_guid
)

# Function to automate/format the examples below. You can ignore this
def show_func(func, examples):
    func_name = func.__name__
    if func.__name__.startswith("_"):
        func_name = func_name[1:]
    md(func_name, "bold")
    print(func.__doc__)
    md("Examples", "bold")
    for example in examples:
        if isinstance(example, tuple):
            arg, delim = example
            print(
                f"{func_name}('{arg}', delim='{delim}') =>", func(*example)
            )
        else:
            print(
                f"{func_name}('{example}') =>", func(example)
            )
    md("<br><hr><br>")

In [4]:

md("hash_string", "large, bold")
md("hash_string does a simple hash of the input. If the input is a numeric string it will output a numeric string.")
show_func(hash_string, ["sensitive data", "42424"])

Out[4]:

    Hash a simple string.

    Parameters
    ----------
    input_str : str
        The input string

    Returns
    -------
    str
        The obfuscated output string

    

hash_string('sensitive data') => jdiqcnrqmlidkd
hash_string('42424') => 98478

In [5]:

md("hash_item", "large, bold")
md("hash_item allows specification of delimiters. Useful for preserving the look of domains, emails, etc.")
show_func(hash_item, [("sensitive data", " "), ("most-sensitive-data/here", " /-")])

Out[5]:

    Hash a simple string.

    Parameters
    ----------
    input_item : str
        The input string
    delim: str, optional
        A string of delimiters to use to split the input string
        prior to hashing.

    Returns
    -------
    str
        The obfuscated output string

    

hash_item('sensitive data', delim=' ') => kdneqoiia laoe
hash_item('most-sensitive-data/here', delim=' /-') => kmea-kdneqoiia-laoe/fcec

In [6]:

md("hash_ip", "large, bold")
md("hash_ip will output random mappings of input IP V4 and V6 addresses.")
md("Within a Python session the mapping will remain constant.")
show_func(hash_ip, [
    "192.168.3.1", 
    "2001:0db8:85a3:0000:0000:8a2e:0370:7334",
    ["192.168.3.1", "192.168.5.2", "192.168.10.2"],
])

Out[6]:

    Hash IP address or list of IP addresses.

    Parameters
    ----------
    input_item : Union[List[str], str]
        List of IP addresses or single IP address.

    Returns
    -------
    Union[List[str], str]
        List of hashed addresses or single address.
        (depending on input)

    

hash_ip('192.168.3.1') => 192.168.84.105
hash_ip('2001:0db8:85a3:0000:0000:8a2e:0370:7334') => 85d6:7819:9cce:9af1:9af1:24ad:d338:7d03
hash_ip('['192.168.3.1', '192.168.5.2', '192.168.10.2']') => ['192.168.84.105', '192.168.172.202', '192.168.232.202']

In [7]:

md("hash_sid", "large, bold")
md("hash_sid will randomize the domain-specific parts of a SID. It preserves built-in SIDs and well known RIDs (e.g. Admins -500)")
show_func(hash_sid, ["S-1-5-21-1180699209-877415012-3182924384-1004", "S-1-5-18"])

Out[7]:

    Hash a SID preserving well-known SIDs and the RID.

    Parameters
    ----------
    sid : str
        SID string

    Returns
    -------
    str
        Hashed SID

    

hash_sid('S-1-5-21-1180699209-877415012-3182924384-1004') => S-1-5-21-3321821741-636458740-4143214142-1004
hash_sid('S-1-5-18') => S-1-5-18
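
The behaviour shown above (domain-specific parts changed, built-in SIDs and the RID preserved) can be sketched in plain Python. This is a minimal illustration, not MSTICPy's implementation: sketch_hash_sid is a hypothetical name, it uses a deterministic SHA-256 digest rather than whatever hash_sid does internally, and it assumes that any SID not of the form S-1-5-21-a-b-c-RID should be left alone.

```python
import hashlib


def sketch_hash_sid(sid):
    """Hash the domain-specific sub-authorities of a SID, keeping prefix and RID."""
    parts = sid.split("-")
    # Assumption: domain/machine SIDs look like S-1-5-21-<a>-<b>-<c>-<RID> (8 parts).
    if len(parts) != 8 or parts[:4] != ["S", "1", "5", "21"]:
        return sid  # everything else (e.g. S-1-5-18) is returned unchanged
    hashed = [
        # Reduce each digest to a 32-bit number so the output still looks like a SID.
        str(int(hashlib.sha256(part.encode()).hexdigest(), 16) % 2**32)
        for part in parts[4:7]
    ]
    return "-".join(parts[:4] + hashed + parts[7:])


masked_sid = sketch_hash_sid("S-1-5-21-1180699209-877415012-3182924384-1004")
system_sid = sketch_hash_sid("S-1-5-18")
print(masked_sid)
print(system_sid)
```

Keeping the RID means that roles such as the built-in Administrator (-500) remain recognisable in the masked data, while the identifying domain portion does not.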

In [8]:

md("hash_list", "large, bold")
md("hash_list will randomize a list of items preserving the list structure.")
show_func(hash_list, [["S-1-5-21-1180699209-877415012-3182924384-1004", "S-1-5-18"]])

Out[8]:

    Hash list of strings.

    Parameters
    ----------
    item_list : List[str]
        Input list

    Returns
    -------
    List[str]
        Hashed list

    

hash_list('['S-1-5-21-1180699209-877415012-3182924384-1004', 'S-1-5-18']') => ['elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'nrllmpbd']

In [9]:

md("hash_dict", "large, bold")
md("hash_dict will randomize a dict of items preserving the structure and the dict keys.")
show_func(hash_dict, [{"SID1": "S-1-5-21-1180699209-877415012-3182924384-1004", "SID2": "S-1-5-18"}])

Out[9]:

    Hash dictionary values.

    Parameters
    ----------
    item_dict : Dict[str, Union[Dict[str, Any], List[Any], str]]
        Input item can be a Dict of strings, lists or other
        dictionaries.

    Returns
    -------
    Dict[str, Any]
        Dictionary with hashed values.

    

hash_dict('{'SID1': 'S-1-5-21-1180699209-877415012-3182924384-1004', 'SID2': 'S-1-5-18'}') => {'SID1': 'elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'SID2': 'nrllmpbd'}

In [10]:

md("replace_guid", "large, bold")
md("replace_guid will output a random UUID mapped to the input.")
md("An input GUID will be mapped to the same newly-generated output UUID")
md("You can see that UUID #4 is the same as #1 and mapped to the same output UUID.")
show_func(replace_guid, [
    "cf1b0b29-08ae-4528-839a-5f66eca2cce9",
    "ed63d29e-6288-4d66-b10d-8847096fc586",
    "ac561203-99b2-4067-a525-60d45ea0d7ff",
    "cf1b0b29-08ae-4528-839a-5f66eca2cce9",
])

Out[10]:

        Replace GUID/UUID with mapped random UUID.

        Parameters
        ----------
        guid : str
            Input UUID.

        Returns
        -------
        str
            Mapped UUID

        

replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9') => 01ae8633-22e5-480f-b884-fc48588c25d9
replace_guid('ed63d29e-6288-4d66-b10d-8847096fc586') => 52cd2814-b5e4-48bd-80f2-51b503e50467
replace_guid('ac561203-99b2-4067-a525-60d45ea0d7ff') => ef059dc7-2d6e-4506-8619-05b346a6bc6b
replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9') => 01ae8633-22e5-480f-b884-fc48588c25d9
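
The repeat-input behaviour seen in the output above can be pictured as a lookup table that is filled on first use. The following is only a sketch of that idea under stated assumptions: sketch_replace_guid and _guid_map are hypothetical names, and MSTICPy's own replace_guid may be implemented differently.

```python
import uuid

_guid_map = {}  # hypothetical in-memory cache: input GUID -> substitute UUID


def sketch_replace_guid(guid):
    """Return a random UUID, reusing the same substitute for a repeated input."""
    if guid not in _guid_map:
        _guid_map[guid] = str(uuid.uuid4())
    return _guid_map[guid]


first = sketch_replace_guid("cf1b0b29-08ae-4528-839a-5f66eca2cce9")
other = sketch_replace_guid("ed63d29e-6288-4d66-b10d-8847096fc586")
repeat = sketch_replace_guid("cf1b0b29-08ae-4528-839a-5f66eca2cce9")
print(first == repeat)  # the repeated input reuses its earlier substitute
print(first == other)   # a different input gets a different substitute
```

Because a table like this lives only in memory, a design of this kind keeps relationships between rows intact within one run but would produce different substitutes in a new session.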

Obfuscating DataFrames

We can use the msticpy pandas extension to obfuscate an entire DataFrame.

The obfuscation library contains a mapping for a number of common field names. You can view this list by displaying the attribute:

data_obfus.OBFUS_COL_MAP

In the first example, the TenantId, ResourceGroup and VMName columns are obfuscated.

In [12]:

display(netflow_df.head(3))
netflow_df.head(3).mp_mask.mask()

Out[12]:

Adding custom column mappings

Note in the previous example that the VMIPAddress, PublicIPs and AllExtIPs columns were unchanged.

We can add these columns to a custom mapping dictionary and re-run the obfuscation. See the later section on Creating Custom Mappings.

In [14]:

col_map = {
    "VMName": ".",
    "VMIPAddress": "ip", 
    "PublicIPs": "ip",
    "AllExtIPs": "ip"
}

netflow_df.head(3).mp_mask.mask(column_map=col_map)

Out[14]:

obfuscate_df function

You can also call the standard function obfuscate_df to perform the same operation on the dataframe passed as the data parameter.

In [15]:

data_obfus.obfuscate_df(data=netflow_df.head(3), column_map=col_map)

Out[15]:

Creating custom mappings

A custom mapping dictionary has entries in the following form:

    "ColumnName": "operation"

The operation defines the type of obfuscation method used for that column. Both the column and the operation code must be quoted.

operation code	obfuscation function
"uuid"	replace_guid
"ip"	hash_ip
"str"	hash_string
"dict"	hash_dict
"list"	hash_list
"sid"	hash_sid
"null"	"null"*
None	hash_string*
delims_str	hash_item*

*The last three items require some explanation:

null - the null operation code means set the value to empty - i.e. delete the value in the output frame.
None (i.e. the dictionary value is None) - defaults to hash_string.
delims_str - any string other than those named above is assumed to be a string of delimiters. See next section for a discussion of use of delimiters.
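
As an illustration of the operation codes listed above, a custom mapping that exercises several of them might look like the following. Apart from TenantId and VMIPAddress, the column names are hypothetical and chosen only to show each kind of entry.

```python
# Hypothetical custom mapping: column names are illustrative only.
custom_map = {
    "TenantId": "uuid",         # replace_guid
    "VMIPAddress": "ip",        # hash_ip
    "UserSid": "sid",           # hash_sid
    "Comment": "null",          # set the value to empty in the output
    "Account": None,            # None defaults to hash_string
    "UserPrincipalName": "@.",  # any other string is a delimiters string for hash_item
}

for column, operation in custom_map.items():
    print(f"{column}: {operation!r}")
```

A dictionary like this is what you would pass as the column_map parameter.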

NOTE: If you want to use only custom mappings and ignore the built-in
mapping table, specify use_default=False as a parameter to either
mp_mask.mask() or obfuscate_df.

Using `hash_item` with delimiters to preserve the structure/look of the hashed input

Using hash_item with a delimiters string lets you create output that somewhat resembles the input type. The delimiters string is specified as a simple string of delimiter characters, e.g. "@\,-"

The input string is broken into substrings using each of the delimiters in the delims_str. The substrings are individually hashed and the resulting substrings joined together using the original delimiters. The string is split in the order of the characters in the delims string.

This allows you to create hashed values that bear some resemblance to the original structure of the string. This might be useful for email addresses, qualified domain names and other structured text.
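
The split, hash and rejoin idea described above can be sketched with the standard library. This is an illustration only, not the MSTICPy implementation: sketch_hash_item is a hypothetical name, it splits on all delimiters in one pass rather than one delimiter at a time, and it uses a truncated SHA-256 hex digest, so its output will not match hash_item.

```python
import hashlib
import re


def sketch_hash_item(text, delims):
    """Hash each delimiter-separated token, keeping the delimiters in place."""
    if not delims:
        return hashlib.sha256(text.encode()).hexdigest()[: len(text)]
    # The capturing group makes re.split return the delimiters as well as the tokens.
    pattern = "([" + re.escape(delims) + "])"
    pieces = []
    for piece in re.split(pattern, text):
        if piece == "" or piece in delims:
            pieces.append(piece)  # delimiter or empty token: keep unchanged
        else:
            digest = hashlib.sha256(piece.encode()).hexdigest()
            pieces.append(digest[: len(piece)])  # keep token length (up to 64 chars)
    return "".join(pieces)


masked_item = sketch_hash_item("most-sensitive-data/here", " /-")
print(masked_item)
```

Because each token is hashed independently, the same token (for example a shared domain name) produces the same masked token wherever it appears, which keeps related values recognisably related.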

For example : [email protected]

Using the simple hash_string function the output bears no resemblance to an email address

In [16]:

hash_string("[email protected]")

Out[16]:

'prqocjmdpbodrafn'

Using hash_item and specifying the expected delimiters we get something like an email address in the output.

In [17]:

hash_item("[email protected]", "@.")

Out[17]:

'[email protected]'

You use hash_item in your Custom Mapping dictionary by specifying a delimiters string as the operation.

Checking Your Obfuscation

You should check that you have correctly masked all of the columns needed. There is a function check_obfuscation to do this.

Use silent=False to print out the results. If you use silent=True (the default), it will return two lists: the unchanged columns and the obfuscated columns.

data_obfus.check_obfuscation(
    data: pandas.core.frame.DataFrame,
    orig_data: pandas.core.frame.DataFrame,
    index: int = 0,
    silent=True,
) -> Union[Tuple[List[str], List[str]], NoneType]

Check the obfuscation results for a row.
Parameters
----------
data : pd.DataFrame
    Obfuscated DataFrame
orig_data : pd.DataFrame
    Original DataFrame
index : int, optional
    The row to check, by default 0
silent: bool
    If False the function returns no output and
    returns lists of changed and unchanged columns.
    By default, True

Returns
-------
Optional[Tuple[List[str], List[str]]] :
    If silent is True returns a tuple of unchanged, changed
    items. If False, returns None.

Note by default this will check only the first row of the data. You can check other rows using the index parameter.

Warning The two DataFrames should have a matching index and ordering because the check works by comparing the values in each column, judging that column values that do not match have been obfuscated.
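
The comparison described in the warning can be sketched for a single row as follows. This is a simplified stand-in that uses plain dictionaries rather than DataFrames; sketch_check_row is a hypothetical name, not part of MSTICPy. It also shows one side effect of comparing values directly: NaN never compares equal to itself, which may be why columns with missing values can be reported as obfuscated (nan ----> nan) even though nothing was masked.

```python
import math


def sketch_check_row(orig_row, obfus_row):
    """Split column names into (unchanged, changed) by comparing values."""
    unchanged, changed = [], []
    for col, orig_val in orig_row.items():
        if obfus_row[col] == orig_val:
            unchanged.append(col)
        else:
            changed.append(col)
    return unchanged, changed


orig = {"VMName": "msticalertswin1", "L4Protocol": "T", "SrcIP": math.nan}
masked = {"VMName": "fmlmbnlpdcbnbnn", "L4Protocol": "T", "SrcIP": math.nan}
unchanged_cols, changed_cols = sketch_check_row(orig, masked)
print(unchanged_cols)  # ['L4Protocol']
print(changed_cols)    # ['VMName', 'SrcIP'] - NaN != NaN, so SrcIP counts as changed
```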

We first test the partially-obfuscated DataFrame from earlier.

In [19]:

partly_obfus_df = netflow_df.head(3).mp_mask.mask()
fully_obfus_df = netflow_df.head(3).mp_mask.mask(column_map=col_map)

data_obfus.check_obfuscation(partly_obfus_df, netflow_df.head(3), silent=False)

Out[19]:

===== Start Check ====
Unchanged columns:
------------------
AllExtIPs: 65.55.44.109
FlowStartTime: 2019-02-12 13:00:07.000
L4Protocol: T
PublicIPs: ['65.55.44.109']
TimeGenerated: 2019-02-12 14:22:40.697
VMIPAddress: 10.0.3.5

Obfuscated columns:
--------------------
DestIP:   nan ----> nan
ResourceGroup:   asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa
SrcIP:   nan ----> nan
TenantId:   52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> f9ef3428-3ccb-4ecd-8466-dbedc7044293
VMName:   msticalertswin1 ----> fmlmbnlpdcbnbnn
====== End Check =====

Checking the fully-obfuscated data set

In [20]:

data_obfus.check_obfuscation(fully_obfus_df, netflow_df.head(3), silent=False)

Out[20]:

===== Start Check ====
Unchanged columns:
------------------
FlowStartTime: 2019-02-12 13:00:07.000
L4Protocol: T
TimeGenerated: 2019-02-12 14:22:40.697

Obfuscated columns:
--------------------
AllExtIPs:   65.55.44.109 ----> 100.11.187.82
DestIP:   nan ----> nan
PublicIPs:   ['65.55.44.109'] ----> ['100.11.187.82']
ResourceGroup:   asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa
SrcIP:   nan ----> nan
TenantId:   52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> f9ef3428-3ccb-4ecd-8466-dbedc7044293
VMIPAddress:   10.0.3.5 ----> 10.112.51.93
VMName:   msticalertswin1 ----> fmlmbnlpdcbnbnn
====== End Check =====

Appendix

In [ ]:

import tabulate
print(tabulate.tabulate(netflow_df.head(3), tablefmt="rst", showindex=False, headers="keys"))
