Masquerading Process Name Anomaly Algorithm
Notebook Version: 1.0
Python Version: Python 3.8
Required Packages: azure_sentinel_utilities, damerauLevenshtein, azureml-synapse
Platforms Supported: Azure Synapse Workspace, Azure Sentinel, Azure Log Analytics Workspace, Storage Account, Azure Machine Learning Notebooks connected to Azure Synapse Workspace
Data Source Required: Yes
Data Source: SecurityEvents
Spark Version: 3.1 or above
Description
This notebook demonstrates how to apply custom machine learning algorithms to data in Azure Sentinel. It showcases a Masquerading Process Name anomaly algorithm, which looks for Windows process creation events for processes whose names are similar to known normal processes. A very common attack technique is for a malicious process to masquerade as a known normal process by using a name that differs from the normal one by a single character. Because such names are easy to miss on visual inspection, they can succeed at running malicious code on your machine. Examples are scvhost.exe and svch0st.exe, both of which imitate the known normal process svchost.exe.
The data used here comes from the SecurityEvents table, filtered to EventID = 4688. These events correspond to process creation on Windows machines.
You will have to export this data from your Log Analytics (LA) workspace into a storage account. Instructions can be found here: LA export mechanism.
Here is a blog post explaining data export.
Data is then loaded from this storage account container and the results are published to your Log Analytics resource.
This notebook can be run either from the AML platform or directly on Synapse. The setup differs depending on which you choose. Please follow either section A or section B, whichever suits you, to complete the setup before running the main PySpark code.
A. Running on AML
You will need to configure your environment to use a Synapse cluster with your AML workspace. To do this, you need to set up the Synapse compute and attach the necessary packages/wheel files. Then, for the rest of the code, you need to switch to the Synapse language by marking each cell with a %%synapse header.
Steps:
Install AzureML Synapse package on the AML compute to use spark magics
Configure AzureML and Azure Synapse Analytics
Attach the required packages and wheel files to the compute.
Start Spark session
1. Install AzureML Synapse package on the AML compute to use spark magics
You will have to set up the AML compute that is attached to your notebook with some packages so that the rest of this code can run properly.
2. Configure AzureML and Azure Synapse Analytics
Please use the notebook 'Configurate Azure ML and Azure Synapse Analytics' to configure the environment.
That notebook configures an existing Azure Synapse workspace to create and connect to a Spark pool. You can then create a linked service and connect the AML workspace to the Azure Synapse workspace. You can skip point 6, which exports data from Log Analytics to Data Lake Storage Gen2, because you have already set up the data export to the storage account above.
Note: Specify the input parameters in the step below in order to connect the AML workspace to the Synapse workspace using the linked service.
Authentication to Azure Resources:
In this step we will connect the AML workspace to the linked service that is connected to the Azure Synapse workspace.
3. Attach the required packages and wheel files to the compute.
You will have to set up the Spark pool that is attached to your notebook with some packages so that the rest of this code can run properly.
Please follow these steps:
On the AML Studio left menu, navigate to Linked Services
Click on the name of the Linked Service you want to use
Select Spark pools tab
Click the Spark pool you want to use.
In Synapse Properties, click the Synapse workspace. It will open the workspace in a new tab.
Click on 'Manage' in the left window.
Click on 'Apache Spark pools' in the left window.
Select the '...' in the pool you want to use and click on 'Packages'.
Now upload the following two files in this blade.
a. Create a requirements.txt with the following line in it and upload it to the Requirements section
b. Download the azure_sentinel_utilities whl package from Repo
c. Then select this package from there in this tab.
4. Start Spark session
Enter your Synapse Spark compute below. To find the Spark compute, please follow these steps:
On the AML Studio left menu, navigate to Linked Services
Click on the name of the Linked Service you want to use
Select Spark pools tab
Get the Name of the Spark pool you want to use.
B. Running directly on Synapse
You will need to attach the required packages and wheel files to the cluster you intend to use with this notebook. Follow Step 3 above to complete this.
Common Code
From here on, all the steps are the same for both the AML and Synapse platforms. The main difference is that, if you have set up through AML, you must prepend the Synapse header %%synapse to each PySpark block. For Synapse runs, do not add that header.
One-time: Set credentials in KeyVault so the notebook can access them
Store the following secrets in the KeyVault
Storage Account connection string: the keyName should be 'saConnectionString'
Log Analytics workspaceSharedKey: the keyName should be 'wsSharedKey'
Log Analytics workspaceId: the keyName should be 'wsId'
Log Analytics workspaceResourceId: the keyName should be 'wsResourceId'
Add the KeyVault as a linked service to your Azure Synapse workspace
Ensure the settings in the cell below are filled in.
These are some customizable variables which are used further in the code.
Making Connections to the Storage Account and KeyVaults for user credentials
This cell defines the helper functions.
calcDist() -> calculates the Levenshtein distance, a measure of the difference between two sequences expressed as an edit distance. It takes into account the number of differing characters as well as the lengths of the sequences. If both process names have the same extension, the extension is excluded when calculating the distance.
getRandomTimeStamp() -> calculates a random timestamp. This is added to the synthetically created process events.
getKnownNormalProcs() -> creates a hardcoded list of known normal processes which malicious processes may masquerade as.
getSyntheticMaliciousProcs() -> creates a list of potentially malicious processes by modifying a single random letter of the normal processes to form new names.
getSyntheticEvents() -> synthetically creates a list of 4688 events. It takes the known normal and synthetically created malicious process names from the previous functions and builds complete events by adding a timestamp and a process path.
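As a rough illustration of the synthetic-data helpers described above, here is a minimal sketch in plain Python. It is not the notebook's own code: the snake_case function names, the `known_normal` list, the `base_path` value, and the seeded `rng` are all assumptions chosen for the example. Only the four event fields and the idea of changing a single character come from the descriptions above.

```python
import random
import string
from datetime import datetime, timedelta


def get_random_timestamp(rng, start, end):
    """Return a random datetime between start and end (one-second resolution)."""
    span_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randint(0, span_seconds))


def get_synthetic_malicious_procs(rng, normal_procs):
    """Change one random character in each name's stem, keeping the extension."""
    mutated = []
    for name in normal_procs:
        stem, dot, ext = name.rpartition(".")
        if not dot:  # no extension: treat the whole name as the stem
            stem, ext = name, ""
        pos = rng.randrange(len(stem))
        # Exclude the current character so the new name always differs.
        choices = [c for c in string.ascii_lowercase + string.digits
                   if c != stem[pos].lower()]
        mutated.append(stem[:pos] + rng.choice(choices) + stem[pos + 1:] + dot + ext)
    return mutated


def get_synthetic_events(rng, normal_procs, start, end,
                         base_path="C:\\Windows\\System32\\"):  # hypothetical path
    """Build 4688-style events for the normal names and their mutated variants."""
    events = []
    for name in normal_procs + get_synthetic_malicious_procs(rng, normal_procs):
        events.append({
            "EventID": 4688,
            "NewProcessName": base_path + name,
            "Process": name,
            "TimeGenerated": get_random_timestamp(rng, start, end),
        })
    return events


rng = random.Random(0)  # seeded only to make the example repeatable
known_normal = ["svchost.exe", "lsass.exe", "explorer.exe"]  # illustrative list
events = get_synthetic_events(rng, known_normal,
                              datetime(2021, 1, 1), datetime(2021, 1, 8))
```

Each mutated name has the same length as its original and differs in exactly one position, which is the property the detection step relies on.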
Next, we define the schema of the input and get the raw customer 4688 events. We are using the following details: EventID, NewProcessName, Process, TimeGenerated.
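To make the shape of the input concrete, the sketch below keeps only the four fields listed above and drops everything that is not a 4688 event. It uses plain Python rather than a Spark schema, and it assumes the exported data arrives as one JSON record per line. The `read_process_events` helper and the sample records are hypothetical.

```python
import json

# The four fields named in the text above.
FIELDS = ("EventID", "NewProcessName", "Process", "TimeGenerated")


def read_process_events(lines):
    """Parse JSON lines and keep only process creation (4688) events."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("EventID") != 4688:
            continue
        # Project each record down to the fields the analysis needs.
        events.append({field: record.get(field) for field in FIELDS})
    return events


# Hypothetical sample records, one JSON document per line.
sample_lines = [
    json.dumps({"EventID": 4688,
                "NewProcessName": "C:\\Windows\\System32\\svchost.exe",
                "Process": "svchost.exe",
                "TimeGenerated": "2021-01-01T00:00:00Z",
                "Computer": "host1"}),
    json.dumps({"EventID": 4624,
                "TimeGenerated": "2021-01-01T00:00:05Z",
                "Computer": "host1"}),
]
process_events = read_process_events(sample_lines)
```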
Here we append synthetically created normal and malicious process creation events. This is done to demonstrate the algorithm's performance by ensuring that some masquerading process names are present to be caught.
We compare frequent processes with infrequent ones to decide whether a process is malicious.
The approach is to treat processes that occur more often than the 'frequentThreshold' percentile as normal and those that occur less often than the 'infrequentThreshold' percentile as potentially malicious. Processes in the middle range are excluded from the analysis because they fall in a grey area: they are relatively popular but fall below the first threshold.
The values of these thresholds can be customized by you based on your needs.
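One way to read the thresholding step is sketched below in plain Python. It assumes the two thresholds are percentiles of the per-process event counts, computed with the nearest-rank method. That interpretation, the helper names, and the sample counts are assumptions for illustration, and the notebook may compute the cut-offs differently.

```python
import math
from collections import Counter


def nearest_rank_percentile(sorted_values, pct):
    """Value at the pct-th percentile of an ascending list (nearest-rank method)."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def split_by_frequency(process_names, frequent_threshold, infrequent_threshold):
    """Return (frequent, infrequent) name lists; the middle range is dropped."""
    counts = Counter(name.lower() for name in process_names)
    ordered = sorted(counts.values())
    high_cut = nearest_rank_percentile(ordered, frequent_threshold)
    low_cut = nearest_rank_percentile(ordered, infrequent_threshold)
    frequent = sorted(n for n, c in counts.items() if c > high_cut)
    infrequent = sorted(n for n, c in counts.items() if c < low_cut)
    return frequent, infrequent


# Illustrative counts only: 60, 36, 20, 3 and 1 events per process name.
observed = (["svchost.exe"] * 60 + ["explorer.exe"] * 36 + ["lsass.exe"] * 20
            + ["notepad.exe"] * 3 + ["scvhost.exe"] * 1)
frequent, infrequent = split_by_frequency(observed,
                                          frequent_threshold=50,
                                          infrequent_threshold=30)
# frequent   -> ['explorer.exe', 'svchost.exe']
# infrequent -> ['scvhost.exe']
```

In this sample, lsass.exe and notepad.exe land in the excluded middle range, which shows how the choice of thresholds controls what gets compared at all.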
Next we find the Levenshtein distance between the normal and potentially malicious processes to check whether we have any masquerading processes.
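The comparison can be sketched as follows. This is a self-contained stand-in rather than the notebook's calcDist(): it implements the optimal string alignment variant of Damerau-Levenshtein distance by hand, lowercases names before comparing, and returns a raw (not normalized) distance. The `max_dist` parameter and the function names are assumptions.

```python
def osa_distance(a, b):
    """Edit distance allowing insert, delete, substitute, adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def calc_dist(proc_a, proc_b):
    """Distance between two process names, ignoring a shared extension."""
    a, b = proc_a.lower(), proc_b.lower()  # assumption: case-insensitive compare
    stem_a, dot_a, ext_a = a.rpartition(".")
    stem_b, dot_b, ext_b = b.rpartition(".")
    if dot_a and dot_b and ext_a == ext_b:
        a, b = stem_a, stem_b
    return osa_distance(a, b)


def find_masquerades(frequent, infrequent, max_dist=1):
    """Pair each infrequent name with the frequent names it closely resembles."""
    matches = []
    for suspect in infrequent:
        for normal in frequent:
            dist = calc_dist(suspect, normal)
            if 0 < dist <= max_dist:
                matches.append((suspect, normal, dist))
    return matches


matches = find_masquerades(["svchost.exe", "explorer.exe"],
                           ["scvhost.exe", "svch0st.exe", "notepad.exe"])
# matches -> [('scvhost.exe', 'svchost.exe', 1), ('svch0st.exe', 'svchost.exe', 1)]
```

Counting an adjacent swap as a single edit matters here: under plain Levenshtein distance, scvhost.exe would score 2 against svchost.exe and slip past a threshold of 1.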
It is always useful to have the paths from which the processes were spawned, because they help in judging whether a process is malicious. This cell finds the paths of all the processes, for context. We also filter based on threshold values, which you can alter to better fit your criteria.
To remove noise, we extract only the process names and path information of the potentially malicious processes.
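The last filtering step can be pictured with a small plain-Python sketch that collects the distinct paths seen for each flagged name. The `paths_for_suspects` helper and the sample events (including the Temp path) are hypothetical. The field names follow the input schema described earlier.

```python
from collections import defaultdict


def paths_for_suspects(events, suspect_names):
    """Map each flagged process name to the sorted list of paths it ran from."""
    suspects = {name.lower() for name in suspect_names}
    paths = defaultdict(set)
    for event in events:
        name = event["Process"].lower()
        if name in suspects:
            paths[name].add(event["NewProcessName"])
    return {name: sorted(found) for name, found in paths.items()}


# Hypothetical events for illustration.
sample_events = [
    {"Process": "svchost.exe", "NewProcessName": "C:\\Windows\\System32\\svchost.exe"},
    {"Process": "scvhost.exe", "NewProcessName": "C:\\Users\\Public\\Temp\\scvhost.exe"},
    {"Process": "scvhost.exe", "NewProcessName": "C:\\Users\\Public\\Temp\\scvhost.exe"},
]
suspect_paths = paths_for_suspects(sample_events, ["scvhost.exe"])
```

A look-alike name running from a user-writable directory rather than a system directory is the kind of context this output is meant to surface.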
Sending results to Log Analytics
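The notebook publishes results through the azure_sentinel_utilities package, whose interface is not shown here. As background only, the sketch below shows how a request to the Azure Monitor HTTP Data Collector API is signed with a workspace ID and shared key, on the assumption that this is the kind of mechanism involved. The `build_signature` helper, the placeholder workspace ID, and the dummy key are all hypothetical, and no request is actually sent.

```python
import base64
import hashlib
import hmac
import json


def build_signature(workspace_id, shared_key, rfc1123_date, content_length,
                    method="POST", content_type="application/json",
                    resource="/api/logs"):
    """Build the SharedKey Authorization header value for a Data Collector request."""
    string_to_sign = (f"{method}\n{content_length}\n{content_type}\n"
                      f"x-ms-date:{rfc1123_date}\n{resource}")
    digest = hmac.new(base64.b64decode(shared_key),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    return f"SharedKey {workspace_id}:{base64.b64encode(digest).decode('utf-8')}"


# Placeholder values; real ones would come from the wsId and wsSharedKey secrets.
workspace_id = "00000000-0000-0000-0000-000000000000"
shared_key = base64.b64encode(b"not-a-real-key").decode("ascii")
body = json.dumps([{"Process": "scvhost.exe", "SimilarTo": "svchost.exe"}])
auth_header = build_signature(workspace_id, shared_key,
                              "Mon, 04 Jan 2021 10:00:00 GMT",
                              len(body.encode("utf-8")))
```

The resulting value goes in the Authorization header of the POST request, alongside an x-ms-date header carrying the same date string that was signed.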