Guided Hunting - Detect potential network beaconing using Apache Spark via Azure Synapse
Notebook Version: 1.0
Python Version: Python 3.10 - AzureML
Required Packages: azureml-synapse, Msticpy, azure-storage-file-datalake
Platforms Supported: Azure Machine Learning Notebooks connected to Azure Synapse Workspace
Data Source Required: Yes
Data Source: CommonSecurityLogs
Spark Version: 3.1 or above
Description
In this sample guided-scenario notebook, we demonstrate how to set up a continuous data pipeline that stores data in Azure Data Lake Storage (ADLS) and then hunt over that data at scale using distributed processing, via an Azure Synapse workspace connected to a serverless Spark pool. Once a historical dataset is available in ADLS, we can perform common hunting operations, create a baseline of normal behavior using the PySpark API, and apply data transformations to find anomalous behaviors such as periodic network beaconing, as explained in the blog post Detect Network beaconing via Intra-Request time delta patterns in Microsoft Sentinel - Microsoft Tech Community. You can use other Spark APIs to perform additional transformations and understand the data better. The output can be further enriched with geolocation information and visualized using MSTICPy capabilities to identify anomalies.
*** Python modules download may be needed. ***
*** Please run the cells sequentially to avoid errors. Please do not use "run all cells". ***
Table of Contents
Warm-up
Authentication to Azure Resources
Configure Azure ML and Azure Synapse Analytics
Load the Historical and current data
Data Wrangling using Spark
Enrich the results
Conclusion
Warm-up
Note: Install the packages below only on first run, and restart the kernel once done.
Configure Azure ML and Azure Synapse Analytics
Please use the notebook Configure Azure ML and Azure Synapse Analytics to configure the environment.
That notebook configures an existing Azure Synapse workspace to create and connect to a Spark pool. You can then create a linked service and connect the AML workspace to the Azure Synapse workspace.
It also configures data export rules to export the CommonSecurityLog table from the Log Analytics workspace to Azure Data Lake Storage Gen2.
Note: Specify the input parameters in the step below to connect the AML workspace to the Synapse workspace via the linked service.
Authentication to Azure Resources
In this step, we connect the AML workspace to the linked service associated with the Azure Synapse workspace.
Start Spark Session
Enter your Synapse Spark compute below. To find the Spark compute, please follow these steps:
On the AML Studio left menu, navigate to Linked Services
Click on the name of the Linked Service you want to use
Select Spark pools tab
Get the Name of the Spark pool you want to use.
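With the pool name in hand, the Spark session is typically started with the %synapse line magic installed by the azureml-synapse package. A minimal sketch; the compute name is a placeholder, and the exact flags should be verified against your installed azureml-synapse version:

```
%synapse start -c <your-synapse-spark-compute>
```

Cells that should execute on the Spark pool then use the %%synapse cell magic instead of running in the local kernel.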
Data Preparation
In this step, we define several details of the ADLS account and specify the input date and lookback period used to calculate the baseline. Based on the input date and lookback period, we load the data.
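The date arithmetic behind the baseline can be sketched in plain Python. The function name and the YYYY-MM-DD input format are assumptions for illustration; the notebook's own parameter cells may use different names.

```python
from datetime import datetime, timedelta

def baseline_window(input_date: str, lookback_days: int):
    """Return (start, end) ISO dates for the historical baseline.

    `input_date` is the analysis day (YYYY-MM-DD, an assumed format);
    the baseline covers the `lookback_days` days preceding it.
    """
    end = datetime.strptime(input_date, "%Y-%m-%d").date()
    start = end - timedelta(days=lookback_days)
    return start.isoformat(), end.isoformat()

# e.g. a 7-day lookback ending on the analysis date
start, end = baseline_window("2022-06-15", 7)
print(start, end)  # 2022-06-08 2022-06-15
```

These two dates can then be used to select the matching date-partitioned folders in ADLS for the historical load.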
Load Current day
In this step, you will load the data based on the input date specified.
Load Historical data
You can also perform the analysis on all historical data available in your ADLS account. The notebook is currently configured to run only on current date specified in input.
If you need to perform the same analysis on historical data, run the cell below, and in the Data Wrangling using Spark -> Filtering Data code cell, replace the current_df variable with historical_df.
Otherwise, SKIP the cell below, as it will raise an error if you do not have historical data.
Data Wrangling using Spark
Filtering data
In this step, we prepare the dataset by filtering the logs to only those with public/external destination IPs. For this, we use a regex with the Spark rlike API to keep internal-to-external network traffic.
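A minimal sketch of the filtering idea in plain Python, assuming the private ranges to exclude are RFC 1918 plus loopback and link-local; the exact regex in the notebook may differ.

```python
import re

# RFC 1918 / loopback / link-local prefixes; IPs NOT matching are external.
PRIVATE_IP_RE = re.compile(
    r"^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[0-1])\.|127\.|169\.254\.)"
)

def is_external(ip: str) -> bool:
    """True when the dotted-quad IP does not start with a private prefix."""
    return not PRIVATE_IP_RE.match(ip)

ips = ["10.0.0.5", "172.20.1.9", "8.8.8.8", "192.168.1.1", "51.140.1.2"]
external = [ip for ip in ips if is_external(ip)]
print(external)  # ['8.8.8.8', '51.140.1.2']
```

In Spark, the same pattern can be applied with something like df.filter(~F.col("DestinationIP").rlike(pattern)), where F is pyspark.sql.functions; the column name assumes the CommonSecurityLog schema.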
Baseline data to filter known Source IP and Destination IPs
In this step, you can analyze either historical or current data to filter source and destination IPs per the defined criteria.
In the example below, we filter out source IPs whose daily event count exceeds the specified threshold.
You can also keep only the destination IPs that very few source IPs connect to. This reduces false positives by filtering out destinations commonly contacted by internal systems, which are likely benign.
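The two baseline filters can be sketched in plain Python. The counts, thresholds, and variable names here are illustrative assumptions; the notebook applies the equivalent logic with Spark groupBy/count aggregations.

```python
from collections import Counter

# Hypothetical daily event counts per source IP (from the baseline window).
daily_counts = Counter({
    "10.0.0.5": 12000,   # very chatty host (e.g. proxy/scanner): exclude
    "10.0.0.9": 85,
    "10.0.0.12": 40,
})
SRC_EVENT_THRESHOLD = 10000  # assumed threshold

noisy_sources = {ip for ip, n in daily_counts.items() if n > SRC_EVENT_THRESHOLD}
hunt_sources = set(daily_counts) - noisy_sources

# Hypothetical distinct-source counts per destination IP.
dest_distinct_sources = {"13.107.4.50": 250, "203.0.113.7": 1}
MAX_DISTINCT_SOURCES = 5  # assumed threshold

# Keep only destinations contacted by few sources; widely used ones are
# likely benign services and are filtered out.
rare_dests = {d for d, n in dest_distinct_sources.items() if n <= MAX_DISTINCT_SOURCES}
```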
Rank the datasets and Calculate PercentageBeaconing
In this step, we use Spark to wrangle the data by applying the transformations below.
Sort the dataset per Source IP.
Calculate the time difference between each event and the next.
Partition the dataset per Source IP, Destination IP, and Destination Port.
Window the dataset over consecutive events to calculate Timedeltalistcount based on clusters of 1-3 time-delta events.
Calculate percentagebeacon from TotalEventscount and Timedeltalistcount.
Apply thresholds to further reduce false positives.
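The percentagebeacon idea behind the steps above can be made concrete with a plain-Python sketch. Clustering by exact delta value, the epoch-second input, and the function name are simplifying assumptions; the notebook itself computes this with Spark window functions over each (Source IP, Destination IP, Destination Port) partition.

```python
from collections import Counter

def beacon_percentage(timestamps):
    """Share (%) of inter-arrival deltas that fall in the most common delta,
    for sorted epoch-second timestamps of one source/destination/port tuple.
    A value near 100 suggests highly regular, beacon-like traffic."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not deltas:
        return 0.0
    _, count = Counter(deltas).most_common(1)[0]
    return 100.0 * count / len(deltas)

# Ten events exactly 60 seconds apart: every delta is identical.
regular = [i * 60 for i in range(10)]
print(beacon_percentage(regular))  # 100.0

# Irregular human-like traffic scores much lower.
irregular = [0, 5, 100, 101]
print(beacon_percentage(irregular))
```

Thresholding on this percentage (plus a minimum event count) is what separates candidate beacons from ordinary traffic.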
**Spark References:**
Export results from ADLS
In this step, we save the results from the previous step as a single JSON file in ADLS. This file can be downloaded from ADLS and used in a native Python session outside the Spark pool for further data analysis, visualization, etc.
Stop Spark Session
Export results from ADLS to local filesystem
Download the files from ADLS
In the sections below, we provide input details for the ADLS account and then use helper functions to connect, list contents, and download the results locally.
If you need help locating the input details, follow these steps:
Go to the https://web.azuresynapse.net and sign in to your workspace.
In Synapse Studio, click Data, select the Linked tab, and select the container under Azure Data Lake Storage Gen2.
Navigate to the folder in the container, right-click, and select Properties.
Copy the ABFSS path, extract the details, and map them to the input fields.
You can check View account access keys doc to find and retrieve your storage account keys for ADLS account.
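Extracting the input fields from a copied ABFSS path can be done with a small helper. The example path is hypothetical; only the abfss://container@account.dfs.core.windows.net/path layout is assumed.

```python
import re

def parse_abfss(abfss_path: str):
    """Split an ABFSS URI copied from Synapse Studio into the container,
    storage account name, and folder path this notebook asks for."""
    m = re.match(
        r"abfss://(?P<container>[^@]+)@(?P<account>[^.]+)"
        r"\.dfs\.core\.windows\.net/(?P<path>.*)",
        abfss_path,
    )
    if not m:
        raise ValueError(f"Not an ABFSS path: {abfss_path}")
    return m.group("container"), m.group("account"), m.group("path")

# hypothetical path for illustration
container, account, folder = parse_abfss(
    "abfss://mycontainer@mystorageacct.dfs.core.windows.net/results/beacon"
)
print(container, account, folder)  # mycontainer mystorageacct results/beacon
```

The account name and container then feed the azure-storage-file-datalake client used to list and download the result files.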
Warning: If you are using secrets such as storage account keys in the notebook, you should
store them either in the msticpyconfig file on the compute instance or in
Azure Key Vault rather than in the notebook itself.
Read more about using KeyVault
in the MSTICPY docs
Display results
Enrich results
In this section, we enrich the entities retrieved from the network beaconing results, such as IP addresses. Enrichments useful in an investigation include IP geolocation, WHOIS registrar information, and threat intelligence lookups.
For first-time users, please refer to the Getting Started Guide For Microsoft Sentinel ML Notebooks, section Create your configuration file, to create your msticpyconfig.
IP Geolocation Enrichment
In this step, we use MSTICPy geolocation capabilities backed by the MaxMind database. You will need a MaxMind API key to download the database.
Learn more about MSTICPy GeoIP providers...
Whois registration enrichment
In this step, we perform a WHOIS lookup on all public destination IPs and populate additional information such as the ASN. You can use this output to further filter known ASNs from the results.
ThreatIntel Enrichment
In this step, we perform threat intelligence lookups using MSTICPy and open source TI providers such as IBM XForce, VirusTotal, GreyNoise, etc. The example below shows a lookup on a single IP as well as a bulk lookup on all IPs using the IBM XForce TI provider.
You will need to register with IBM XForce and enter the API keys into msticpyconfig.yaml
- More details are shown in the A Tour of Cybersec notebook features notebook
- Threat Intel Lookups in MSTICPy
- To learn more about adding TI sources, see the TI Provider setup in the A Getting Started Guide For Microsoft Sentinel ML Notebooks notebook
Visualization
MSTICPy also includes a feature that lets you map locations; this can be particularly useful when looking at the distribution of remote network connections or other events. Below we plot the locations of the destination IPs observed in our results.
Conclusion
We started our hunt on very large datasets of firewall logs. Due to the sheer scale of the data, we leveraged Spark to load it.
We then performed baselining on historical data and used it to further filter the current day's dataset. Next, we performed various data transformations using Spark features such as partitioning, windowing, and ranking the dataset to find outbound network-beaconing-like behavior.
To analyze the data further, we enriched the IP entities in the result dataset with additional information such as geolocation, WHOIS registration, and threat intelligence lookups.
Analysts can investigate selected IP addresses from the enrichment results further by correlating the various data sources available. You can then create incidents in Microsoft Sentinel and track the investigation there.