Investigate Web Application Firewall (WAF) Data
Author: Vani Asawa
Date: December 2020
Notebook Version: 1.0
Python Version: Python 3.6
Required Packages: msticpy, pandas, kqlmagic
Data Sources Required: WAF data (AzureDiagnostics)
What is the purpose of this Notebook?
Web Application Firewall (WAF) data records the monitored and blocked HTTP traffic to and from a web service. Because such services receive a large volume of HTTP requests in any workspace, the data tends to be very noisy, which can prevent an analyst from spotting bad requests to the servers that may indicate a malicious attack.
This notebook analyses the blocked WAF alerts and aims to surface any unusual HTTP requests made by client IPs to the servers. It applies a variety of statistical techniques to several features of the WAF data, such as the rule ID of the triggering event, the HTTP status code returned to the client, and the contents of the request URIs themselves.
Overview
Distribution of WAF logs and blocked alerts over an extended time frame
Set an extended time frame to visualise the distribution of the logs/alerts on a bar graph
Set a time frame (recommended: time period of interest, after analysing the distribution of alerts in the extended time frame)
Pick a host entity to explore in further detail
Set x and y axes from the variables above, and view the number of alerts over the designated time frame.
Cluster the request URIs in WAF blocked alerts, based on TFIDF scores
Term frequency-inverse document frequency (TFIDF) is a numerical statistic that reflects how important a variable (term) is to a document within a collection of documents. The score increases with the variable's frequency in the document and decreases as the number of documents that contain the variable grows. More information about TFIDF can be found here
In our analysis, the variables will be the 'split URIs' and 'rule IDs', while a single document is the set of all blocked alerts for a single client IP in the selected time frame. We will assess the relative importance of every token of the split request URIs, and of the number of times each rule ID is triggered, across multiple such 'documents'. We will use these two sets of scores to cluster the request URIs and obtain single or grouped sets of interesting (and potentially malicious) request URIs that were blocked by the WAF.
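The scoring idea can be illustrated with a minimal sketch that uses only the standard library. It applies the textbook formula tf x log(N / df), which is an assumption made for illustration: scikit-learn's TfidfVectorizer, which the notebook uses, applies a smoothed IDF and L2 normalisation by default, so its numbers will differ. The client IPs and tokens are made-up sample data.

```python
import math
from collections import Counter


def tfidf(documents):
    """Textbook TF-IDF. documents maps a document id to a list of tokens."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        }
    return scores


# One "document" per client IP, holding the tokens of its blocked request URIs
# (hypothetical sample data).
documents = {
    "10.0.0.1": ["api", "login", "api", "admin.php"],
    "10.0.0.2": ["api", "images", "cat.jpg"],
    "10.0.0.3": ["api", "login"],
}
scores = tfidf(documents)
```

In this sample, "api" appears in every document and scores zero, while "admin.php", seen for a single client IP only, gets the highest score in that IP's document.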
Compute TFIDF scores based on the following 2 approaches:
Request URIs split on "/" against the client IP entities
Number of blocked alerts for every Rule ID against the client IP entities
Visualising the TFIDF scores for both approaches
Performing DBSCAN clustering and PCA to obtain the clustered and outlier request URIs for both approaches
KQL query to further examine the WAF logs and blocked alerts in the time frames with outlier request URIs
Using the Notebook
Prerequisites
msticpy - install the latest using pip install --upgrade msticpy
pandas - install using pip install pandas
kqlmagic
Running the Notebook
The best way of using the notebook is as follows:
Individually run all of the cells up to the start of Section 1:
Initialization and installation of libraries
Authenticating to the workspace
Setting notebook parameters
Default parameters will allow the entire notebook to run from Section 1 using the 'Run Selected Cell and All Below' option under the Run tab. However, for added value, run the cells sequentially in any given section.
At the beginning of each section, set the time parameters. It is recommended that the first and third sections have a larger time frame than the second and fourth sections.
Wait for each cell to finish running before proceeding.
Select the options from the widget boxes when displayed and proceed.
Querying function: accesses the results of the Kusto query as a pandas dataframe and removes empty/null columns from the dataframe
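The column clean-up step might look like the sketch below. The notebook's own helper is not shown here, so the function name drop_empty_columns, the column names, and the rule that whitespace-only text counts as empty are all assumptions.

```python
import pandas as pd


def drop_empty_columns(df):
    """Return df without columns that are entirely null or blank text."""

    def is_empty(col):
        if col.isna().all():
            return True
        # Treat text columns holding only empty/whitespace strings as empty.
        if col.dtype == object or pd.api.types.is_string_dtype(col):
            return bool(col.dropna().astype(str).str.strip().eq("").all())
        return False

    keep = [name for name in df.columns if not is_empty(df[name])]
    return df[keep]


# Hypothetical query result with two unusable columns.
raw = pd.DataFrame(
    {
        "ruleId": ["942100", "920300"],
        "all_null": [None, None],
        "all_blank": ["", "  "],
        "clientIp": ["10.0.0.1", "10.0.0.2"],
    }
)
cleaned = drop_empty_columns(raw)
```

Dropping such columns early keeps later widgets and plots from offering fields that carry no information.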
Selecting a Host
Auto determine masking bits for clubbing IPs
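As a rough illustration of grouping client IPs by masking bits, the sketch below collapses addresses into their enclosing network with the standard ipaddress module. The prefix length is passed in by hand; how the notebook chooses it automatically is not shown here, and the helper name mask_ip and the sample addresses are assumptions.

```python
import ipaddress
from collections import Counter


def mask_ip(ip, prefix_len):
    """Return the network (in CIDR form) that contains ip for a given prefix length."""
    # strict=False lets us pass a host address and have the host bits zeroed.
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return str(network)


# Hypothetical client IPs from blocked alerts.
client_ips = ["108.4.12.7", "108.4.200.31", "203.0.113.9"]

# Count alerts per /16 network instead of per individual address.
grouped = Counter(mask_ip(ip, 16) for ip in client_ips)
```

With a 16-bit mask, the first two addresses fall into the same entity, 108.4.0.0/16.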
Select a host entity
The following host entity will be used for the remainder of this section
Render visualisations of the distribution of blocked alerts for the selected host
We will use balloon plots to visualise the number of WAF alerts over rule IDs, HTTP status codes, and client IP entities for the selected host entity.
Enter min_df and max_df value parameters
min_df: The min_df variable is used to eliminate terms that do not appear very frequently in our data. A min_df value of 0.01 eliminates terms that appear in fewer than 1% of the documents.
max_df: The max_df variable eliminates terms that appear very frequently in our data. A max_df value of 0.9 eliminates terms that appear in more than 90% of the documents.
For more information about these parameters in the TFIDF vectorizer, please see here
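The effect of the two thresholds can be seen on a toy corpus. This is a sketch, not the notebook's tfidfScores function: identity_tokenizer is assumed to return its input unchanged (its definition is not shown here), the documents are made up, and token_pattern=None is passed only to avoid scikit-learn's warning about an unused token pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity_tokenizer(tokens):
    # Assumed behaviour: each document is already a list of tokens.
    return tokens


# One pre-tokenised document per client IP (hypothetical sample data).
documents = [
    ["api", "login", "admin.php"],
    ["api", "login", "index.html"],
    ["api", "images", "cat.jpg"],
    ["api", "images", "login"],
]


def fit_vocabulary(min_df=1, max_df=1.0):
    """Fit a vectorizer and return the terms that survive the thresholds."""
    vectorizer = TfidfVectorizer(
        tokenizer=identity_tokenizer,
        lowercase=False,
        token_pattern=None,
        min_df=min_df,
        max_df=max_df,
    )
    vectorizer.fit(documents)
    return sorted(vectorizer.vocabulary_)


all_terms = fit_vocabulary()                           # library defaults, nothing pruned
filtered_terms = fit_vocabulary(min_df=0.5, max_df=0.9)  # keep mid-frequency terms only
```

Here "api" is dropped because it appears in all four documents (more than 90%), and the three terms seen in a single document are dropped because they appear in fewer than 50%. If the thresholds prune every term, scikit-learn raises a ValueError, which is one possible cause of errors at this step.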
Note: If the code below raises errors for either of the two approaches (request URIs split on "/" against the client IP entities, or number of blocked alerts for every rule ID against the client IPs), run the TFIDF vectorizer on ALL the data.
If you would like to view the TFIDF scores for all the data, change the following code in the tfidfScores function:
vectorizer = TfidfVectorizer(tokenizer=identity_tokenizer, lowercase=False, min_df = min_df_value, max_df = max_df_value)
to
vectorizer = TfidfVectorizer(tokenizer=identity_tokenizer, lowercase=False)
Approach I: Compute TFIDF scores for split request URIs in the blocked WAF Alerts against client IP entities
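A minimal sketch of how the 'documents' for this approach could be assembled: every blocked request URI is split on "/" and the resulting tokens are collected per client IP. The function names and the (client IP, request URI) input shape are assumptions for illustration; the notebook's own preprocessing may differ.

```python
from collections import defaultdict


def split_uri(uri):
    """Split a request URI on '/' and drop the empty pieces."""
    return [token for token in uri.split("/") if token]


def build_documents(alerts):
    """Collect the URI tokens of all blocked alerts under their client IP."""
    docs = defaultdict(list)
    for client_ip, uri in alerts:
        docs[client_ip].extend(split_uri(uri))
    return dict(docs)


# Hypothetical (client IP, request URI) pairs taken from blocked alerts.
alerts = [
    ("10.0.0.1", "/api/login"),
    ("10.0.0.1", "/api/admin.php"),
    ("10.0.0.2", "/images/cat.jpg"),
]
uri_documents = build_documents(alerts)
```

Each value in uri_documents is a token list that can be handed to a vectorizer configured with a pass-through tokenizer.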
Approach II: Compute TFIDF scores for volume of blocked WAF alerts for rule IDs against the client IP entities
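For this approach, the input is a table of how many blocked alerts each rule ID produced for each client IP. The sketch below builds such a count matrix with the standard library; the function names, the (client IP, rule ID) input shape, and the sample rule IDs are assumptions for illustration.

```python
from collections import Counter, defaultdict


def rule_counts_per_ip(alerts):
    """Count blocked alerts per rule ID for every client IP."""
    counts = defaultdict(Counter)
    for client_ip, rule_id in alerts:
        counts[client_ip][rule_id] += 1
    return counts


def to_count_matrix(counts):
    """Turn the nested counts into (row labels, column labels, matrix)."""
    ips = sorted(counts)
    rule_ids = sorted({rule for per_ip in counts.values() for rule in per_ip})
    matrix = [[counts[ip][rule] for rule in rule_ids] for ip in ips]
    return ips, rule_ids, matrix


# Hypothetical (client IP, rule ID) pairs taken from blocked alerts.
alerts = [
    ("10.0.0.1", "942100"),
    ("10.0.0.1", "942100"),
    ("10.0.0.1", "920300"),
    ("10.0.0.2", "920300"),
]
ips, rule_ids, matrix = to_count_matrix(rule_counts_per_ip(alerts))
```

A count matrix of this shape is the kind of input that a TF-IDF weighting step, such as scikit-learn's TfidfTransformer, can rescale so that rules triggered by nearly every client carry less weight.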
Visualisation of the TFIDF scores for both approaches
We will be using balloon plots to view the TFIDF scores for the two approaches
DBSCAN Clustering and PCA of the request URIs for both approaches
DBSCAN is a non-parametric, density-based spatial clustering algorithm that groups together points that are closely packed. Points that lie in low-density regions are marked as outliers. For more information, please see here. We use DBSCAN on our data to group request URIs that are similar to each other and to surface unusual request URIs as outliers. The clustering uses the TFIDF scores obtained for the rule ID and split URI approaches respectively.
Select the eps and min_samples values for DBSCAN and the n_components value for PCA below. More information about these parameters can be found here and here.
DBSCAN:
eps: The maximum distance between two points for them to be considered neighbors.
min_samples: The minimum number of neighbors that a point should have in order to be classified as a core point. The core point is included in the min_samples count.
PCA: PCA is a dimensionality reduction technique that compresses multivariate data into principal components, which describe most of the variation in the original dataset. In our case, plotting the first two principal components makes it easier to see which request URIs group together and which are outliers.
n_components: Number of principal components
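How these parameters interact can be sketched on a small, made-up score matrix with scikit-learn. The values of eps, min_samples and n_components below are chosen for this toy data only and are not recommendations for real WAF data; the matrix stands in for TFIDF scores and is an assumption.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

# Hypothetical TFIDF-like scores: one row per request URI, one column per term.
scores = np.array(
    [
        [0.90, 0.10, 0.00],
        [0.88, 0.12, 0.00],
        [0.91, 0.09, 0.01],
        [0.10, 0.90, 0.00],
        [0.12, 0.88, 0.01],
        [0.09, 0.91, 0.00],
        [0.00, 0.00, 1.00],  # far from every other row
    ]
)

# Points closer than eps are neighbors; a core point needs min_samples
# points (itself included) within that distance.
labels = DBSCAN(eps=0.1, min_samples=2).fit_predict(scores)

# Project onto the first two principal components for a 2-D scatter plot.
components = PCA(n_components=2).fit_transform(scores)

# DBSCAN labels outliers with -1.
outlier_rows = [int(i) for i in np.where(labels == -1)[0]]
```

The two tight groups become clusters 0 and 1, and the last row, which has no neighbor within eps, is reported as an outlier. A larger eps would eventually absorb it into a cluster, so the value needs tuning against the actual score distribution.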
Principal Component Analysis
DBSCAN Clustering of the Request URIs
Kusto query to further examine the WAF logs and blocked alerts in the time frames with outlier request URIs
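A follow-up query of this kind could be assembled in Python as sketched below. The notebook's actual query text is not shown here, so everything in this block is an assumption: the Category value and the clientIp_s and requestUri_s column names follow the usual AzureDiagnostics naming for Application Gateway firewall logs and should be checked against the workspace schema, and the helper names are hypothetical.

```python
from datetime import datetime


def kql_string(value):
    """Quote a Python string as a KQL string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_waf_query(start, end, ip_range, request_uri):
    """Build a KQL query for WAF logs in a time window, IP range and URI."""
    return "\n".join(
        [
            "AzureDiagnostics",
            '| where Category == "ApplicationGatewayFirewallLog"',
            "| where TimeGenerated between "
            f"(datetime({start.isoformat()}) .. datetime({end.isoformat()}))",
            f"| where ipv4_is_in_range(clientIp_s, {kql_string(ip_range)})",
            f"| where requestUri_s contains {kql_string(request_uri)}",
        ]
    )


query = build_waf_query(
    datetime(2020, 11, 4, 10, 32, 42),
    datetime(2020, 11, 26, 10, 32, 42),
    "108.4.0.0/16",
    "/images/cat.jpg",  # hypothetical URI fragment
)
```

Escaping the URI before embedding it matters here because outlier request URIs often contain backslashes or quotes that would otherwise break the query.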
Start time: 2020-11-04 10:32:42.885697
End time: 2020-11-26 10:32:42.885697
Ip Address entered: 108.4.0.0/16
Request Uri entered: \\xcc\\xb2\\xcc\\x85]-1572603645543.jpg