GitHub Repository: Azure/Azure-Sentinel-Notebooks
Path: blob/master/scenario-notebooks/Guided Investigation - WAF data.ipynb
³²⁵⁰ views

Kernel: Python 3.8 - AzureML

Investigate Web Application Firewall (WAF) Data

Author: Vani Asawa
Date: December 2020
Notebook Version: 1.0
Python Version: Python 3.6
Required Packages: msticpy, pandas, kqlmagic
Data Sources Required: WAF data (AzureDiagnostics)

What is the purpose of this Notebook?

Web Application Firewall (WAF) data records the monitored and blocked HTTP traffic to and from a web service. Due to the large magnitudes of HTTP requests made to such services in any workspace, the data tends to be incredibly noisy, and hence may prevent an analyst from determining if there are any bad requests made to the servers, which could result in a potentially malicious attack.

This notebook analyses the blocked WAF Alerts and aim to surface any unusual HTTP requests made by the client IPs to the servers, using a variety of statistical techniques applied on several features of the WAF data, such as the Rule ID of the triggering event, the HTTP status code returned to the client from the alerts, and the contents of the request URIs themselves

Overview

Distribution of WAF logs and blocked alerts over an extended time frame

Set an extended time frame to visualise the distribution of the logs/alerts on a bar graph

What is the distribution of WAF blocked alerts over Rule IDs, http-status codes, and client IP entities?

Set a time frame (recommended: time period of interest, after analysing the distribution of alerts in the extended time frame)
Pick a host entity to explore in further detail
Set x and y axes from the variables above, and view the number of alerts over the designate time frame.

Cluster the request URIs in WAF blocked alerts, based on TFIDF scores

Term frequency-inverse document frequency (TFIDF) score is a numerical statistic of how important a variable is to a document. The value of the statistic is directly proportional to the variable's frequency in the document, and inversely proportional to the number of documents that contain the variable. More information about TFIDF can be found here

In our analysis, the variable will be the 'split URIs' and 'rule IDs', while a single document is all the blocked alerts for a single client IP in the selected time frame. We will be assessing the relative importance of every single token of the split request URIs and the number of times a ruleID is triggered for our blocked alerts over multiple such 'documents'. We will be using these two sets of scores to cluster the request URIs, and obtain single/grouped sets of interesting (and potentially malicious) request URIs that were blocked by the WAF.

Compute TFIDF scores based on the following 2 approaches:
- Request URIs split on "/" against the client IP entities
- Number of blocked alerts for every Rule ID against the client IP entities
Visualising the TFIDF scores for both approaches
Performing DBScan Clustering + PCA to obtain the clustered and outlier request URIs for both approaches
KQL query to further examine the WAF logs and blocked alerts in the time frames with outlier request URIs**

Using the Notebook

Prerequisites

msticpy - install the latest using pip install --upgrade msticpy
pandas- install using pip install pandas
kqlmagic

Running the Notebook

The best way of using the notebook is as follows:

Individually run all of the cells up to the start of Section 1:
- Initialization and installation of libraries
- Authenticating to the workspace
- Setting notebook parameters
Default paramenters will allow the entire notebook to run from Section I using the 'Run Selected Cell and All Below' option under the Run tab. However, for added value, run the cells sequentially in any given section.
- At the beginning of each section, set the time parameters. It is recommended that the first and third section have a larger timeframe than the second and fourth sections.
- Wait for the cell to finish running, before proceeding
- Select the options from the widget boxes when displayed and proceed.

In [1]:

from pathlib import Path
import os
import sys
from pathlib import Path
from IPython.display import display, HTML

REQ_PYTHON_VER = "3.10"
REQ_MSTICPY_VER = "2.12.0"
REQ_MP_EXTRAS = ["ml", "kql"]



# If not using Azure Notebooks, install msticpy with
# %pip install msticpy
import msticpy as mp
from msticpy import nbwidgets
from msticpy.nbtools import nbinit
mp.init_notebook(
    namespace=globals(),
    additional_packages=["adjustText", "plotly"]
);

from ipywidgets import widgets
import plotly.graph_objects as go
import plotly.express as px
import re
from sklearn.feature_extraction.text import TfidfVectorizer

%matplotlib inline

from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from adjustText import adjust_text
import itertools
import ipaddress
import traceback

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 40)
pd.set_option('display.max_colwidth', None)

layout = widgets.Layout(width="50%", height="80px")
style = {"description_width": "200px"}

class color:
   BOLD = '\033[1m'
   END = '\033[0m'

In [3]:

# See if we have a Microsoft Sentinel Workspace defined in our config file.
# If not, let the user specify Workspace and Tenant IDs

ws_config = mp.WorkspaceConfig()
if not ws_config.config_loaded:
    ws_config.prompt_for_ws()

qry_prov = mp.QueryProvider(data_environment="AzureSentinel")
print("done")

Out[3]:

Text(value='', description='Enter Tenant ID: ', layout=Layout(height='30px', width='50%'), placeholder='Enter …

Text(value='', description='Enter Workspace ID: ', layout=Layout(height='30px', width='50%'), placeholder='Ent…

In [ ]:

# Authenticate to Microsoft Sentinel workspace
qry_prov.connect(ws_config)

Querying Function : Accessing the results of the Kusto query as a pandas dataframe, and removing empty/null columns from the dataframe

In [3]:

def showQuery(query):
    df = qry_prov.exec_query(query)
    trimDF(df)
    return df

def trimDF(df):
    # Store names of columns with null values for all entries
    empty_null_cols = [col for col in df.columns if df[col].isnull().all()]

    # Store names of columns with empty string '' values for all entries
    empty_str_cols = []
    for col in df.columns:
        try:
            if ''.join(df[col].map(str)) == '':
                empty_str_cols = empty_str_cols + [col]
        except:
            continue

    df.drop(empty_null_cols + empty_str_cols, axis=1, inplace=True)

binIntervals = ['1m', '5m', '10m', '15m', '30m', '1h', '12h', '1d', '5d', '10d']

Selecting a Host

In [4]:

def queryHost(startTime, endTime):
    query = '''
    AzureDiagnostics
    | where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
    | where Category == "ApplicationGatewayFirewallLog"
    | where action_s == 'Blocked' or isempty(action_s)
    | summarize AlertCountPerHost = count() by hostname_s, bin(timeStamp_t, {binInterval})
    | render timechart
    '''.format(startTime = startTime, endTime = endTime, binInterval = '1h')
    return(query)

Auto determine masking bits for clubbing IPs

In [5]:

def maskBitsVal(uniqueIPLen):
    if uniqueIPLen > 150:
        return '/8'
    elif uniqueIPLen > 40:
        return '/16'
    elif uniqueIPLen > 15:
        return '/24'
    return '/32'

Section I: Distribution of WAF logs and blocked alerts over an extended time frame

Select an extended time frame to view the distribution of WAF logs and blocked alerts over all hosts.

In [6]:

query_times_1 = nbwidgets.QueryTime(units='day', max_before=30, before=-15, max_after=-1)
query_times_1.display()

Out[6]:

HTML(value='<h4>Set query time boundaries</h4>')

HBox(children=(DatePicker(value=datetime.date(2020, 12, 4), description='Origin Date'), Text(value='10:31:59.0…

VBox(children=(IntRangeSlider(value=(-15, 1), description='Time Range (day):', layout=Layout(width='80%'), max…

In [7]:

categories = ['ApplicationGatewayAccessLog', 'ApplicationGatewayFirewallLog']

def viewLogs(category):
    log_alert_query = '''
    AzureDiagnostics
    | where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
    | where Category == "{category}"
    | where action_s == 'Blocked' or isempty(action_s)
    | summarize NoOfAlerts= count() by bin(timeStamp_t, {binInterval})
    | render timechart '''.format(startTime = query_times_1.start, endTime = query_times_1.end, category = category, binInterval = '1h')

    %kql -query log_alert_query

    rawDataQuery = """
    AzureDiagnostics
    | where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
    | where Category == '{category}'
    | where action_s == 'Blocked' or isempty(action_s)
    | take 15
    """.format(startTime = query_times_1.start, endTime = query_times_1.end, category = category)

    display(showQuery(rawDataQuery).head(5))

category = widgets.Select(options = categories, style = style, layout = layout, description = 'Choose logs/alerts: ')
display(category)

Out[7]:

interactive(children=(Select(description='Choose logs/alerts: ', layout=Layout(height='80px', width='50%'), op…

In [ ]:

viewLogs(category = category.value)

Section II: What is the distribution of blocked WAF alerts over Rule IDs, http-status codes, and client IP Entities?

Select a time frame of interest to view the distribution of WAF blocked alerts over all hosts.

Recommended: Analyse a shorter time frame than Section I for more detail

In [8]:

query_times_2 = nbwidgets.QueryTime(units='day', max_before=30, before=-10, max_after=-1)
query_times_2.display()

Out[8]:

HTML(value='<h4>Set query time boundaries</h4>')

HBox(children=(DatePicker(value=datetime.date(2020, 12, 4), description='Origin Date'), Text(value='10:32:15.6…

VBox(children=(IntRangeSlider(value=(-10, 1), description='Time Range (day):', layout=Layout(width='80%'), max…

Select a host entity

The following host entity will be used for the remainder of this section

In [9]:

query = queryHost(query_times_2.start, query_times_2.end)
%kql -query query

try:
    df_host = showQuery(query)
    list_hosts = set([x for x in df_host['hostname_s']])
    df = df_host.groupby(['hostname_s']).agg({'AlertCountPerHost': sum}).rename(columns = {'AlertCountPerHost': 'Num_blocked_alerts'})
    hosts = widgets.Select(options=list_hosts, style = style, layout = layout, value=df['Num_blocked_alerts'].idxmax(), description = 'Select Host: ')
    display(df)
    display(hosts)
except Exception as e:
    print('Error: ' + e)
    traceback.print_exc()

Out[9]:

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Select(description='Select Host: ', index=4, layout=Layout(height='80px', width='50%'), options=('', '127.0.0.…

Render visualisations of the distribution of blocked alerts for the selected host

We will be using balloon plots to visualise the number of WAF alerts over rule IDs, http-status codes, and client IP entities, for the selected host entity.

In [10]:

query_distribution = '''
AzureDiagnostics
| where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
| where Category == "ApplicationGatewayFirewallLog"
| where hostname_s == "{host}"
| where action_s == 'Blocked' or isempty(action_s)
| join kind=leftouter ( AzureDiagnostics
| where TimeGenerated between  (datetime({startTime}).. datetime({endTime}))
| where Category == "ApplicationGatewayAccessLog"
| summarize by requestUri_s, httpStatus_d
) on requestUri_s
| summarize NoOfAlerts = count(), make_set(requestUri_s), DistinctURIs = dcount(requestUri_s) by clientIp_s, ruleId_s, httpStatus_d1
'''.format(startTime = query_times_2.start, endTime = query_times_2.end, host = hosts.value)

try:
    df_distribution = showQuery(query_distribution)
    df_distribution.rename(columns = {'clientIp_s':'Ip Address', 'ruleId_s':'Rule ID', 'set_requestUri_s': 'Request Uris'}, inplace = True)

    if 'httpStatus_d1' in df_distribution.columns:
        df_distribution = df_distribution.sort_values(by=['httpStatus_d1'], ascending = True).reset_index(drop = True)
        df_distribution.rename(columns = {'httpStatus_d1':'Http status'}, inplace = True)
        df_distribution['Http status'] = 'h: ' + df_distribution['Http status'].astype(str)

    maskBits = maskBitsVal(len(df_distribution['Ip Address'].unique()))
    df_distribution['Ip Address'] = df_distribution['Ip Address'].apply(lambda x: ipaddress.IPv4Network(x + maskBits, strict = False))

    df_distribution['Ip Address'], df_distribution['Rule ID'] =  'Ip ' + df_distribution['Ip Address'].astype(str), 'rID ' + df_distribution['Rule ID'].astype(str)
except Exception as e:
    print('Error: ' + e)
    traceback.print_exc()

Out[10]:

<IPython.core.display.Javascript object>

In [11]:

options = ['Ip Address', 'Rule ID']
if 'Http status' in df_distribution.columns:
    options += ['Http status']

def viewBalloonPlot(x_axis, y_axis, display_rawResult):
    try:
        df_balloon_plot = (df_distribution
                  .groupby([x_axis, y_axis], as_index=False)
                  .agg({'NoOfAlerts': sum, 'DistinctURIs': sum, 'Request Uris': list})
                  .reset_index(drop = True))
        fig = px.scatter(df_balloon_plot,   x=df_balloon_plot[x_axis], y = df_balloon_plot[y_axis],
                     size= np.log(1 + df_balloon_plot['NoOfAlerts'] ), color = 'NoOfAlerts',
                     hover_data=['NoOfAlerts', 'DistinctURIs'])

        fig.update_layout(height = max(300, 30 * len(set(df_balloon_plot[y_axis]))), title_text='Alert Distribution for host ID '+ str(hosts.value))

        fig.show()
        if display_rawResult == 'Yes':
            print('Top 5 raw results with the highest number of alerts: \n')
            df_balloon_plot['Request Uris'] = [np.unique(list(itertools.chain(*row['Request Uris']))) for index, row in df_balloon_plot.iterrows() ]
            df_balloon_plot['DistinctURIs'] = df_balloon_plot['Request Uris'].str.len()
            display(df_balloon_plot[[y_axis, x_axis, 'NoOfAlerts','Request Uris', 'DistinctURIs']].sort_values(by='NoOfAlerts', ascending = False).head(5))
    except ValueError:
        print('ValueError: Choose distinct x and y axes')
    except Exception as e:
        print('Error: ' + e)
        traceback.print_exc()


x_axis = widgets.Select(options = options, style = style, layout = layout, description = 'Select x-axis: ')
y_axis = widgets.Select(options = options, style = style, layout = layout, description = 'Select y-axis: ')
display_rawResult = widgets.Select(options = ['Yes', 'No'], description = 'Display raw results: ')

md("Select graph properties:", "bold")
display(x_axis)
display(y_axis)
display(display_rawResult)

Out[11]:

interactive(children=(Select(description='Select x-axis: ', layout=Layout(height='80px', width='50%'), options…

In [ ]:

viewBalloonPlot, x_axis = x_axis.value,
                   y_axis = y_axis.value, display_rawResult = display_rawResult.value)

display(w)

Section III: Cluster the request URIs in blocked WAF Alerts, based on TFIDF scores

Select the timeframe and host entity for this section of the notebook.

Recommended: Set a timeframe of >20 days

In [12]:

query_times_3 = nbwidgets.QueryTime(units='day', max_before=30, before=10, max_after=-1)
query_times_3.display()

Out[12]:

HTML(value='<h4>Set query time boundaries</h4>')

HBox(children=(DatePicker(value=datetime.date(2020, 12, 4), description='Origin Date'), Text(value='10:32:42.8…

VBox(children=(IntRangeSlider(value=(-10, 1), description='Time Range (day):', layout=Layout(width='80%'), max…

In [13]:

df_host_2 = showQuery(queryHost(query_times_3.start, query_times_3.end))
df = df_host_2.groupby(['hostname_s']).agg({'AlertCountPerHost': sum}).rename(columns = {'AlertCountPerHost': 'Num_blocked_alerts'})
hosts_2 = widgets.Select(options=set([x for x in df_host_2['hostname_s']]), value=df['Num_blocked_alerts'].idxmax(), description = 'Select Host: ')
display(df)
display(hosts_2)

Out[13]:

<IPython.core.display.Javascript object>

Select(description='Select Host: ', index=6, options=('', '127.0.0.1', '13.89.108.163:80', '112.124.42.80:6343…

Enter min_df and max_df value parameters

min_df: The min_df variable is used to eliminate terms that do not appear very frequently in our data. A min_df value of 0.01 implies eliminating terms that apear in less than 1% of the data.

max_df: The max_df variable eliminates terms that appear very frequently in our data. A max_df value of 0.9 implies eliminating terms that appear in more than 90% of the data.

For more information about these parameters in the TFIDF vectorizer, please see here

Note: In the case of errors running the code below for the two approaches (Request URIs split on "/" against the client IP entities OR Number of blocked alerts for every Rule ID against the client IPs), run the TFIDF vectoriser for ALL the data

If you would like to view the TFIDF scores for all the data, change the following code in the tfidfScores function:

vectorizer = TfidfVectorizer(tokenizer=identity_tokenizer, lowercase=False, min_df = min_df_value, max_df = max_df_value)

vectorizer = TfidfVectorizer(tokenizer=identity_tokenizer, lowercase=False)

In [13]:

min_df_widget = widgets.Text(style = style, layout = widgets.Layout(width="50%", height="30px"), description = 'Enter min_df: ', placeholder = '% or Integer or None', value = '0.01')
max_df_widget = widgets.Text(style = style, layout = widgets.Layout(width="50%", height="30px"), description = 'Enter max_df: ', placeholder = '% or Integer or None', value = '0.9')

display(min_df_widget)
display(max_df_widget)

Out[13]:

Text(value='0.01', description='Enter min_df: ', layout=Layout(height='30px', width='50%'), placeholder='% or …

Text(value='0.9', description='Enter max_df: ', layout=Layout(height='30px', width='50%'), placeholder='% or I…

In [15]:

try:
    min_df_value = float(min_df_widget.value)
    max_df_value = float(max_df_widget.value)
except Exception as e:
    print('Error: ' + str(e))
    traceback.print_exc()

def tfidfScores(df, tokenList = None):
    def identity_tokenizer(text):
        return text

    vectorizer = TfidfVectorizer(tokenizer=identity_tokenizer, lowercase=False, min_df = min_df_value, max_df = max_df_value)
    vectors = vectorizer.fit_transform(tokenList)
    feature_names = vectorizer.get_feature_names()
    dense = vectors.todense()
    denselist = dense.tolist()
    df_scores = pd.DataFrame(denselist, columns = feature_names)
    multicol1 = pd.MultiIndex.from_tuples([('weight', str(j)) for j in df_scores.columns])
    df_multiIndex = pd.DataFrame([list(df_scores.iloc[i]) for i in range(0, len(df_scores))], index=[df['Ip Address']], columns=multicol1)
    return df_multiIndex

Approach I: Compute TFIDF scores for split request URIs in the blocked WAF Alerts against client IP entities

In [16]:

query_URIs = '''
AzureDiagnostics
| where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
| where Category == "ApplicationGatewayFirewallLog"
| where hostname_s startswith "{host}"
| where action_s == 'Blocked' or isempty(action_s)
| distinct clientIp_s, requestUri_s
| summarize make_list(requestUri_s) by clientIp_s
'''.format(startTime = query_times_3.start, endTime = query_times_3.end, host = hosts_2.value)

try:
    df_URIs = showQuery(query_URIs)
    df_URIs.rename(columns = {'clientIp_s':'Ip Address', 'list_requestUri_s': 'RequestUris'}, inplace = True)

    viewData_splitUri = df_URIs.copy()
    maskBits = maskBitsVal(len(viewData_splitUri['Ip Address'].unique()))
    viewData_splitUri['Ip Address'] = viewData_splitUri['Ip Address'].apply(lambda x: ipaddress.IPv4Network(x + maskBits, strict = False))
    viewData_splitUri.groupby(["Ip Address"], as_index=False).agg({'RequestUris': list})

    tokenList = []
    for index, row in viewData_splitUri.iterrows():
        splitUris = re.split('/', ''.join(row['RequestUris']))
        tokenList = tokenList + [splitUris]

    df_splitUri_tfidf = tfidfScores(viewData_splitUri, tokenList)
except Exception as e:
    print('Error: ' + str(e))
    traceback.print_exc()

Out[16]:

<IPython.core.display.Javascript object>

Approach II: Computer TFIDF scores for volume of blocked WAF alerts for Rule Ids against the client IP entities

In [17]:

query_RuleIds = '''
AzureDiagnostics
| where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
| where Category == "ApplicationGatewayFirewallLog"
| where hostname_s startswith "{host}"
| where action_s == 'Blocked'
| summarize alertCount = count(), make_set(requestUri_s) by clientIp_s, ruleId_s
'''.format(startTime = query_times_3.start, endTime = query_times_3.end, host = hosts_2.value)

try:
    dfPrac = showQuery(query_RuleIds)
    df_RuleIds = showQuery(query_RuleIds)
    df_RuleIds.rename(columns = {'clientIp_s':'Ip Address', 'ruleId_s':'RuleId', 'set_requestUri_s': 'RequestUris'}, inplace = True)

    maskBits = maskBitsVal(len(df_RuleIds['Ip Address'].unique()))
    df_RuleIds['Ip Address'] = df_RuleIds['Ip Address'].apply(lambda x: ipaddress.IPv4Network(x + maskBits, strict = False))

    viewData_ruleId = df_RuleIds.groupby(["Ip Address"], as_index=False).agg({'RuleId': list, 'alertCount': list, 'RequestUris': list})
    tokenList = [sum([[s] * n for s, n in zip(viewData_ruleId['RuleId'][x], viewData_ruleId['alertCount'][x])], []) for x in range(0, len(viewData_ruleId))]

    df_ruleId_tfidf = tfidfScores(viewData_ruleId, tokenList)
except Exception as e:
    print('Error: ' + e)
    traceback.print_exc()

Out[17]:

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Visualisation of the TFIDF scores for both approaches

We will be using balloon plots to view the TFIDF scores for the two approaches

In [32]:

options = ['RuleId', 'SplitUris']

def visualiseTFIDF(TfidfCategory):
    try:
        max_category = 30
        df = pd.DataFrame()
        if TfidfCategory == 'RuleId': df = df_ruleId_tfidf.copy()
        else:
            df = df_splitUri_tfidf.copy()

        df_tfidf = df.iloc[:, : max_category].stack().reset_index(drop = False).rename(columns = {'level_1':TfidfCategory, 'weight':'tfidf'})
        df_tfidf['Ip Address'] = 'Ip ' + df_tfidf['Ip Address'].astype(str)
        if 'RuleId' == TfidfCategory:
            df_tfidf['RuleId'] = 'rID ' + df_tfidf['RuleId'].astype(str)
        else:
            df_tfidf['SplitUris'] = df_tfidf['SplitUris'].apply(lambda x:  (x[0:20]+ '...') if len(x)> 20 else x)

        fig = px.scatter(df_tfidf,  x = df_tfidf[TfidfCategory], y = df_tfidf['Ip Address'],
                     size= np.log(1 + df_tfidf['tfidf']), color = df_tfidf['tfidf'],
                     hover_data=[df_tfidf['tfidf']])
        fig.update_layout(height = max(800, 20 * len(set(df_tfidf[TfidfCategory]))), title_text= 'TFIDF distribution of ' +  TfidfCategory + ' against client IPs', width = 1700)
        fig.show()
    except Exception as e:
        print('Error: ' + e)
        traceback.print_exc()
TfidfCategory = widgets.Select(options = options, style = style, layout = layout, description = 'TFIDF approach: ')
display(TfidfCategory)

Out[32]:

interactive(children=(Select(description='TFIDF approach: ', layout=Layout(height='80px', width='50%'), option…

In [ ]:

visualiseTFIDF(TfidfCategory = TfidfCategory.value)

DBSCAN Clustering and PCA of the request URIs for both approaches

DBSCAN is a non-parametric density-based spatial clustering algorithm, which groups together points that are "closely packed" together. Points which lie in low density regions are marked as outliers. For more information, please see here. We use DBScan on our data in order to aggregate request URIs which are similar to each other, and surface unusual request URIs as outliers. The clustering uses the Tfidf scores data obtained for the rule ID and split URIs approaches respectively.

Select the eps and min_samples value for DBScan and n_components value for PCA below. More information about these parameters can be found here and here.

DBScan:

eps value: Eps value is a measure of the distance below which two points are considered neighbors.

min_samples: The minimum number of neighbors that a point should have in order to be classified as a core point. The core point is included in the min_samples count.

PCA: PCA is a dimensionality reduction technique that compresses the multivariate data into principal components, which describe most of the variation in the original dataset. In our case, we are able to better visualise the clubbing of similar and outlier request URIs by visualising the first two Principal components.

n_components: Number of principal components

In [15]:

eps_widget = widgets.Text(style = style, layout = widgets.Layout(width="50%", height="30px"), description = 'DBSCAN - Enter eps value', value = '0.4')
min_samples_widget = widgets.IntSlider(style = style, layout = widgets.Layout(width="50%", height="30px"), description='DBSCAN - Enter min samples', start=1, end=15, step=1, value=5)
n_components_widget = widgets.IntSlider(style = style, layout = widgets.Layout(width="50%", height="30px"), description='PCA - Enter n_components', start=1, end=15, step=1, value=2)

display(eps_widget)
display(min_samples_widget)
display(n_components_widget)

Out[15]:

Text(value='0.4', description='DBSCAN : Enter eps value', layout=Layout(height='30px', width='50%'), style=Des…

IntSlider(value=5, description='DBSCAN : Enter min samples', layout=Layout(height='30px', width='50%'), style=…

IntSlider(value=2, description='PCA : Enter n_components', layout=Layout(height='30px', width='50%'), style=Sl…

In [12]:

def db_scan_clustering(data, eps = float(eps_widget.value)):
    dbscan = DBSCAN(eps=eps, min_samples = int(min_samples.value))
    dbscan.fit(data)
    return dbscan.labels_

def principal_component_analysis(data, eps = float(eps_widget.value)):
    while True:
        try:
            pca = PCA(n_components=int(n_components_widget.value))
            pca.fit(data)
            x_pca = pca.transform(data)
            break
        except:
            continue
    clusters = db_scan_clustering(data.values, eps)

    label = list(range(0, len(data), 1))
    plt.figure(figsize=(20,15))
    scatter = plt.scatter(x_pca[:,0],x_pca[:,1],c = clusters,cmap='rainbow')
    handles, labels = scatter.legend_elements(prop="colors", alpha=0.6)
    plt.legend(handles, labels, loc="upper right", title="Clusters")
    n = list(range(0, len(x_pca[:,0]), 1))
    texts = []
    for i, txt in enumerate(n):
        texts.append(plt.text(x_pca[:,0][i], x_pca[:,1][i], txt))
    adjust_text(texts)

    plt.show()

In [20]:

options1 = ['RuleId', 'SplitUris']

def viewPCA(tfidfCategory):
    df = df_splitUri_tfidf.copy()
    viewData = viewData_splitUri.copy()
    if tfidfCategory == 'RuleId':
        df = df_ruleId_tfidf.copy()
        viewData = viewData_ruleId.copy()

    print(tfidfCategory + ' approach (Outliers + Clustered request URI data): \n')
    while True:
        try:
            principal_component_analysis(df)
            break
        except:
            continue

print(color.BOLD + 'Principal Component Analysis \n' + color.END)
tfidfCategory = widgets.Select(options = options1, style = style, layout = layout, description = 'TFIDF approach: ')
display(tfidfCategory)

Out[20]:

Principal Component Analysis 

interactive(children=(Select(description='TFIDF approach: ', layout=Layout(height='80px', width='50%'), option…

In [ ]:

viewPCA(tfidfCategory = tfidfCategory.value)

In [21]:

options1 = ['RuleId', 'SplitUris']
options2 = ['Outlier', 'Clustered']

def viewClusters(tfidfCategory, requestURIs):
    try:
        df = df_splitUri_tfidf.copy()
        viewData = viewData_splitUri.copy()
        if tfidfCategory == 'RuleId':
            df = df_ruleId_tfidf.copy()
            viewData = viewData_ruleId.copy()

        clusters = db_scan_clustering(df.values)
        print(requestURIs + ' URIs for ' + tfidfCategory+ ': \n')

        clusterList = list(set(clusters))
        try:
            clusterList.remove(-1)
        except:
            print()

        if requestURIs == 'Outlier':
            clusterList = [-1]

        if clusterList:
            for k in clusterList:
                print('Cluster ' + str(k))
                display(viewData[viewData['Ip Address'].isin(df.index.get_level_values(0)[clusters == k])])
        else:
            print('No Data')
    except Exception as e:
        print('Error: ' + e)
        traceback.print_exc()

print(color.BOLD + 'DBScan Clustering of the Request URIs \n' + color.END)
tfidfCategory = widgets.Select(options = options1, style = style, layout = layout, description = 'TFIDF approach: ')
requestURIs = widgets.Select(options = options2, style = style, layout = layout, description = 'Request URIs: ')
display(tfidfCategory)
display(requestURIs)

Out[21]:

DBScan Clustering of the Request URIs 

interactive(children=(Select(description='TFIDF approach: ', layout=Layout(height='80px', width='50%'), option…

In [ ]:

viewClusters, tfidfCategory = widgets.Select(options = options1, style = style, layout = layout, description = 'TFIDF approach: '), requestURIs = widgets.Select(options = options2, style = style, layout = layout, description = 'Request URIs: ') ))

Kusto query to further examine the WAF logs and blocked alerts in the time frames with outlier request URIs

In [22]:

ipAddress = widgets.Text(style = style, layout = widgets.Layout(width="50%", height="30px"), description = 'IP address: ', placeholder = 'Enter masked IP address from the results above. Include masking bits.')
requestURI = widgets.Text(style = style, layout = widgets.Layout(width="50%", height="30px"), description = 'Request URI: ', placeholder = 'Enter request URI from the results above')

print(color.BOLD + '\nStart time: ' + color.END + str(query_times_3.start) + '\n')
print(color.BOLD + 'End time: ' + color.END + str(query_times_3.end) + '\n')

display(ipAddress)
display(requestURI)

Out[22]:

Start time: 2020-11-04 10:32:42.885697

End time: 2020-11-26 10:32:42.885697

Text(value='', description='IP address: ', layout=Layout(height='30px', width='50%'), placeholder='Enter maske…

Text(value='', description='Request URI: ', layout=Layout(height='30px', width='50%'), placeholder='Enter requ…

In [23]:

try:
    pd.set_option('display.max_colwidth', 20)
    kql_query = '''
    AzureDiagnostics
    | where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
    | where Category == "{category}"
    | where {hostname} startswith "{host}"
    | where action_s == 'Blocked' or isempty(action_s)
    | where {ip} startswith "{ipaddress}"
    | extend originalRequestUriWithArgs_s = column_ifexists("originalRequestUriWithArgs_s", "")
    | where requestUri_s contains {uri} or originalRequestUriWithArgs_s contains {uri}
    | take 10
    '''
    cutOff = [1, 2, 3, 4]
    intlist = [8, 16, 24, 32]

    if ipAddress.value != '':
        ipaddress = str(ipAddress.value).strip().split('/')[0]
        maskBits =  int(str(ipAddress.value).strip().split('/')[1])
        ipaddress = '.'.join(ipaddress.split('.')[0:cutOff[intlist.index(maskBits)]])
    else:
        ipaddress = ''

    print(color.BOLD + '\nStart time: ' + color.END + str(query_times_3.start) + '\n')
    print(color.BOLD + 'End time: '+ color.END + str(query_times_3.end) + '\n')

    print(color.BOLD + 'Ip Address entered: ' + color.END + str(ipAddress.value) + '\n')
    print(color.BOLD  + 'Request Uri entered: ' + color.END + str((requestURI.value).strip()) + '\n' )

    category = 'ApplicationGatewayAccessLog'
    ip_var = 'clientIP_s'
    host_var = 'host_s'
    uri = '\'' + (requestURI.value).strip() +  '\''
    kql_accessLogs = kql_query.format(hostname = host_var, startTime = query_times_3.start, endTime = query_times_3.end, host = hosts_2.value, category = category, ip = ip_var, ipaddress = ipaddress, uri = uri)
    df_rawAccessKustoQuery = showQuery(kql_accessLogs)
    print(category + ' (Raw) Data- \n')
    display(df_rawAccessKustoQuery.head(10))

    category = 'ApplicationGatewayFirewallLog'
    ip_var = 'clientIp_s'
    host_var = 'hostname_s'
    uri = '@' + '\'' + (requestURI.value).strip() +  '\''
    kql_firewallLogs = kql_query.format(hostname = host_var, startTime = query_times_3.start, endTime = query_times_3.end, host = hosts_2.value, category = category, ip = ip_var, ipaddress = ipaddress, uri = uri,)
    df_rawFirewallKustoQuery = showQuery(kql_firewallLogs)
    print(category + ' (Alert) Data- \n')
    display(df_rawFirewallKustoQuery.head(10))
    pd.reset_option('max_colwidth')

except Exception as e:
    print('Error: ' + str(e))
    traceback.print_exc()

Out[23]:

Start time: 2020-11-04 10:32:42.885697

End time: 2020-11-26 10:32:42.885697

Ip Address entered: 108.4.0.0/16

Request Uri entered: \\xcc\\xb2\\xcc\\x85]-1572603645543.jpg

<IPython.core.display.Javascript object>

ApplicationGatewayAccessLog (Raw) Data- 

<IPython.core.display.Javascript object>

ApplicationGatewayFirewallLog (Alert) Data- 

In [ ]: