Guided Hunting - Use Machine Learning to Detect Potential Low and Slow Password Sprays using Apache Spark via Azure Synapse
Notebook Version: 1.0
Python Version: Python 3.8
Required Packages: azureml-synapse, msticpy, azure-storage-file-datalake
Platforms Supported: Azure Machine Learning Notebooks connected to Azure Synapse Workspace
Data Source Required: Yes
Data Source: SigninLogs
Spark Version: 3.1 or above
Description
This guided hunting notebook leverages machine learning to tackle the difficult problem of detecting low and slow password spray campaigns. (This augments the broader-scoped password spray detection already provided via Microsoft’s Identity Protection integration for Sentinel.) We leverage the built-in parallelism of PySpark and MLlib (via the Azure Synapse linked service) to ingest, query and analyse data at scale.
Low and slow sprays are a variant on traditional password spray attacks that are being increasingly used by sophisticated adversaries. These adversaries can randomize client fields between each sign-in attempt, including IP addresses, user agents and client applications, and are often willing to let the password spray campaigns run at a very low frequency over a period of months or years, making detection very challenging. A key observation that we exploit in this notebook is the fact that, within a single campaign, attackers often randomize the same large number of properties simultaneously, resulting in a group of logins occurring periodically over a long period of time with the same set of anomalous properties.
This notebook runs through the following ML-driven approach to surfacing potential low and slow sprays. (For more details on the approach see the accompanying Microsoft Tech Community blog post: Microsoft Sentinel Blog - Microsoft Tech Community.)
Detect anomalous fields for each failed sign-in attempt using successful sign-ins as a baseline
Use ML to cluster failed sign-ins by the columns which were randomized/anomalous
Prune the clusters from the previous step based on knowledge of what a low and slow spray looks like; for example, by removing clusters in which sign-ins do not occur at a steady frequency over an extended period of time
Further analyze the candidate password spray clusters (using threat intelligence enrichments from msticpy, for example), to find any invariant properties within the clusters
Identify any successful sign-ins that follow the patterns observed for each cluster from the previous step and create Sentinel incidents as appropriate
Related MITRE ATT&CK techniques: Brute Force (T1110) - Password Spraying (T1110.003)
Pre-Requisites
This notebook also makes use of the Azure Synapse integration for Sentinel notebooks. To set up the Synapse integration, please use the notebook Configurate Azure ML and Azure Synapse Analytics.
Ensure that the bayespy ~= 0.5.22 Python package is installed on your Spark pool. You can do this by uploading a requirements.txt file as detailed in the docs.
Ensure that Sentinel SigninLogs data has been exported to an appropriate ADLS storage container. To export the necessary data, either:
Set up a continuous log export rule
Do a one-time export of historical data
Note: A walkthrough of the one-time export of historical log data is available in a TechCommunity blog post here: Export Historical Data from Log Analytics (microsoft.com).
The template notebook is available via the Sentinel UI or on GitHub: Export Historical Log Data (GitHub).
Python modules may need to be downloaded. Please run the cells sequentially to avoid errors. Please do not use "run all cells".
Table of Contents
1. Setup
2. Run ML on Azure Synapse Spark
3. Analyze Clusters on AML Compute
4. Create Sentinel Incidents
Conclusion
1. Setup
Install Packages
Note: Install the packages below only the first time you run this notebook, and restart the kernel once done.
Initialize msticpy
The nbinit module loads required libraries and optionally installs required packages.
Configure Azure ML and Azure Synapse Analytics
If you haven't previously set up the Synapse linked service for AzureML, please use the notebook, Configurate Azure ML and Azure Synapse Analytics, to do so. The notebook will configure an existing Azure Synapse workspace to create and connect to a Spark pool. You can then create a linked service and connect the AML workspace to the Azure Synapse workspace.
You will also need to ensure that the bayespy ~= 0.5.22 Python package is installed on your Spark pool. You can do this by uploading a requirements.txt file as detailed in the docs.
Authentication to Azure Resources
We now connect the AML workspace to the Azure Synapse workspace using the linked service.
Note: Specify the input parameters in the step below in order to connect to the Spark attached compute.
Start Spark Session
Enter your Synapse Spark compute below. To view details of available Spark computes in the AML UI, please follow these steps:
On the AML Studio left menu, navigate to Linked Services
Click on the name of the Linked Service you want to use
Select Spark pools tab
Note: The Python contexts for the AML notebook session and the Spark session are separate - this means that all Python variables defined using the %%synapse cell magic are not available in the AML notebook session, and vice-versa.
In order to work with months or years of data in an efficient, scalable way, we make use of Spark's native multi-executor parallelism. The code in this notebook will scale to any number of nodes, though the optimal performance-vs-cost balance will depend on the volume of your data - 10 executors may be a reasonable starting point. (See pricing details.)
Note: Make sure you have selected your Synapse Spark compute from the drop-down in the previous cell before running the cell below.
Now we start the Spark session with the configuration options selected above.
Note: You can also use the Synapse line/cell magic to start a session if you do not need to expand variables in your Spark configuration - e.g.
%synapse start -s $subscription_id -w $amlworkspace -r $resource_group -c $synapse_spark_compute. More details are here: RemoteSynapseMagics class - Azure Machine Learning Python | Microsoft Docs
2. Run ML on Azure Synapse Spark
Overview of ML Approach
Our novel ML approach begins with the observation that attackers often randomize the same large number of properties simultaneously, resulting in a group of logins occurring periodically over a long period of time with the same set of anomalous properties. Thus, we can attempt to cluster failed sign-ins (most password spray sign-in attempts will fail!) based on the set of properties that are anomalous.
We use a naive Bayes approach to estimate the likelihood of any given property value occurring for a legitimate sign-in and then use outlier detection to highlight unlikely values as being "anomalous". This gives a dataset in which the rows comprise a (failed) sign-in ID and boolean flags for each sign-in property denoting whether or not that property took an anomalous value. We model this scenario as a multivariate Bernoulli mixture model, and perform variational Bayesian inference to detect the presence of latent classes which will be our candidates for low and slow password spray campaigns. Later, we filter these candidate low and slow clusters by computing various statistics (such as the uniformity of the time-distribution of the sign-ins) and comparing these against what we would expect from a low and slow password spray.
For more details, see the accompanying Microsoft Tech Community blog post: Microsoft Sentinel Blog - Microsoft Tech Community.
The overall approach looks like this:
Detect anomalous fields for each failed sign-in attempt using successful sign-ins as a baseline
Cluster failed sign-ins by the columns which were randomized/anomalous
Prune the clusters from the previous step based on knowledge of what a low and slow spray looks like; for example, by removing clusters in which sign-ins do not occur at a steady frequency over an extended period of time
Further analyze the candidate password spray clusters (using threat intelligence enrichments from msticpy, for example), to find any invariant properties within the clusters
Identify any successful sign-ins that follow the patterns observed for each cluster from the previous step and create Sentinel incidents as appropriate
Having started the Spark session, we can run PySpark code by starting a cell with the %%synapse line magic.
Spark and MLlib are written with efficient parallelisation in mind, meaning that data ETL, analysis and ML are distributed by default, allowing for highly scalable workloads.
SPARK and MLlib References:
We start by importing the packages we will need for the ML into the current session.
Note: The Python contexts for the AML notebook session and the Spark session are separate - this means that Python packages imported using the %%synapse cell magic are not imported into the AML notebook session, and vice-versa.
Load Data
Fill in the location details for the ADLS container to which the Sentinel SigninLogs are exported.
We also specify how much data we want to use with the ML algorithm by specifying an end date and a number of lookback days. Keep in mind that low and slow password sprays take place over long periods (typically months or even years).
You will also need to ensure that sufficient historical log data is actually available in ADLS.
The information from the above cell is used to determine the ADLS paths for the data we want to load (based on the partition scheme used by the "continuous data export" tool in Sentinel).
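The path construction can be sketched in plain Python. This is a sketch under assumptions: the y=/m=/d=/h=/m= folder layout below follows the partitioning used by Sentinel's continuous data export, and the base path and file extension will depend on your container.

```python
from datetime import date, timedelta

def adls_day_globs(base_path, end_date, lookback_days):
    """Build one glob pattern per day in the lookback window; each day's
    glob matches all of that day's 5-minute partition files.
    NOTE: the y=/m=/d=/h=/m= layout is an assumption based on the
    continuous-export partition scheme - adjust to your container."""
    days = [end_date - timedelta(days=i) for i in range(lookback_days)]
    return [f"{base_path}/y={d:%Y}/m={d:%m}/d={d:%d}/h=*/m=*/*.json"
            for d in days]
```

The resulting glob list can then be handed to a single distributed read (e.g. spark.read.json), letting Spark parallelize across all matching partition files.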
Now we can read the data into a Spark dataframe. It is worth noting that, since the exported log data comprises a separate data file for each 5-minute partition, we may be reading from over 100,000 files. Therefore, you may wish to increase the maximum number of executors available to the Azure Synapse Spark session - this will allow this operation to be massively parallelized automatically, dramatically reducing time taken.
Feature Selection
Here, we also specify the columns that we want to read into the Spark dataframe. The list suggested below comprises some core sign-in properties - Id, UserPrincipalName, ResultType, TimeGenerated - and some additional properties (which we refer to as "features" for the ML).
The features below have been selected to help spot behaviors that make password sprays stand out, e.g.
Features (properties) that an attacker is able to randomise (e.g. IP addresses, location details, user agent-derived fields)
Features (properties) where the "normal" values are concealed from attackers (so are hard for an attacker to guess) (e.g. operating system (included in DeviceDetail), browser, city)
(Some features fall into both categories)
Data Wrangling using Spark
Filtering data
We start by filtering the data set by result types, keeping result types:
0 (successful sign in)
50055 (expired password)
50126 (incorrect username or password)
The latter two failure codes are the ones most commonly observed as part of password sprays.
See Azure AD Authentication and authorization error codes for more details.
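As a small illustration of this filter (in the notebook it is applied to the Spark dataframe, e.g. via df.filter(F.col("ResultType").isin(...))), here is a pure-Python sketch over a list of row dicts:

```python
# Result types kept for password-spray hunting. ResultType is a string
# column in SigninLogs, hence the string literals.
KEEP_RESULT_TYPES = {"0", "50055", "50126"}

def filter_result_types(rows):
    """Keep successful sign-ins plus the two failure codes most commonly
    seen in password sprays."""
    return [r for r in rows if r["ResultType"] in KEEP_RESULT_TYPES]
```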
Deduplication
Exported logs may occasionally contain a small amount of duplication either due to the way in which they are collected or due to the data export process (see data completeness for exported logs). In general, duplicate rows should be removed prior to analysis, but in some cases, you may decide to postpone or omit de-duplication if duplicated rows are unlikely to impact your detection logic (especially as de-duping can be a very expensive operation depending on the size of your dataframe and the number of columns that comprise a unique key).
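In Spark, de-duplication is a one-liner (df.dropDuplicates(key_cols)); the equivalent keep-first logic, sketched in plain Python for clarity:

```python
def dedupe(rows, key_cols=("Id",)):
    """Drop rows whose key columns match an earlier row (keep-first
    semantics, as with Spark's dropDuplicates)."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```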
Data Parsing and Extraction
In this step, we
Create a new column containing the IP prefix (if IP ASN is available, prefer to use this instead)
Extract the "browser", "displayName" and "operatingSystem" fields from the "DeviceDetail" JSON column
Extract the "city", "state", "longitude" and "latitude" fields from the "LocationDetails" JSON column
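A sketch of the parsing step (in Spark this is done with JSON-extraction functions such as from_json; field names follow the SigninLogs schema described above, and the sample values in the test are hypothetical):

```python
import json

def parse_signin(row):
    """Derive an IPv4 /24 prefix and flatten the DeviceDetail and
    LocationDetails JSON columns into top-level fields."""
    out = dict(row)
    # IPv4 /24 prefix; if IP ASN data is available, prefer that instead
    out["IpPrefix"] = ".".join(row["IPAddress"].split(".")[:3])
    device = row["DeviceDetail"]
    device = json.loads(device) if isinstance(device, str) else device
    for field in ("browser", "displayName", "operatingSystem"):
        out[field] = device.get(field)
    loc = row["LocationDetails"]
    loc = json.loads(loc) if isinstance(loc, str) else loc
    for field in ("city", "state", "longitude", "latitude"):
        out[field] = loc.get(field)
    return out
```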
Feature Encoding
We now one-hot encode our categorical features using Spark's MLlib (stringing together the StringIndexer transform followed by the OneHotEncoder transform).
First we use the StringIndexer class to map the categorical feature columns to columns of category indices. For each column, the indices run from 0 to the number of distinct values observed.
At this stage, we also split our dataframe into two: one containing successful sign-ins and the other containing failed sign-ins. Doing this here will be helpful later on.
Finally, we use OneHotEncoder class to convert our ordinal-encoded columns of category indices to one-hot binary vectors.
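Conceptually, the two MLlib transforms do the following (a minimal pure-Python sketch; Spark's StringIndexer additionally orders indices by descending frequency, which we mimic here):

```python
from collections import Counter

def string_index(values):
    """Map each distinct value to an index 0..K-1, most frequent first
    (mirroring StringIndexer's default frequency-descending ordering)."""
    mapping = {v: i for i, (v, _) in enumerate(Counter(values).most_common())}
    return [mapping[v] for v in values], mapping

def one_hot(index, size):
    """Convert a category index to a one-hot binary vector."""
    vec = [0] * size
    vec[index] = 1
    return vec
```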
Detect Anomalous Fields for Each Failed Sign-In
The first step is to apply anomaly detection to each column of each failed sign-in - we want to end up with a table that looks like this:
| Sign-in ID | Is Country anomalous? | Is City anomalous? | Is OS Anomalous? | Is Browser anomalous? | Is App Display Name anomalous? | etc. |
|---|---|---|---|---|---|---|
| 1 | True | True | False | False | False | ... |
| 2 | False | False | False | True | True | ... |
| 3 | False | True | True | False | False | ... |
We model each of our features as categorical random variables with the categories being the set of unique values observed from all sign in attempts (both successful and failed). We then use Bayesian parameter estimation with the set of successful sign ins to learn the true distributions for "good” (i.e. non-malicious) sign-in attempts. (Since we obviously don’t have perfect good vs. malicious labels for all sign-ins, we are using successful sign-ins as a proxy for good sign-ins).
Mathematically, we model the features as independent categorical variables with symmetric Dirichlet priors with concentration parameter $\alpha$. This leads us to estimate the probability of feature $j$ taking value $v$ as

$$\hat{p}_j(v) = \frac{n_{j,v} + \alpha}{N + \alpha K_j}$$

where $n_{j,v}$ is the number of times that feature $j$ takes the value $v$ in the dataset of successful sign-ins, $N$ is the total number of successful sign-ins, and $K_j$ is the number of available categories of feature $j$ (as observed from both successful and failed sign-ins).
Here, $\alpha$ acts as a smoothing parameter (see Additive/Laplace smoothing) - increasing $\alpha$ will cause the algorithm to classify fewer values as being anomalous (in particular, values which haven't been observed in successful sign-ins are less likely to be classed as anomalous).
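The estimator described above is just additive smoothing, e.g.:

```python
def smoothed_prob(count, n_success, n_categories, alpha=1.0):
    """Posterior-mean probability of a feature value under a symmetric
    Dirichlet(alpha) prior: (count + alpha) / (N + alpha * K)."""
    return (count + alpha) / (n_success + alpha * n_categories)
```

With alpha = 1, a value never seen in successful sign-ins still gets probability 1 / (N + K) rather than zero; larger alpha pulls all estimates toward the uniform distribution, so fewer values look anomalous.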
Setting an Anomaly Threshold on Probabilities
When determining whether values are anomalous, we can't just set a static threshold on the estimated probabilities (i.e. if the likelihood of a value is less than some fixed threshold, class it as an anomaly) - what constitutes a good threshold will depend on the distribution of the observed values for that feature. For example, suppose we observe 20 different cities (derived from GeoIP data), and 95% of successful sign-ins are from city 1, 4% are from city 2, and 1% are from cities 3 - 20. Then, we would probably want to class cities 2 - 20 as anomalous. Now suppose that we instead observed the following distribution: 24% of successful sign-ins are from city 1, and 4% of successful sign-ins are from each of cities 2 - 20. In this case, we would not want to class cities 2 - 20 as being anomalous even though they are below the same threshold as in the first scenario, since this would mean saying that 76% of all sign-ins had an anomalous sign-in location - this would make for a very noisy approach!
Instead, we set thresholds dynamically on a per-feature basis by using basic outlier detection - specifically, we flag a value $v$ of feature $j$ as anomalous when $\log \hat{p}_j(v)$ is more than $t$ standard deviations below the mean (the default value of $t$ can be tuned). This is equivalent to standard-scaling the column of log-probabilities before using a static threshold. (This threshold can also be given an information-theoretic interpretation, since, for example, the mean of $-\log \hat{p}_j$ is just the entropy of feature $j$.)
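A sketch of the dynamic thresholding, assuming the smoothed per-value probabilities have already been estimated (the function and parameter names are illustrative, not from the notebook):

```python
import math
from statistics import mean, stdev

def anomalous_flags(observed_values, value_probs, t=2.0):
    """Flag sign-ins whose observed value for this feature has a
    log-probability more than t standard deviations below the mean
    (i.e. standard-scale the column of log-probabilities, then apply
    a static threshold of -t)."""
    logp = [math.log(value_probs[v]) for v in observed_values]
    mu, sigma = mean(logp), stdev(logp)
    return [lp < mu - t * sigma for lp in logp]
```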
First we run the anomalous feature detection algorithm described above - this produces a dataframe in which the rows comprise a (failed) sign-in ID and boolean flags for each sign-in property denoting whether or not that property took an anomalous value.
Now we collapse our binary "IsAnom_*" columns into a single column of binary vectors representing which features are anomalous for each failed sign-in. (This restructuring of the data will be more convenient for later analysis.)
Cluster Failed Sign-Ins
The core hypothesis for this detection algorithm is that the distribution of anomalous features looks very different depending on how the sign-in was generated - in particular, sign-ins from a password spray campaign in which attackers use tooling to spoof multiple sign-in properties will have a distinctive "fingerprint" of features that are often anomalous together.
For example, suppose that all failed sign-in attempts come from three sources: legitimate user error, password spray campaign 1 and password spray campaign 2. For each of these classes, the probability of a given feature being anomalous may look like this:
| Source | P(feature 1 anomalous) | P(feature 2 anomalous) | P(feature 3 anomalous) | P(feature 4 anomalous) | P(feature 5 anomalous) | etc. |
|---|---|---|---|---|---|---|
| Legitimate user error | 0.02 | 0.1 | 0.01 | 0.2 | 0.25 | ... |
| PW Spray 1 | 0.7 | 0.95 | 0.6 | 0.9 | 0.85 | ... |
| PW Spray 2 | 0.1 | 0.8 | 0.1 | 0.6 | 0.8 | ... |
From the hypothetical probabilities in the table, we can see that, for each class of sign-ins, the set of features which are usually anomalous forms a fingerprint for the class.
Obviously, in practice, the sources of sign-ins are latent variables - i.e. they cannot be observed directly. Instead, we work backwards from our dataset of failed sign-ins and associated anomalous features to try to detect the latent classes and associated probabilities for each feature taking an anomalous value. From our hypothesis, we hope that, if a password spray campaign is present in our data, it will correspond to one of the detected clusters of failed sign-ins.
Mathematically, we do this by modelling our dataset as being generated from a Bernoulli mixture model. We then perform variational Bayesian inference to try to detect the presence of latent classes.
In the next cell, we use the bayespy Python package to set up the Bernoulli mixture model and run variational Bayesian inference - see Bernoulli mixture model — BayesPy Documentation.
Notes:
Set the number of clusters to look for. The true number of groups is unknown to us, so we use an upper bound for the number of clusters we expect to be present (10 is a reasonable number to start with) - the algorithm may assign 0 weight to some clusters if this is too large.
This step is not deterministic - rerunning may give slightly different clusterings! If this causes issues, we can simply re-run the variational Bayesian inference multiple times and select the model with the highest ELBO value.
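As an illustrative stand-in for the bayespy model, the mixture fit can be sketched with a plain EM algorithm in numpy. Note this is EM point estimation rather than the variational Bayesian inference used in the notebook, so it does not automatically down-weight empty clusters; re-running with different seeds and keeping the best fit mirrors the ELBO-selection trick described above.

```python
import numpy as np

def bernoulli_mixture_em(X, k, n_iter=200, seed=0):
    """Fit a multivariate Bernoulli mixture to a binary matrix X (rows =
    failed sign-ins, columns = is-feature-anomalous flags) with EM.
    Returns mixture weights, per-cluster Bernoulli parameters, and the
    responsibility (soft assignment) matrix."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(k, 1.0 / k)
    probs = rng.uniform(0.25, 0.75, size=(k, d))  # per-cluster P(feature anomalous)
    for _ in range(n_iter):
        # E-step: responsibilities from log-likelihoods (numerically stable)
        log_lik = (X[:, None, :] * np.log(probs + 1e-9)
                   + (1 - X[:, None, :]) * np.log(1 - probs + 1e-9)).sum(axis=2)
        log_post = np.log(weights + 1e-9) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights and Bernoulli parameters
        nk = resp.sum(axis=0)
        weights = nk / n
        probs = (resp.T @ X + 1e-9) / (nk[:, None] + 2e-9)
    return weights, probs, resp
```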
Visualize Clusters
We use Hinton diagrams to visually represent the learned clusters. The areas of the filled squares represent probabilities (and non-filled squares are used to show uncertainty).
The first diagram shows the probabilities that a randomly selected failed sign-in will be assigned to each cluster by our model (the areas of the squares are proportional to the cluster assignment probabilities).
In the second diagram, columns represent clusters and rows represent features, so, for example, a large white square in the 2nd column, 4th row would indicate that failed logins in cluster #2 are likely to have an unusual value for feature #4.
Prune Clusters
We first use our learned model to assign each failed sign-in to a cluster along with the associated probability of the sign-in belonging to the cluster.
Now we do some pruning of the learned clusters to remove those which are unlikely to represent the type of password spray activity we are looking for.
First we set a confidence threshold to prune failed logins included in each cluster (intra-cluster pruning). We then prune clusters by
Setting a minimum size for clusters of interest
Setting a minimum threshold on the number of features consistently taking anomalous values within a cluster
Note: The thresholds to use will depend very much on the data on which the algorithm is being run; start low, and increase the thresholds if results are too noisy.
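The pruning logic can be sketched as follows; the function name, threshold names and default values are illustrative, not values from the notebook:

```python
from collections import defaultdict

def prune_clusters(assignments, min_confidence=0.9, min_size=50,
                   min_anom_features=3, feature_rate_threshold=0.8):
    """assignments: iterable of (cluster_id, confidence, anomaly_vector).
    Intra-cluster pruning drops low-confidence members; clusters are then
    kept only if they are large enough and have enough features that are
    anomalous for most of their members."""
    members = defaultdict(list)
    for cluster_id, confidence, vec in assignments:
        if confidence >= min_confidence:      # intra-cluster pruning
            members[cluster_id].append(vec)
    kept = {}
    for cluster_id, vecs in members.items():
        if len(vecs) < min_size:              # minimum cluster size
            continue
        consistent = sum(
            1 for j in range(len(vecs[0]))
            if sum(v[j] for v in vecs) / len(vecs) > feature_rate_threshold)
        if consistent >= min_anom_features:   # consistently anomalous features
            kept[cluster_id] = vecs
    return kept
```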
We now have our candidate low and slow password spray campaigns! These campaigns/clusters will be further pruned when we use msticpy for specific analysis and TI enrichment of these campaigns.
Export Results to ADLS
At this point, we have all the data that we need from the big data analytics and ML steps using Spark, and can write the data back to the data lake before stopping the Spark session to minimize compute cost. This will allow the data to be read into the AML notebook context, where we further enrich, analyze and visualize these outputs before writing results back to Sentinel.
The following outputs will be persisted:
Full SigninLogs rows for candidate password spray sign-ins
Aggregated sign-in timestamps - these will be used for some time series visualizations using msticpy
Aggregated sign-in locations - these will be used for geo-plotting using msticpy
Various "baseline" statistics - these will be used as part of reporting back to Sentinel
Sample of successful sign-ins - this will be used in MSTICPy visualizations and as part of reporting back to Sentinel
Each of the above outputs will be saved as a single json file in ADLS.
Export Candidate Password Sprays
Export Aggregated Data/Statistics
Export Baseline Sample
Stop Spark Session
3. Analyze Clusters on AML Compute
Export results from ADLS to local filesystem
Load the Data from ADLS
In the sections below, we provide the input details for the ADLS account and then use helper functions to connect, list contents and download the results locally.
If you need help locating the input details, follow these steps:
Go to the https://web.azuresynapse.net and sign in to your workspace.
In Synapse Studio, click Data, select the Linked tab, and select the container under Azure Data Lake Storage Gen2.
Navigate to the folder in the container, right-click and select Properties.
Copy the ABFSS path, extract the details and map them to the input fields.
You can check the View account access keys doc to find and retrieve your storage account keys for the ADLS account.
Warning: If you are storing secrets such as storage account keys in the notebook, you should store them either in the msticpyconfig file on the compute instance or in Azure Key Vault.
Read more about using Key Vault in the MSTICPy docs.
Analyze Clusters Using MSTICPy
Having used big data analytics and ML to reduce our SigninLogs data to a handful of candidate low and slow password spray clusters, we are now ready to investigate each of the generated clusters.
The two broad questions to try to answer at this stage are:
Do the clusters represent likely (low and slow) password spray activity?
Do the clusters exhibit any distinctive properties that will aid with remediation and/or attribution? (E.g. Do the sign-ins all use an unusual user agent that could be blocked?)
This information can be added to incidents written back to Sentinel.
In the following section, we use MSTICPy's built-in security analytics tools to better understand each cluster. We only present a few general techniques here - your investigation may lead you down a different route.
Visualize Clusters
The candidate low and slow password spray clusters have been generated based on the mix of features which are typically anomalous. We can plot charts for each candidate cluster showing, for each sign-in property/feature:
The number of sign-ins where that property/feature is anomalous
The "variability" of that property/feature
Together, these two properties "fingerprint" each cluster and can inform the direction of further hunting. For example, suppose a cluster is characterised by its sign-ins having anomalous "ClientAppUsed" and "Location" properties, and suppose that the "variability" of these properties is low within the cluster. This indicates that a relatively small number of anomalous client apps / sign-in locations are being used, which means that there is potential to write a rule-based detection on these static anomalous values.
Times Series Analysis
A good indicator of low and slow password spray-like activity is a regular pattern in the times of the candidate sign-ins. Although threat actors add some random noise to the schedule on which password spray sign-in attempts occur, when viewed as a whole, there is often still a distinctive uniformity to the time series of sign-in attempts as attackers endeavour to avoid lock-out.
In order to test the sign-ins in each of our candidate clusters for "uniform spread" over time, we perform a Kolmogorov-Smirnov goodness-of-fit test against a uniform distribution. The output value will be between 0 and 1, with values closer to zero indicating that the sign-in times are unlikely to have been generated from a uniform distribution.
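In practice this can be a single call to scipy.stats.kstest; the sketch below implements the same test with only the standard library, using the asymptotic Kolmogorov distribution (so treat small-sample p-values as approximate):

```python
import math

def ks_uniform_pvalue(timestamps):
    """Approximate p-value for a one-sample KS test of the sign-in times
    against a uniform distribution over the observed window."""
    ts = sorted(timestamps)
    n = len(ts)
    span = ts[-1] - ts[0]
    if span == 0:
        return 0.0                      # all sign-ins at a single instant
    scaled = [(t - ts[0]) / span for t in ts]
    # KS statistic: largest gap between the empirical and uniform CDFs
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(scaled))
    # Asymptotic Kolmogorov survival function with Stephens' correction
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    p = 2 * sum((-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
                for j in range(1, 101))
    return min(1.0, max(0.0, p))
```

Evenly spread sign-in times give a value near 1; bursty times give a value near 0.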
Similarly, normal sign-in activity will exhibit distinctive day/week seasonality which we can check for in our candidate low and slow password spray clusters.
Timeseries Plots
The statistics above do not capture the many different patterns we might see in attacker behaviour (especially as attackers use increasingly sophisticated techniques to avoid detection). A time-plot visualization can highlight patterns not captured by the analytics.
Common things to look out for:
Sign-in attempts spread out fairly uniformly over time.
Lack of day/week/month seasonal patterns
Sign-ins on particularly unusual days (e.g. public holidays)
You can also modify the plot below to show just the sign-in day-of-week or hour-of-day.
Sign-In Location Analysis
We can use msticpy's visualisation libraries to plot locations on a map. This can be particularly useful when looking at the distribution of anomalous sign-in attempts.
You may see the GeoLite driver downloading its database the first time you run this.
We use two plots to answer two questions in this section:
Are sign-in attempts generally from unusual locations as compared to the baseline successful sign-ins?
Can we learn anything more specific about where sign-ins for each clusters are coming from?
Threat Intelligence Enrichment
In this step, we can perform threat intelligence lookups using msticpy and open-source TI providers such as IBM Xforce, VirusTotal, Greynoise etc. The examples below show TI lookups on a single IP as well as a bulk lookup on all IPs using the IBM Xforce TI provider.
You will need to register with IBM Xforce and enter the API keys into msticpyconfig.yaml
- More details are shown in the A Tour of Cybersec notebook features notebook
- Threat Intel Lookups in MSTICPy
- To learn more about adding TI sources, see the TI Provider setup in the A Getting Started Guide For Microsoft Sentinel ML Notebooks notebook
Whois registration enrichment
In this step, we can perform a WhoIs lookup on all public IPs and populate additional information such as the ASN. You can use this output to further filter known ASNs from the results.
Other
There is a lot more data available in the SigninLogs table that we haven't looked at. Using the MSTICPy DataViewer control below, you can interactively inspect your raw data to see if anything stands out.
Every security investigation is different, and will depend heavily on your data and environment. There are many more tools (including those in MSTICPy) that you may wish to use to further your investigation. Take a look at our guided hunting blog post and the MSTICPy notebook examples.
4. Create Sentinel Incidents
To help security analysts respond to these candidate password spray events, we create custom incidents in the Sentinel workspace.
MSTICPy has built-in support for reading from, and writing to, Microsoft Sentinel. Using the provided API, we first create a single incident indicating potential low and slow password spray activity. We then add comments to the incident giving details of each candidate campaign, including details of machines affected. This makes it easy for security analysts to make use of the outputs of this ML notebook and take further action as appropriate.
You may wish to modify the structure of the incidents written back to Sentinel based on your team's workflow.
Conclusion
Due to the nature of low and slow password sprays, we needed to start our hunting on very large datasets of historical sign-in logs. The sheer scale of the data made Spark a great tool for easily performing distributed data operations at scale. We then executed several analytical queries to surface series of failed sign-in attempts with high IP volatility based on known patterns used by attackers. To analyze this data further, we used msticpy's data enrichment and visualization capabilities.
Analysts can perform further investigation and can then create incidents in Microsoft Sentinel and track investigations in Sentinel. Details of possible next steps to take are in the accompanying Microsoft Tech Community blog post: Microsoft Sentinel Blog - Microsoft Tech Community. For more information on hunting and incident response playbooks for password sprays, please see Password spray investigation | Microsoft Docs.