Sessionize, Model and Visualise Office Exchange Data
Notebook Version: 1.0
Python Version: Python 3.6 (including Python 3.6 - AzureML)
Required Packages: msticpy, pandas, kqlmagic
Data Sources Required:
Log Analytics - OfficeActivity
Configuration Required:
This Notebook presumes you have your Microsoft Sentinel Workspace settings configured in a config file. If you do not have this in place, please read the docs and use this notebook to test.
Description:
Various types of security logs can be broken up into sessions/sequences where each session can be thought of as an ordered sequence of events. It can be useful to model these sessions in order to understand what the usual activity is like so that we can highlight anomalous sequences of events.
In this hunting notebook, we treat the Office Exchange PowerShell cmdlets ("Set-Mailbox", "Set-MailboxFolderPermission" etc) as "events" and then group the events into "sessions" on a per-user basis. We demonstrate the sessionizing, modelling and visualisation on the Office Exchange Admin logs, however the methods used in this notebook can be applied to other log types as well.
A new subpackage called anomalous_sequence was recently added to msticpy. This library allows the user to sessionize, model and visualize their data via some high-level functions. For more details on how to use this subpackage, please read the docs and/or refer to this more documentation-heavy notebook. The documentation for this subpackage also includes suggested guidance on how this library can be applied to some other log types.
High level sections of the notebook:
Sessionize your Office Exchange logs data using built-in KQL operators
Use the anomalous_sequence subpackage of msticpy to model the sessions
Use the anomalous_sequence subpackage of msticpy to visualize the scored sessions
Table of Contents
Notebook initialization
The next cell:
Checks for the correct Python version
Checks versions and optionally installs required packages
Imports the required packages into the notebook
Sets a number of configuration options
This should complete without errors. If you encounter errors or warnings, please look at the following two notebooks:
Create Sessions from your Office Exchange logs
What is a Session?
In this context, a session is an ordered sequence of events/commands. The anomalous_sequence subpackage can handle 3 different formats for each of the sessions:
sequence of just events/commands. ["Set-User", "Set-Mailbox"]
sequence of events/commands with accompanying parameters. [Cmd(name="Set-User", params={"Identity", "Force"}), Cmd(name="Set-Mailbox", params={"Identity", "AuditEnabled"})]
sequence of events/commands with accompanying parameters and their corresponding values. [Cmd(name="Set-User", params={"Identity": "blahblah", "Force": "true"}), Cmd(name="Set-Mailbox", params={"Identity": "blahblah", "AuditEnabled": "false"})]
The Cmd datatype can be accessed from msticpy.analysis.anomalous_sequence.utils.data_structures
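As a concrete sketch, here is what each of the three formats looks like. The Cmd class below is a minimal stand-in defined purely for illustration; in the notebook you would instead import the real Cmd from msticpy.analysis.anomalous_sequence.utils.data_structures:

```python
# Minimal stand-in for msticpy's Cmd datatype, for illustration only.
# In the notebook use:
#   from msticpy.analysis.anomalous_sequence.utils.data_structures import Cmd
class Cmd:
    def __init__(self, name, params):
        self.name = name      # the cmdlet, e.g. "Set-Mailbox"
        self.params = params  # a set (params only) or a dict (params + values)

# (1) commands only: a plain list of strings
session = ["Set-User", "Set-Mailbox"]

# (2) commands with parameters: params is a set
param_session = [
    Cmd(name="Set-User", params={"Identity", "Force"}),
    Cmd(name="Set-Mailbox", params={"Identity", "AuditEnabled"}),
]

# (3) commands with parameters and values: params is a dict
param_value_session = [
    Cmd(name="Set-User", params={"Identity": "blahblah", "Force": "true"}),
    Cmd(name="Set-Mailbox", params={"Identity": "blahblah", "AuditEnabled": "false"}),
]
```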
How will we sessionize the data?
We discuss two possible approaches:
Use the sessionize module from msticpy's anomalous_sequence subpackage
Sessionize directly inside your KQL query to retrieve data from Log Analytics
In this notebook, we use the second approach (KQL) to sessionize the Office Exchange logs. In order to do the sessionizing using KQL, we make use of the row_window_session function.
However, if you are interested in using msticpy's sessionizing capabilities, then please read the docs and/or refer to this more documentation-heavy notebook.
Use Kusto to Sessionize your Logs Data
The cell below contains a Kusto query which queries the OfficeActivity table. In this example, we want the sessions to be on a per UserId - ClientIp basis. In addition, we require that each session be no longer than 20 minutes in total, with consecutive commands no more than 2 minutes apart. (These thresholds are somewhat arbitrary and can be adjusted for different datasets and use cases.)
Here are some high level steps to the query:
Add a time filter which goes back far enough so you have enough data to train the model.
Filter to the desired type of logs.
Exclude some known automated users (optional)
Sort the rows by UserId, ClientIp, TimeGenerated in ascending order
Use the native KQL function row_window_session to create an additional "begin" column to aid creating the sessions
Summarize the commands (and optionally parameters) by UserId, ClientIp, begin
Optionally exclude sessions which have only 1 command
Note that in KQL, comments are made using //
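Putting those steps together, a query could be sketched as follows. The RecordType filter, the automated-user exclusion and the 20m/2m thresholds are illustrative assumptions and should be adapted to your workspace; check the column names against your OfficeActivity schema:

```kql
OfficeActivity
| where TimeGenerated >= ago(60d)                      // 1. enough history to train on
| where RecordType == "ExchangeAdmin"                  // 2. desired log type
| where UserId !contains "NT AUTHORITY"                // 3. exclude known automated users (optional)
| sort by UserId asc, ClientIp asc, TimeGenerated asc  // 4. order rows within each user/IP
| extend begin = row_window_session(                   // 5. stamp each row with its session start:
    TimeGenerated, 20m, 2m,                            //    20 min max session, 2 min max gap
    UserId != prev(UserId) or ClientIp != prev(ClientIp))
| summarize cmds = make_list(Operation),               // 6. one row per session
            params = make_list(Parameters),
            end = max(TimeGenerated)
    by UserId, ClientIp, begin
| where array_length(cmds) > 1                         // 7. drop single-command sessions (optional)
```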
Convert Sessions to Correct Format for the Model
Recall the allowed session types here
So let's see what needs to be done to the sessions_df.
The "cmds" column is already in a suitable format of type (1). This is because it is a list of strings. However, if you are interested in including the parameters (and possibly the values) in the modelling stage, then we need to make use of the Cmd datatype.
In particular, we need to define a custom cleaning function which will transform the "params" column slightly to become a list of the Cmd datatype. This cleaning function is specific to the format of the exchange demo data. Therefore, you may need to tweak it slightly before you can use it on other data sets.
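As a sketch, a cleaning function along these lines pairs each command with its parsed parameters. The JSON shape assumed here ([{"Name": ..., "Value": ...}, ...]) is illustrative and should be adjusted to match your data; the Cmd class is again a stand-in for the one in msticpy.analysis.anomalous_sequence.utils.data_structures:

```python
import json

# Stand-in for msticpy's Cmd datatype (see
# msticpy.analysis.anomalous_sequence.utils.data_structures).
class Cmd:
    def __init__(self, name, params):
        self.name = name
        self.params = params

def clean_session(cmds, raw_params):
    """Pair each command with its parameters as a list of Cmd objects.

    `raw_params` is assumed to hold one JSON string per command, shaped
    like [{"Name": ..., "Value": ...}, ...]. Tweak the parsing for other
    data sets.
    """
    session = []
    for cmd, params_json in zip(cmds, raw_params):
        params = {p["Name"]: str(p["Value"]) for p in json.loads(params_json)}
        session.append(Cmd(name=cmd, params=params))
    return session

cleaned = clean_session(
    ["Set-Mailbox"],
    ['[{"Name": "Identity", "Value": "blahblah"},'
     ' {"Name": "AuditEnabled", "Value": "false"}]'],
)
```

In the notebook this would typically be applied row-wise, e.g. sessions_df.apply(lambda row: clean_session(row.cmds, row.params), axis=1).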
Model the sessions
We will give a brief description of how the modelling works under the hood for each of the three session types.
Commands only
We treat the sessions as an ordered sequence of commands.
We apply the Markov Assumption where we assume each command depends only on the command immediately before it.
This means the likelihood of each session can be computed by multiplying a sequence of transition probabilities together.
We use a sliding window (e.g. of length 3) throughout each session and then use the likelihood of the rarest window as the score for the session.
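A stripped-down sketch of this scoring scheme (ignoring the start/end tokens and probability smoothing that a production model would also need) could look like:

```python
from collections import defaultdict

def train_transition_probs(sessions):
    """Estimate P(next_cmd | cmd) from a corpus of sessions (Markov assumption)."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for prev_cmd, cmd in zip(session, session[1:]):
            counts[prev_cmd][cmd] += 1
    probs = {}
    for prev_cmd, nxt in counts.items():
        total = sum(nxt.values())
        probs[prev_cmd] = {cmd: n / total for cmd, n in nxt.items()}
    return probs

def window_likelihood(window, probs):
    """Likelihood of one window = product of its transition probabilities."""
    likelihood = 1.0
    for prev_cmd, cmd in zip(window, window[1:]):
        likelihood *= probs.get(prev_cmd, {}).get(cmd, 0.0)
    return likelihood

def score_session(session, probs, window_len=3):
    """Score = likelihood of the rarest sliding window in the session."""
    windows = [session[i:i + window_len]
               for i in range(len(session) - window_len + 1)]
    if not windows:
        return float("nan")  # session too short for this window length
    return min(window_likelihood(w, probs) for w in windows)
```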
Commands with Parameters
All of the above ("commands only" case) except for one difference.
This time, we include the parameters in the modelling.
We make the assumption that the presence of each parameter is independent conditional on the command.
We therefore model the presence of the parameters as independent Bernoulli random variables (conditional on the command)
So to compute the likelihood of a session, each transition probability (of the commands) will be accompanied by a product of probabilities (for the parameters).
A subtlety to note is that we take the geometric mean of the product of parameter probabilities. This is so we don't penalise commands which happen to have more parameters set than on average.
We use the same sliding window approach used with the "commands only" case.
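As a sketch, the parameter contribution for a single command could be computed like this, where param_probs is a hypothetical mapping of learned Bernoulli probabilities:

```python
def param_factor(cmd, observed_params, param_probs):
    """Geometric mean of P(param set | cmd) over the observed parameters.

    Using the geometric mean rather than the raw product means a command
    is not penalised merely for taking more parameters than average.
    """
    if not observed_params:
        return 1.0
    product = 1.0
    for param in observed_params:
        # probability that this parameter is set when `cmd` is run
        product *= param_probs.get(cmd, {}).get(param, 0.0)
    return product ** (1.0 / len(observed_params))

# hypothetical learned probabilities, for illustration
param_probs = {"Set-Mailbox": {"Identity": 0.9, "AuditEnabled": 0.4}}
factor = param_factor("Set-Mailbox", {"Identity", "AuditEnabled"}, param_probs)
```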
Commands with Parameters and their Values
All of the above ("commands with parameters" case) except for one difference.
This time, we include the values in the modelling.
Some rough heuristics are used to determine which parameters have values which are categorical (e.g. "true" and "false" or "high", "medium" and "low") vs values which are arbitrary strings (such as email addresses). There is the option to override the "modellable_params" directly in the Model class.
So to compute the likelihood of a session, each transition probability (of the commands) will be accompanied by a product of probabilities (for the parameters and categorical values).
We use the same sliding window approach used with the "commands only" case.
Important note:
If you set the window length to be k, then only sessions which have at least k-1 commands will have a valid (not np.nan) score. The reason for the -1 is that we append an end token to each session by default, so a session of length k-1 gets treated as length k during the scoring.
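A quick illustration of why the threshold is k-1 rather than k (the end-token name below is made up for illustration):

```python
window_len = 3                         # k
session = ["Set-User", "Set-Mailbox"]  # k - 1 = 2 commands
with_end = session + ["##END##"]       # end token appended -> length k

# sliding windows of length k over the token-extended session
windows = [with_end[i:i + window_len]
           for i in range(len(with_end) - window_len + 1)]
```

The two-command session yields exactly one window of length 3, so it still gets a valid score; a one-command session would yield no windows and hence a score of np.nan.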
There are 3 high level functions available in this library
score_sessions
visualise_scored_sessions
score_and_visualise_sessions
We will demonstrate the usage of the first two functions, but the "score_and_visualise_sessions" function can be used in a similar way.
If you want to see more detail about any of the arguments to the functions, you can simply run: help(name_of_function)
We will first demonstrate the high level function for modelling the sessions.
We will do this for the "Commands with Parameters and their Values" session type.
But because we created columns for all three session types, you can set the "session_column" parameter in the "score_sessions" function below to any of the following:
session
param_session
param_value_session
Now we demonstrate the visualization component of the library
We do this using the "visualise_scored_sessions" function. This function returns an interactive timeline plot which allows you to zoom into different sections etc.
The time of the session will be on the x-axis.
The computed likelihood metric will be on the y-axis.
Lower likelihoods correspond to rarer sessions.
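Putting the pieces together, the two calls could be sketched as below. The function names follow the msticpy anomalous_sequence documentation, but the column names used here ("begin" from the sessionizing query, and the rarest_window3_* output columns produced for window_length=3) are assumptions for this sketch; run help(anomalous.score_sessions) and help(anomalous.visualise_scored_sessions) to confirm the arguments against your msticpy version.

```python
from msticpy.analysis.anomalous_sequence import anomalous

scored_sessions = anomalous.score_sessions(
    data=sessions_df,
    session_column="param_value_session",  # or "session" / "param_session"
    window_length=3,
)

anomalous.visualise_scored_sessions(
    data_with_scores=scored_sessions,
    time_column="begin",                       # session start time (x-axis)
    score_column="rarest_window3_likelihood",  # computed likelihood (y-axis)
    window_column="rarest_window3",            # rarest window, shown on hover
    source_columns=["UserId", "ClientIp"],     # extra hover details
)
```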
Important note:
During the scoring/modelling stage, if you set the window length to be k, then only sessions which have at least k-1 commands will appear in the interactive timeline plot. This is because sessions with fewer than k-1 commands will have a score of np.nan. The reason for the -1 is that we append an end token to each session by default, so a session of length k-1 gets treated as length k during the scoring.