msticpy - anomalous_sequence subpackage
Various types of security logs can be broken up into sessions/sequences where each session can be thought of as an ordered sequence of events. It can be useful to model these sessions in order to understand what the usual activity is like so that we can highlight anomalous sequences of events.
A new subpackage called anomalous_sequence has recently been added to msticpy. This library allows the user to sessionize, model, and visualize their data via a high-level interface.
This notebook demonstrates sessionizing, modelling, and visualising some Office Exchange Admin logs from one of our demo tenants. However, there is a section at the end which demonstrates how some other log types can be sessionized as well.
What is a Session?
In this context, a session is an ordered sequence of events/commands. The anomalous_sequence subpackage can handle 3 different formats for each of the sessions:
sequence of just events/commands. e.g. ["Set-User", "Set-Mailbox"]
sequence of events/commands with accompanying parameters. e.g. [Cmd(name="Set-User", params={"Identity", "Force"}), Cmd(name="Set-Mailbox", params={"Identity", "AuditEnabled"})]
sequence of events/commands with accompanying parameters and their corresponding values. e.g. [Cmd(name="Set-User", params={"Identity": "blahblah", "Force": "true"}), Cmd(name="Set-Mailbox", params={"Identity": "blahblah", "AuditEnabled": "false"})]
The Cmd datatype can be accessed from msticpy.analysis.anomalous_sequence.utils.data_structures
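As a quick illustration of the three formats, here is a runnable sketch. A simple namedtuple is used as a stand-in for the library's Cmd datatype so that the snippet is self-contained; in practice you would import Cmd from the path above.

```python
from collections import namedtuple

# Minimal stand-in mirroring the library's Cmd datatype (name + params),
# used here purely for illustration.
Cmd = namedtuple("Cmd", ["name", "params"])

# 1. Commands only: a list of strings.
session_cmds = ["Set-User", "Set-Mailbox"]

# 2. Commands with parameters: params is a set of parameter names.
session_params = [
    Cmd(name="Set-User", params={"Identity", "Force"}),
    Cmd(name="Set-Mailbox", params={"Identity", "AuditEnabled"}),
]

# 3. Commands with parameters and values: params is a dict of name -> value.
session_values = [
    Cmd(name="Set-User", params={"Identity": "blahblah", "Force": "true"}),
    Cmd(name="Set-Mailbox", params={"Identity": "blahblah", "AuditEnabled": "false"}),
]

print(len(session_cmds), session_params[0].name, session_values[1].params["AuditEnabled"])
```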
If you are only interested in modelling the commands (without the accompanying parameters), then you could skip the next three cells and go straight to the sessionizing.
The reason for this is that each session is allowed to be either a list of strings or a list of the Cmd datatype, and the "Operation" column is already a string.
However, if you are interested in including the parameters (and possibly the values), then you need the next two cells.
We need to define a custom cleaning function which will combine the "Operation" and "Parameters" columns and convert them into one of the allowed types. This cleaning function is specific to the format of the exchange demo data which we have read in. Therefore, you may need to tweak it before you can use it on other data sets.
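As a rough sketch of what such a cleaning function might look like: the raw row format assumed here (a "Parameters" value that is a list of {"Name": ..., "Value": ...} dicts) is hypothetical and will likely need tweaking for your own data.

```python
from collections import namedtuple

Cmd = namedtuple("Cmd", ["name", "params"])  # stand-in for msticpy's Cmd

def clean_row(operation, parameters):
    """Combine an Operation string and its raw Parameters into a Cmd.

    Assumes `parameters` is a list of {"Name": ..., "Value": ...} dicts;
    tweak this for the schema of your own data set.
    """
    params = {p["Name"]: str(p["Value"]) for p in parameters}
    return Cmd(name=operation, params=params)

row_params = [{"Name": "Identity", "Value": "blahblah"},
              {"Name": "Force", "Value": True}]
cmd = clean_row("Set-User", row_params)
print(cmd.name, cmd.params["Force"])
```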
Use the sessionize_data function
We will do this for the first session type (with just commands).
But because we created columns for all three session types, you can set the "event_col" parameter in the "sessionize_data" function below to any of the following:
Operation
cmd_param
cmd_param_val
Here are some details about the arguments for the sessionize_data function:
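To make the sessionizing logic concrete, here is a toy, pure-Python sketch of the idea (an illustration, not the library's implementation): events for a single user are split into sessions whenever the gap between consecutive events, or the total session duration, exceeds a threshold.

```python
from datetime import datetime, timedelta

def sessionize(events, max_session_time=timedelta(minutes=20),
               max_event_sep=timedelta(minutes=2)):
    """Toy sessionizer: `events` is a time-sorted list of
    (timestamp, command) pairs for ONE user. A new session starts when the
    gap to the previous event exceeds `max_event_sep`, or when the session
    would exceed `max_session_time` in total."""
    sessions, current = [], []
    start = prev = None
    for ts, cmd in events:
        if current and (ts - prev > max_event_sep or ts - start > max_session_time):
            sessions.append(current)
            current = []
        if not current:
            start = ts  # first event of a new session
        current.append(cmd)
        prev = ts
    if current:
        sessions.append(current)
    return sessions

t0 = datetime(2020, 1, 1, 9, 0)
events = [(t0, "Set-User"),
          (t0 + timedelta(minutes=1), "Set-Mailbox"),
          (t0 + timedelta(minutes=30), "Add-MailboxPermission")]
print(sessionize(events))  # -> [['Set-User', 'Set-Mailbox'], ['Add-MailboxPermission']]
```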
Model the sessions
We will give a brief description of how the modelling works under the hood for each of the three session types.
Commands only
We treat the sessions as an ordered sequence of commands.
We apply the Markov assumption where we assume each command depends only on the command immediately before it.
This means the likelihood of each session can be computed by multiplying a sequence of transition probabilities together.
We use a sliding window (e.g. of length 3) throughout each session and then use the likelihood of the rarest window as the score for the session.
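The commands-only scoring described above can be sketched in a few lines of plain Python (an illustrative toy, not the library's implementation): estimate transition probabilities from training sessions, then slide a window over each session and keep the likelihood of the rarest window.

```python
from collections import defaultdict

START, END = "##START##", "##END##"

def train_transitions(sessions):
    """Estimate first-order Markov transition probabilities from sessions."""
    counts, totals = defaultdict(int), defaultdict(int)
    for sess in sessions:
        seq = [START] + sess + [END]
        for prev, curr in zip(seq, seq[1:]):
            counts[(prev, curr)] += 1
            totals[prev] += 1
    return {pair: c / totals[pair[0]] for pair, c in counts.items()}

def rarest_window_score(session, probs, window=3):
    """Slide a window over the session (with the end token appended) and
    return the likelihood of the rarest window: the product of its
    transition probabilities. Sessions too short for one window get nan."""
    seq = session + [END]
    if len(seq) < window:
        return float("nan")
    scores = []
    for i in range(len(seq) - window + 1):
        lik = 1.0
        for prev, curr in zip(seq[i:i + window], seq[i + 1:i + window]):
            lik *= probs.get((prev, curr), 1e-6)  # tiny prob for unseen pairs
        scores.append(lik)
    return min(scores)

train = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
probs = train_transitions(train)
print(rarest_window_score(["a", "b", "c"], probs) >
      rarest_window_score(["a", "b", "d"], probs))  # common session scores higher
```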
Commands with Parameters
All of the above ("commands only" case) except for one difference.
This time, we include the parameters in the modelling.
We make the assumption that the presence of each parameter is independent conditional on the command.
We therefore model the presence of the parameters as independent Bernoulli random variables (conditional on the command).
So to compute the likelihood of a session, each transition probability (of the commands) will be accompanied by a product of probabilities (for the parameters).
A subtlety to note is that we take the geometric mean of the product of parameter probabilities. This is so we don't penalise commands which happen to have more parameters set than average.
We use the same sliding window approach used with the "commands only" case.
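A minimal sketch of the geometric-mean idea (illustrative only; the probabilities here are made up, not learned from data):

```python
def param_factor(cmd, present_params, param_probs, all_params):
    """Probability factor for the parameters of one command: multiply
    P(param present) for set params and P(param absent) for unset ones,
    then take the geometric mean so commands with many parameters are
    not penalised simply for having more of them."""
    prod, n = 1.0, 0
    for p in all_params:
        prob = param_probs.get((cmd, p), 0.5)
        prod *= prob if p in present_params else (1.0 - prob)
        n += 1
    return prod ** (1.0 / n) if n else 1.0

# Made-up Bernoulli probabilities, conditional on the command.
param_probs = {("Set-User", "Identity"): 0.9, ("Set-User", "Force"): 0.1}
common = param_factor("Set-User", {"Identity"}, param_probs, ["Identity", "Force"])
rare = param_factor("Set-User", {"Force"}, param_probs, ["Identity", "Force"])
print(common > rare)  # the usual parameter pattern scores higher
```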
Commands with Parameters and their Values
All of the above ("commands with parameters" case) except for one difference.
This time, we include the values in the modelling.
Some rough heuristics are used to determine which parameters have values which are categorical (e.g. "true" and "false" or "high", "medium" and "low") vs values which are arbitrary strings (such as email addresses). There is the option to override the "modellable_params" directly in the Model class.
We also make the assumption that the values depend only on the parameters and not on the command.
So to compute the likelihood of a session, each transition probability (of the commands) will be accompanied by a product of probabilities (for the parameters and categorical values).
We use the same sliding window approach used with the "commands only" case.
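One plausible (hypothetical) heuristic for deciding whether a parameter's values look categorical is to check how many distinct values it takes relative to how often it appears. The library's actual heuristics may differ; this is only to illustrate the kind of rule involved.

```python
def likely_categorical(values, max_distinct=10, max_ratio=0.1):
    """Rough heuristic: treat a parameter's values as categorical when
    the number of distinct values is small, both in absolute terms and
    relative to how often the parameter appears."""
    distinct = len(set(values))
    return distinct <= max_distinct and distinct / len(values) <= max_ratio

audit_vals = ["true", "false"] * 50                      # 2 distinct in 100
identity_vals = [f"user{i}@x.com" for i in range(100)]   # all distinct
print(likely_categorical(audit_vals), likely_categorical(identity_vals))
```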
Important note:
If you set the window length to be k, then only sessions which have at least k-1 commands will have a valid (not np.nan) score. The reason for the -1 is because we append an end token to each session by default, so a session of length k-1 gets treated as length k during the scoring.
There are three high-level functions available in this library:
score_sessions
visualise_scored_sessions
score_and_visualise_sessions
We will first demonstrate the high level function for modelling the sessions.
We will do this for the "Commands Only" session type.
But depending on which column you chose as the event_col in the sessionize_data function, you could set the "session_column" parameter in the "score_sessions" function below to any of the following:
Operation_list
cmd_param_list
cmd_param_val_list
Here are some details about the arguments for the score_sessions function:
Now we demonstrate the visualization component of the library
We do this using the "visualise_scored_sessions" function. This function returns an interactive timeline plot which allows you to zoom into different sections etc.
The time of the session will be on the x-axis.
The computed likelihood metric will be on the y-axis.
lower likelihoods correspond to rarer sessions.
Important note:
During the scoring/modelling stage, if you set the window length to be k, then only sessions which have at least k-1 commands will appear in the interactive timeline plot. This is because sessions with fewer than k-1 commands will have a score of np.nan. The reason for the -1 is because we append an end token to each session by default, so a session of length k-1 gets treated as length k during the scoring.
Here are some details about the arguments for the visualise_scored_sessions function:
Now we demonstrate how you can score and visualise your sessions in one go.
We will do this for the "Commands only" session type.
But depending on which column you chose as the event_col in the sessionize_data function, you could set the "session_column" parameter in the "score_and_visualise_sessions" function below to any of the following:
Operation_list
cmd_param_list
cmd_param_val_list
Here are some details about the arguments for the score_and_visualise_sessions function:
Advanced Users: Access the Model Class Directly
Users who would like to configure arguments related to whether start and end tokens are used, or whether the geometric mean is computed, can access the Model class directly.
There is also the option to specify the modellable_params argument if you do not wish for rough heuristics to be used to determine which parameters take categorical values and are hence suitable for modelling. If you wish to experiment with modelling the values of all the parameters (categorical + arbitrary strings), then you can use this argument to do so.
Here are some details about the methods available for the Model class:
Sessionize Some Other Types of Logs using KQL
The aim of this section is to provide some starter guidance on how one might start to sessionize + model some other types of logs.
In order to do the sessionizing using KQL, we use the row_window_session function.
Important note: Throughout this section, the decisions made about which columns should be interpreted as commands/events and parameters are entirely subjective and alternative approaches may also be valid.
Sessionize Office Activity Logs
The cell below contains a Kusto query which queries the OfficeActivity table in Log Analytics. In this example, we wish for the sessions to be on a per UserId - ClientIP basis. In addition, we require that each session be no longer than 20 minutes in total, with consecutive commands no more than 2 minutes apart. (These requirements can be adjusted for different data sets and use cases.)
Here are some high level steps to the query:
Add a time filter which goes back far enough so you have enough data to train the model.
Filter to the desired type of logs.
Exclude some known automated users (optional)
Sort the rows by UserId, ClientIp, TimeGenerated in ascending order
Use the native KQL function row_window_session to create an additional "begin" column to aid creating the sessions
Summarize the commands (and optionally parameters) by UserId, ClientIp, begin
Optionally exclude sessions which have only 1 command
Note that in KQL, comments are made using //
Convert Exchange Sessions to Correct Format for the Model
Recall the allowed session types here
So let's see what needs to be done to the exchange_df
The "cmds" column is already in a suitable format of type (1). This is because it is a list of strings.
If we wish to also include the parameters (and optionally the corresponding values) in the model, then we need to transform the "params" column slightly.
Now we will model and visualise these sessions in one go.
We do this using the score_and_visualise_sessions function.
Since we created columns for all 3 session types, the session_column argument can be set to any of the following:
session
param_session
param_value_session
Sessionize AWS Cloud Trail Logs
The cell below contains a Kusto query which queries the AWSCloudTrail table in Log Analytics. In this example, we wish for the sessions to be on a per UserId - ClientIP - UserAgent - role basis. In addition, we require that each session be no longer than 20 minutes in total, with consecutive commands no more than 2 minutes apart. (These requirements can be adjusted for different data sets and use cases.)
Note we choose a much shorter time_back in this KQL query. This is just because the AWS Cloud Trail logs have a lot more data when compared with the exchange admin logs for this demo tenant. We therefore choose a shorter time back purely to prevent this demo notebook from slowing down.
Convert AWS sessions to the correct format for the model
Recall the allowed session types here
So let's see what needs to be done to the aws_df
The "cmds" column is already in a suitable format of type (1), since it is a list of strings. If we wish to also include the parameters (and optionally the corresponding values) in the model, then we need to transform the "params" column slightly.
Now we will model and visualise these sessions in one go.
We do this using the score_and_visualise_sessions function.
As before, since we created columns for all 3 session types, the session_column argument can be set to any of the following:
session
param_session
param_value_session
Sessionize VM Process Logs
The cell below contains a Kusto query which queries the VMProcess table in Log Analytics. In this example, we wish for the sessions to be on a per UserId - Computer basis. In addition, we require that each session be no longer than 20 minutes in total, with consecutive commands no more than 2 minutes apart. (These requirements can be adjusted for different data sets and use cases.)
Note that in the examples for Office Activity and AWS Cloud Trail logs, it was fairly clear cut from the data what we could use as parameters for each of the events/commands. However, for the VM Process Logs, it is less clear.
Some possible approaches:
1. The command line entries are provided, so a possible approach could be to parse the command lines into the commands used and their accompanying parameters.
2. The executable name could be used as the event/command, with either:
a) the services associated with the executable used as the parameters, or
b) a combination of some other columns used as the parameters.
In this example, we apply approach (2b). In particular, we use "ExecutableName" as the event/command, and the following columns as parameters: "DisplayName", "ProductName", "Group", "ProductVersion", "ExecutablePath".
Important note:
Some modelling assumptions are made in the anomalous_sequence subpackage of msticpy.
In particular, when we model the third session type (command + params + values), we make the assumption that the values depend only on the parameter and not on the command.
This means if we were to treat the parameters as a dictionary for example:
Cmd(name="miiserver", params={"ProductVersion": "123542", "ExecutablePath": "a/path"})
Then the value "123542" will be conditioned only on the param "ProductVersion", and the value "a/path" will be conditioned only on the param "ExecutablePath". But since the ProductVersion and ExecutablePath parameters will be present for all the events, this is not useful. We want the values to be conditioned on the executable.
Therefore, for this approach, we will use the second session type (command + params). For example:
Cmd(name="miiserver", params={"123542", "a/path"})
Now, the presence of "123542" and "a/path" will be modelled independently, conditional on the executable "miiserver".
(Note: this modification is still not perfect, since "123542" and "a/path" will each be modelled as Bernoulli instead of categorical. But this approach should hopefully still be effective at down-scoring the likelihood of the rarer param settings conditional on the executable.)
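A small sketch of building such type-2 sessions from VM process rows. The row dict and helper function below are illustrative; only the column names come from the tutorial, and a namedtuple stands in for the library's Cmd datatype.

```python
from collections import namedtuple

Cmd = namedtuple("Cmd", ["name", "params"])  # stand-in for msticpy's Cmd

def vm_row_to_cmd(executable, row):
    """Build a session-type-2 Cmd for a VM process event: the executable
    name is the command, and the *values* of the chosen columns act as the
    parameter set, so their presence is modelled conditional on the
    executable rather than on fixed parameter names."""
    cols = ["DisplayName", "ProductName", "Group", "ProductVersion", "ExecutablePath"]
    return Cmd(name=executable, params={str(row[c]) for c in cols if row.get(c)})

row = {"DisplayName": "svc", "ProductName": "prod", "Group": "grp",
       "ProductVersion": "123542", "ExecutablePath": "a/path"}
cmd = vm_row_to_cmd("miiserver", row)
print(cmd.name, "123542" in cmd.params)  # -> miiserver True
```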
Convert VM Process sessions to the correct format for the model
Recall the allowed session types here
So let's see what needs to be done to the vm_df
The "executables" column is already in a suitable format of type (1), since it is a list of strings. If we wish to also include the parameters in the model, then we need to transform the "params" column slightly.
Now we will model and visualise these sessions in one go.
We do this using the score_and_visualise_sessions function.
As before, since we created columns for 2 of the 3 session types, the session_column argument can be set to any of the following:
session
param_session