Path: blob/master/scenario-notebooks/Export Historical Log Data.ipynb
Historical Data Export
Notebook Version: 1.0
Python Version: Python 3.8
Required Packages: azure-monitor-query, azure-storage-file-datalake, azureml-synapse
Platforms Supported: Azure Machine Learning Notebooks connected to Azure Synapse Workspace
Data Source Required: Yes
Data Source: Azure Log Analytics
Spark Version: 3.1 or above
Description
Use this notebook to export historical data from your Log Analytics workspace. (This notebook extends the continuous log export tool in Microsoft Sentinel (see docs and blog post): historical logs are exported using the same data format and partition scheme as the continuously exported logs.)
In this notebook, we demo the one-time export of historical log data from the SigninLogs table, but this can easily be modified to target any subset of log data. The notebook only needs to be run once, following which exported logs can be used in Sentinel notebooks for large scale data analytics and machine learning on log data (see the Sentinel template notebooks for more details). Alternatively, you may wish to use this tool for archiving older logs in more cost-effective archive-tier Azure storage.
This notebook also makes use of the Azure Synapse integration for Sentinel notebooks to enable partitioning and writing of data files at scale. To set up the Synapse integration for Sentinel notebooks, please follow the instructions here.
Steps
1. Fetch data from the Sentinel workspace using the `azure-monitor-query` Python package
   - The requested data may be the entirety of a given table, or based on a custom KQL query
   - Querying and fetching of data is automatically chunked and run asynchronously to avoid API throttling issues
2. Write data to Azure Data Lake Gen2
3. Use Apache Spark (via Azure Synapse) to repartition the data to match the partition scheme created by the continuous log export tool
   - The continuous log export tool creates a separate nested directory for each (year, month, day, hour, 5-minute interval) tuple in the format: `{base_path}/y=<year>/m=<month>/d=<day>/h=<hour>/m=<5-minute-interval>`
   - For a year's worth of historical log data, we may be writing over 100,000 separate data files, so we rely on Spark's multi-executor parallelism to do this efficiently
4. Use the `azure-storage-file-datalake` Python package to rename a few high-level directories to match the partition scheme used by the continuous log export tool
5. Clean up any data stored in intermediate locations during the data ETL process
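The partition scheme above can be expressed as a small path-building helper. This is a minimal sketch; the zero-padding of the components and the formatting of the 5-minute bucket are assumptions, so check the result against your own continuously exported data:

```python
from datetime import datetime

def partition_path(base_path: str, ts: datetime) -> str:
    """Build the nested partition path used by the continuous log export.

    Zero-padded components and 5-minute buckets are assumed here.
    """
    bucket = (ts.minute // 5) * 5  # start of the 5-minute interval, e.g. 27 -> 25
    return (f"{base_path}/y={ts.year}/m={ts.month:02d}/d={ts.day:02d}"
            f"/h={ts.hour:02d}/m={bucket:02d}")

print(partition_path("export", datetime(2022, 5, 1, 13, 27)))
# export/y=2022/m=05/d=01/h=13/m=25
```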
Install Pre-Requisite Packages
0. Setup
Authenticate and connect to your Log Analytics workspace. The workspace ID is loaded from the config.json file which will have been created when you set up your first Sentinel notebook.
We now define functions that we will use to query and fetch the historical log data to export/archive. We use the azure-monitor-query Python package to invoke the Log Analytics REST API. Queries are chunked by time range to avoid throttling and truncation of returned data and the chunked API calls are executed asynchronously.
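The chunking-plus-async pattern can be sketched as follows. The `time_chunks` helper is pure Python; `fetch_chunk` is a hypothetical wrapper (the notebook's own helper names and signatures may differ) around the real `azure.monitor.query.aio.LogsQueryClient.query_workspace` API:

```python
from datetime import datetime, timedelta

def time_chunks(end_time, days_back, batch_size_days):
    """Split [end_time - days_back, end_time] into batch_size_days-sized
    (start, stop) windows, oldest first."""
    start = end_time - timedelta(days=days_back)
    chunks = []
    while start < end_time:
        stop = min(start + timedelta(days=batch_size_days), end_time)
        chunks.append((start, stop))
        start = stop
    return chunks

async def fetch_chunk(client, workspace_id, query, start, stop):
    # One API call per time window keeps each response under the
    # row/size limits; the calls themselves are awaited concurrently.
    return await client.query_workspace(workspace_id, query, timespan=(start, stop))

print(len(time_chunks(datetime(2022, 5, 1), days_back=10, batch_size_days=2.5)))  # 4
```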
Edit the KQL query in the cell below as appropriate to specify which logs/columns to query; to specify all the available logs for a given table, just set QUERY = <name-of-table>.
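For example (the projected column names below are illustrative, not a fixed schema):

```python
# To export every available column of a table, set QUERY to the table name:
QUERY = "SigninLogs"

# Or restrict the exported columns with a custom KQL query, e.g.:
# QUERY = "SigninLogs | project TimeGenerated, UserPrincipalName, ResultType"
print(QUERY)
```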
Here, we define the functions that will be used to interact with the ADLS storage account(s) to which we want to export our historical log data. These use the azure-storage-file-datalake Python package which uses the ADLS Gen2 REST API under the hood.
Nothing to change here (just boilerplate) - you can just run this cell as-is!
1. Fetching Log Data
You may wish to estimate costs that will be incurred on your Azure storage account before beginning the data export process. The query in the cell below will estimate the size of the data to be exported along with the number of blobs that would be written if you choose to partition data into 5-minute intervals (as is done by the continuous log export). Use this in conjunction with the Azure storage pricing calculator to determine costs that will be incurred for your storage setup. (Full billing details for ADLS Gen2 can be found here.)
Use the query in the next cell to inform how much data to export.
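A size-estimation query along these lines could be used; `estimate_data_size()`, `exp2()` and `bin()` are standard KQL functions, but the exact query in the notebook cell may differ:

```python
# Hedged sketch: estimate total data volume in GiB, and the number of
# distinct 5-minute intervals (roughly, the number of blobs that a
# 5-minute partition scheme would write).
SIZE_QUERY = """
SigninLogs
| summarize TotalGiB = sum(estimate_data_size(*)) / exp2(30),
            BlobCount = dcount(bin(TimeGenerated, 5m))
"""
print("estimate_data_size" in SIZE_QUERY)  # True
```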
This notebook fetches historical log data (via asynchronous, chunked queries to the log analytics REST API).
Variables to set:
- `end_time` - The datetime up to which logs should be fetched. If you have already set up a continuous export pipeline, ensure that this is set to an earlier date than the start of the continuously exported data.
- `days_back` - How far back before the `end_time` to query logs
- `batch_size_rows` OR `batch_size_days` - How many rows/fractional days of data to query in each API call
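For instance (illustrative values only; adjust to your own export window):

```python
from datetime import datetime, timezone

end_time = datetime(2022, 5, 1, tzinfo=timezone.utc)  # keep this earlier than the
                                                      # start of continuous export
days_back = 30            # export the 30 days ending at end_time
batch_size_days = 0.25    # one API call per 6 hours of data

n_calls = int(days_back / batch_size_days)
print(n_calls)  # 120 API calls
```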
Log Analytics workspaces and log queries in Azure Monitor are multitenant services that include limits to protect and isolate customers and to maintain quality of service. When querying a large amount of data, you should consider the following limits:
| Limit (as of May 2022) | Remedy |
|---|---|
| Maximum of 500,000 rows returned per query | Set `batch_size_rows` to <490K (or reduce the value of `batch_size_days`) |
| Maximum of ~100 MiB of data returned per query | Reduce the set of columns being exported (this can have a significant performance impact!); if the number of columns can't be reduced, reduce `batch_size_rows`/`batch_size_days` |
| Maximum of 200 requests per 30 seconds | Increase `batch_size_rows`/`batch_size_days`; reduce `days_back`; try again! (some 'Server Disconnected' errors are ephemeral and not due to the rate limit being hit) |
EXPERIMENTAL: Setting auto_batch = True will cause the query function to intelligently chunk queries based on rows returned and estimated data size, so as to avoid API limits.
Also consider that the fetched data will (initially) be held in memory on your Azure ML compute instance; depending on the volume of historical logs you wish to export, you may reach VM memory limits.
If this is a problem, consider exporting the data in chunks - e.g. instead of exporting 365 days of data at once export 100 days of data at a time by setting the values of end_time and days_back appropriately and re-running the notebook from this cell onwards for each separate chunk.
Alternatively, use a compute instance with more memory to run this notebook (such as the Azure E-Series VMs).
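The chunk-by-chunk approach can be planned with a small helper that computes the `(end_time, days_back)` pair for each re-run of the notebook (a sketch; the chunking strategy itself is up to you):

```python
from datetime import datetime, timedelta

def export_runs(final_end, total_days, chunk_days):
    """Return the (end_time, days_back) pair for each notebook re-run,
    newest chunk first."""
    runs, end, remaining = [], final_end, total_days
    while remaining > 0:
        d = min(chunk_days, remaining)
        runs.append((end, d))
        end -= timedelta(days=d)
        remaining -= d
    return runs

runs = export_runs(datetime(2022, 5, 1), total_days=365, chunk_days=100)
print(len(runs))  # 4 runs: 100 + 100 + 100 + 65 days
```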
Note: This cell may take a while to run!
Before continuing, check the output from the cell above:
Ensure that the dataframe in the output of the cell above contains the expected data
Ensure that there are no query failure or data truncation error messages in the output of the cell above
2. Write Data to ADLS
In the cell below, specify the details of the storage account and file to which to write the log data dataframe. This will write a single JSON file containing the historical log data. This is only used as a temporary staging location before the data is repartitioned across a large number of smaller files using Spark.
Note: Your Azure Synapse workspace will need to be able to access this ADLS location in the next step.
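A minimal sketch of the staging write, using the real `azure-storage-file-datalake` package (the notebook's own helper may chunk the upload differently; `staging_url` shows the `abfss://` form that Synapse will read in the next step):

```python
def staging_url(account: str, container: str, path: str) -> str:
    """abfss:// URL for the staging file, as Synapse will read it later."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

def upload_json(df, account, container, path, credential):
    # Hedged sketch: write the fetched pandas dataframe as a single
    # newline-delimited JSON file to ADLS Gen2.
    from azure.storage.filedatalake import DataLakeServiceClient  # pip install azure-storage-file-datalake
    service = DataLakeServiceClient(f"https://{account}.dfs.core.windows.net",
                                    credential=credential)
    file_client = service.get_file_system_client(container).get_file_client(path)
    file_client.upload_data(df.to_json(orient="records", lines=True), overwrite=True)

print(staging_url("mystorageacct", "staging", "historical/signinlogs.json"))
```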
3. Repartition Data Using Spark
Now we use PySpark to repartition the exported log data by timestamp in the following format:
We also write to the same location used by the continuous log export, whose base_path is:
Storing the exported logs data like this has two benefits:
This matches the partition scheme used by continuously exported logs, meaning that continuously exported data and historical log data can be read in a unified way by any notebooks or data pipelines that consume this data
Partitioning by timestamps can allow for efficient querying of file data - by encoding the timestamp values in file paths, we can minimise the number of required file reads when loading data from a specific time range in downstream tasks
Configure Azure ML and Azure Synapse Analytics
To run this Synapse PySpark code, the Synapse integration for AML notebooks needs to be set up: please refer to the template notebook, Configurate Azure ML and Azure Synapse Analytics, to configure your environment.
That notebook will configure your existing Azure Synapse workspace to create and connect to a Spark pool. You can then create a linked service and connect your AML workspace to the Azure Synapse workspace.
Note: Specify the input parameters in the step below in order to connect the AML workspace to the Synapse workspace using the linked service.
Start Spark Session
Enter your Synapse Spark compute below. To find the Spark compute, please follow these steps:
On the AML Studio left menu, navigate to Linked Services
Click on the name of the Linked Service you want to use
Select Spark pools tab
Get the name of the Spark pool you want to use
Repartition Data Using PySpark
Having started the Spark session, we can run PySpark code by starting a cell with the %%synapse line magic.
Doing this part of the data ETL process using Spark allows the partitioning and writing of data to be hugely parallelized - for a year's worth of log data, we may be creating over 100,000 data files (one per partition).
First, we read the historical log data from ADLS (where we exported it in step 2) into the Spark context.
Fill in the details of the storage account, ADLS container and directory/file to which the historical logs were exported in step 2.
Next, we use the TimeGenerated column to create the year, month, day, etc. columns we will use as partition keys.
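The partition-key derivation might look roughly like this. The `add_partition_columns` function is a hedged sketch of the `%%synapse` cell (it runs on the Spark pool, not locally), while `partition_keys` is a pure-Python mirror of the same logic; zero-padding and 5-minute buckets are assumptions. Note that the month column is named `month` rather than `m` here, presumably because the minute bucket already uses `m` as a column name, which is why step 4 renames the `month=` directories afterwards:

```python
from datetime import datetime

def partition_keys(ts: datetime) -> dict:
    """Pure-Python mirror of the partition keys derived from TimeGenerated."""
    return {"y": f"{ts.year}", "month": f"{ts.month:02d}", "d": f"{ts.day:02d}",
            "h": f"{ts.hour:02d}", "m": f"{(ts.minute // 5) * 5:02d}"}

def add_partition_columns(df):
    # Sketch of the Spark-side equivalent; runs inside the %%synapse cell.
    from pyspark.sql import functions as F
    pad = lambda col: F.lpad(col.cast("string"), 2, "0")
    return (df.withColumn("y", F.year("TimeGenerated").cast("string"))
              .withColumn("month", pad(F.month("TimeGenerated")))
              .withColumn("d", pad(F.dayofmonth("TimeGenerated")))
              .withColumn("h", pad(F.hour("TimeGenerated")))
              .withColumn("m", pad(F.floor(F.minute("TimeGenerated") / 5) * 5)))

print(partition_keys(datetime(2022, 5, 1, 13, 27)))
```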
Here we repartition the data on the partition keys created above, then write out one file per partition to the same location as the continuous log export. This will mean that continuously exported data and historical log data can be read from in a unified way and can also increase read performance for future use of the exported data.
Fill in the details of the storage account and ADLS container to which to write the repartitioned data (usually you will want this to be the same container to which the continuous export tool is writing logs). Filling in the Sentinel workspace subscription, resource group and workspace name as well ensures that logs are written to the same path as continuously exported logs.
Stop Spark Session
4. Rename Some Directories
There is one remaining difference between the partition scheme used by the continuous export and the historical data we have exported: the "month" partition directories need to be renamed to "m". We can make this change using the ADLS Python SDK.
That is, we need to change the directory structure from {base_path}/y=<year>/month=<month>/d=<day>/h=<hour>/m=<5-minute-interval> to {base_path}/y=<year>/m=<month>/d=<day>/h=<hour>/m=<5-minute-interval>
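A sketch of that rename, using the real `DataLakeDirectoryClient.rename_directory` API from `azure-storage-file-datalake` (which requires the new name to be prefixed with the filesystem name); `renamed_path` and `rename_month_dirs` are illustrative helpers, not the notebook's own:

```python
def renamed_path(path: str) -> str:
    """Compute the post-rename directory path (month=<x> -> m=<x>)."""
    return path.replace("/month=", "/m=")

def rename_month_dirs(service_client, container, month_dirs):
    # month_dirs: paths like "export/y=2022/month=05"
    fs = service_client.get_file_system_client(container)
    for path in month_dirs:
        dir_client = fs.get_directory_client(path)
        # rename_directory expects "{filesystem}/{new path}"
        dir_client.rename_directory(f"{container}/{renamed_path(path)}")

print(renamed_path("export/y=2022/month=05"))  # export/y=2022/m=05
```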
Note: This type of change can be done efficiently due to the nature of the ADLS Gen2 hierarchical filesystem (see details). Doing a similar task on a standard blob storage container would be much slower since the pseudo-filesystem means that each of the blobs within the high-level 'directory' would need to be individually renamed under the hood.
5. Cleanup
Our exported historical log data is now ready. We can now remove any data from the intermediate staging location.
Note: You may want to navigate to your Azure storage account and check that the data has been written correctly before deleting any data.