GitHub Repository: Azure/Azure-Sentinel-Notebooks
Path: blob/master/tutorials-and-examples/feature-tutorials/Data_Queries.ipynb
Kernel: Python 3

Title: msticpy - Data

Description:

This package provides functions for defining data sources, connectors to them, and queries against them, as well as the ability to call these elements to return query results from the defined data sources. The package currently supports connections to Log Analytics/Microsoft Sentinel/Azure Security Center, and to the Microsoft Security Graph.

The first step in using this package is to install the msticpy package.

%pip install --upgrade msticpy[azsentinel]
Collecting git+https://github.com/microsoft/msticpy
Building wheels for collected packages: msticpy
  Building wheel for msticpy (setup.py): started
  Building wheel for msticpy (setup.py): finished with status 'done'
Successfully built msticpy
Installing collected packages: msticpy
Successfully installed msticpy-0.2.1
# Check we are running Python 3.6 or later
import sys

MIN_REQ_PYTHON = (3, 6)
if sys.version_info < MIN_REQ_PYTHON:
    print('Check the Kernel->Change Kernel menu and ensure that Python 3.6')
    print('or later is selected as the active kernel.')
    sys.exit("Python %s.%s or later is required.\n" % MIN_REQ_PYTHON)

# imports
import yaml
import msticpy.nbtools as nbtools

# data library imports
from msticpy.data.data_providers import QueryProvider
import msticpy.data.data_query_reader as QueryReader
from msticpy.data.param_extractor import extract_query_params
import msticpy.nbtools as mas

print('Imports Complete')
Imports Complete

Instantiating a Query Provider

In order to connect to and query a data source, we need to define what sort of Data Environment we want to connect to and query (in this notebook we will use Log Analytics as an example). To view the available options, call QueryProvider.list_data_environments(), which returns a list of all the available options.

After selecting a Data Environment we can initialize our Query Provider by calling QueryProvider(DATA_ENVIRONMENT). This will load the relevant driver for connecting to the data environment we have selected, as well as provision a query store for us and add queries from our default query directory.

There are two other optional parameters we can pass when initializing our Query Provider to customize it further:

  • We can choose to initialize our Query Provider with a driver other than the default one with QueryProvider(data_environment=DATA_ENVIRONMENT, driver=QUERY_DRIVER)

  • We can choose to import queries from a custom query directory (see - Creating a new set of queries for more details) with QueryProvider(data_environment=DATA_ENVIRONMENT, driver=QUERY_DRIVER, query_path=QUERY_DIRECTORY_PATH).

For now we will simply create a Query Provider with default values.

Query provider interface to queries.

Parameters
----------
data_environment : Union[str, DataEnvironment]
    Name or Enum of environment for the QueryProvider
driver : DriverBase, optional
    Override the built-in driver (query execution class)
    and use your own driver (must inherit from `DriverBase`)
data_environments = QueryProvider.list_data_environments()
print(data_environments)

qry_prov = QueryProvider(data_environment='LogAnalytics')
['LogAnalytics', 'Kusto', 'AzureSecurityCenter', 'SecurityGraph'] Please wait. Loading Kqlmagic extension...

Connecting to a Data Environment

Once we have instantiated the query provider and loaded the relevant driver, we can connect to the Data Environment. This is done by calling the connect() function of the Query Provider we just initialized and passing it a connection string to use.

For Log Analytics/Microsoft Sentinel the connection string is in the format of loganalytics://code().tenant("TENANT_ID").workspace("WORKSPACE_ID"). Other Data Environments will have different connection string formats.

connect(self, connection_str: str, **kwargs):
    Connect to data source.

Parameters
----------
connection_str : str
    Connection string for the data source
ws_id = input('Workspace ID')
ten_id = input('Tenant ID')
la_connection_string = f'loganalytics://code().tenant("{ten_id}").workspace("{ws_id}")'
qry_prov.connect(connection_str=la_connection_string)
Workspace ID xxxxxxxxxxxxxxxxxxxxxxxxxxx
Tenant ID xxxxxxxxxxxxxxxxxxxxxxxxxxx

Reviewing available queries

Upon connecting to the relevant Data Environment we need to look at what query options we have available to us. In order to do this we can call QUERY_PROVIDER.list_queries(). This will return a generator with the names of all the queries in our store.

The results returned show the data family the query belongs to and the name of the specific query.

list_queries(self):
    Return list of family.query in the store.

Returns
-------
Iterable[str]
    List of queries
queries = qry_prov.list_queries()
for query in queries:
    print(query)
LinuxSyslog.all_syslog
LinuxSyslog.cron_activity
LinuxSyslog.squid_activity
LinuxSyslog.sudo_activity
LinuxSyslog.user_group_activity
LinuxSyslog.user_logon
SecurityAlert.get_alert
SecurityAlert.list_alerts
SecurityAlert.list_alerts_counts
SecurityAlert.list_alerts_for_ip
SecurityAlert.list_related_alerts
WindowsSecurity.get_host_logon
WindowsSecurity.get_parent_process
WindowsSecurity.get_process_tree
WindowsSecurity.list_host_logon_failures
WindowsSecurity.list_host_logons
WindowsSecurity.list_host_processes
WindowsSecurity.list_hosts_matching_commandline
WindowsSecurity.list_matching_processes
WindowsSecurity.list_processes_in_session
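Since each entry is a "family.query" string, the list is easy to filter in plain Python, for example to see only the queries for one data family. A small sketch, using a sample of the names above in place of a live provider (with a live connection you would iterate over qry_prov.list_queries() instead):

```python
# A sample of the names above, standing in for a live qry_prov.list_queries() call.
query_names = [
    "LinuxSyslog.all_syslog",
    "SecurityAlert.list_alerts",
    "WindowsSecurity.list_host_logons",
]

def queries_for_family(names, family):
    """Return only the query names belonging to the given data family."""
    return [name for name in names if name.split(".")[0] == family]

print(queries_for_family(query_names, "SecurityAlert"))
```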

To get further details on a specific query call QUERY_PROVIDER.DATA_FAMILY.QUERY_NAME('?') or QUERY_PROVIDER.DATA_FAMILY.QUERY_NAME('help')

This will display:

  • Query Name

  • What Data Environment it is designed for

  • Short description of what the query does

  • What parameters the query accepts

  • The raw query that will be run

qry_prov.SecurityAlert.list_alerts('?')
Query:  list_alerts
Data source:  LogAnalytics
Retrieves list of alerts

Parameters
----------
add_query_items: str (optional)
    Additional query clauses
end: datetime
    Query end time
path_separator: str (optional)
    Path separator
    (default value is: \\)
query_project: str (optional)
    Column project statement
    (default value is: | project-rename StartTimeUtc = StartTime, EndTim...)
start: datetime
    Query start time
subscription_filter: str (optional)
    Optional subscription/tenant filter expression
    (default value is: true)
table: str (optional)
    Table name
    (default value is: SecurityAlert)
Query:
{table} {query_project}
| where {subscription_filter}
| where TimeGenerated >= datetime({start})
| where TimeGenerated <= datetime({end})
| extend extendedProps = parse_json(ExtendedProperties)
| extend CompromisedEntity = tostring(extendedProps["Compromised Host"])
| project-away extendedProps
{add_query_items}

Running a pre-defined query

To run a query from our query store we again call QUERY_PROVIDER.DATA_FAMILY.QUERY_NAME(**kwargs), but this time we pass the parameters required by that query as keyword arguments.

This will return a Pandas DataFrame of the results with the columns determined by the query parameters. Should the query fail for some reason an exception will be raised.

alerts = qry_prov.SecurityAlert.list_alerts(
    start='2019-07-21 23:43:18.274492',
    end='2019-07-27 23:43:18.274492',
)
alerts.head()
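The add_query_items parameter listed in the query help appends extra clauses to the end of the query. As a rough illustration of its effect (this is only a sketch using str.format on a simplified version of the list_alerts template; msticpy does its own parameter extraction and substitution internally):

```python
# Simplified version of the list_alerts query template, for illustration only.
template = (
    "{table} "
    "| where TimeGenerated >= datetime({start}) "
    "| where TimeGenerated <= datetime({end}) "
    "{add_query_items}"
)

# Substituting parameters the way the provider (roughly) does:
query = template.format(
    table="SecurityAlert",
    start="2019-07-21 23:43:18",
    end="2019-07-27 23:43:18",
    add_query_items="| summarize AlertCount=count() by AlertName",
)
print(query)
```

Passing the same string as add_query_items to qry_prov.SecurityAlert.list_alerts() would append the summarize clause to the real query in the same way.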

It is also possible to pass query objects as arguments before defining keyword arguments. For example, if you wanted to define query times as an object rather than as start and end keyword arguments, you could simply pass a QueryTime object to the pre-defined query.

query_times = mas.nbwidgets.QueryTime(units='day', max_before=40, max_after=1, before=5)
query_times.display()
HTML(value='<h4>Set query time boundaries</h4>')
HBox(children=(DatePicker(value=datetime.date(2019, 7, 26), description='Origin Date'), Text(value='23:43:18.2…
VBox(children=(IntRangeSlider(value=(-5, 1), description='Time Range (day):', layout=Layout(width='80%'), max=…
qry_prov.SecurityAlert.list_alerts(query_times)

Running an ad-hoc query

It is also possible to run ad-hoc queries via a similar method. Rather than calling a named query from the Query Provider's query store, we can pass a query directly to our Query Provider with QUERY_PROVIDER.exec_query(query=QUERY_STRING). This will execute the query string with the driver contained in the Query Provider and return the data in a Pandas DataFrame. As with pre-defined queries, an exception will be raised should the query fail to execute.

query(self, query: str) -> Union[pd.DataFrame, Any]:
    Execute query string and return DataFrame of results.

Parameters
----------
query : str
    The kql query to execute

Returns
-------
Union[pd.DataFrame, results.ResultSet]
    A DataFrame (if successful) or Kql ResultSet if an error.
test_query = '''
SecurityAlert
| take 5
'''
query_test = qry_prov.exec_query(query=test_query)
query_test.head()
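Because exec_query() takes a plain string, ad-hoc queries can be parameterized with ordinary Python string formatting before being submitted. A minimal sketch (the table and column names here are just examples, not part of the API):

```python
# Build an ad-hoc KQL string with an f-string, then hand it to the provider.
host = "UbuntuDevEnv"   # example hostname
take_n = 5              # number of rows to return

adhoc_query = f'''
Syslog
| where Computer == "{host}"
| take {take_n}
'''
# With a connected provider, this would return a DataFrame:
# results = qry_prov.exec_query(query=adhoc_query)
print(adhoc_query)
```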

Creating a new set of queries

msticpy provides a number of pre-defined queries for use with the data package. You can also add additional queries to be imported and used by your Query Provider. These are defined in YAML-format files; examples can be found in the msticpy GitHub repository at https://github.com/microsoft/msticpy/tree/master/msticpy/data/queries.

The required structure of these query definition files is as follows:

  • metadata

    • version: The version number of the definition file

    • description: A description of the purpose of this collection of query definitions

    • data_environments[]: A list of the Data Environments that the defined queries can be run against (1 or more)

    • data_families[]: A list of Data Families the defined queries relate to; these families are defined as part of msticpy.data.query_defns

    • tags[]: A list of tags to help manage definition files

  • defaults: A set of defaults that apply to all queries in the file

    • metadata: Metadata regarding a query

      • data_source: The data source to be used for the query

    • parameters: Parameters to be passed to the query

      • name: The parameter name

      • description: A description of what the parameter is

      • type: The data type of the parameter

      • default: The default value for that parameter

  • sources: a set of queries

    • name: The name of the query
    • description: A description of the query's function
    • metadata: Any metadata associated with the query
    • args: The arguments of the query
      • query: The query to be executed
      • uri: A URI associated with the query
    • parameters: Any parameters required by the query not covered by defaults
      • name: The parameter name
      • description: A description of what the parameter is
      • type: The data type of the parameter
      • default: The default value for that parameter
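Putting this structure together, a minimal query definition file might look like the following sketch. It mirrors the syslog_example query used later in this notebook; the exact field values here are illustrative, so consult the example files in the msticpy repository for the authoritative layout:

```yaml
metadata:
  version: 1
  description: Example custom query definitions
  data_environments: [LogAnalytics]
  data_families: [LinuxSyslog]
  tags: ["linux", "syslog"]
defaults:
  metadata:
    data_source: "linux_syslog"
  parameters:
    table:
      description: Table name
      type: str
      default: "Syslog"
    start:
      description: Query start time
      type: datetime
    end:
      description: Query end time
      type: datetime
sources:
  syslog_example:
    description: Example query
    metadata:
    args:
      query: '
        {table}
        | where TimeGenerated >= datetime({start})
        | where TimeGenerated <= datetime({end})
        | where Computer == "{host_name}"
        | take 5'
    parameters:
      host_name:
        description: Hostname to query for
        type: str
```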

There are also a number of tools within the package to assist in validating new query definition files once created.

data_query_reader.find_yaml_files
    Return iterable of yaml files found in `source_path`.

Parameters
----------
source_path : str
    The source path to search in.
recursive : bool, optional
    Whether to recurse through subfolders.
    By default False

Returns
-------
Iterable[str]
    File paths of yaml files found.

data_query_reader.validate_query_defs
    Validate content of query definition.

Parameters
----------
query_def_dict : dict
    Dictionary of query definition yaml file contents.

Returns
-------
bool
    True if validation succeeds.

Raises
------
ValueError
    The validation failure reason is returned in the
    exception message (arg[0])

validate_query_defs() does not perform comprehensive checks on the file, but it does check that the key elements required in the file are present.

for file in QueryReader.find_yaml_files(source_path="C:\\queries"):
    with open(file) as f_handle:
        yaml_file = yaml.safe_load(f_handle)
        if QueryReader.validate_query_defs(query_def_dict=yaml_file):
            print(f'{file} is a valid query definition')
        else:
            print(f'There is an error with {file}')
C:\queries\example.yaml is a valid query definition

Adding a new set of queries and running them

Once you are happy with a query definition file, you can import it with QUERY_PROVIDER.import_query_file(query_file=PATH_TO_QUERY_FILE). This will load the query file into the Query Provider's query store, from where its queries can be called.

qry_prov.import_query_file(query_file=r'C:\queries\example.yaml')

Once imported the queries in the files appear in the Query Provider's Query Store alongside the others and can be called in the same manner as pre-defined queries.

If you have created a large number of query definition files and you want them automatically imported into a Query Provider's query store at initialization, you can specify a directory containing these queries in the msticpyconfig.yaml file under QueryDefinitions: Custom:

For example if I have a folder at C:\queries I will set the config file to:

QueryDefinitions:
  Default: "queries"
  Custom:
    - "C:\queries"
    - "C:\queries2"

Having the Custom field populated means that the Query Provider will automatically enumerate all the YAML files in the directories provided and import the relevant queries into the query store at initialization, alongside the default queries. Custom queries with the same name as default queries will overwrite the default queries.

queries = qry_prov.list_queries()
for query in queries:
    print(query)
LinuxSyslog.all_syslog
LinuxSyslog.cron_activity
LinuxSyslog.squid_activity
LinuxSyslog.sudo_activity
LinuxSyslog.syslog_example
LinuxSyslog.user_group_activity
LinuxSyslog.user_logon
SecurityAlert.get_alert
SecurityAlert.list_alerts
SecurityAlert.list_alerts_counts
SecurityAlert.list_alerts_for_ip
SecurityAlert.list_related_alerts
WindowsSecurity.get_host_logon
WindowsSecurity.get_parent_process
WindowsSecurity.get_process_tree
WindowsSecurity.list_host_logon_failures
WindowsSecurity.list_host_logons
WindowsSecurity.list_host_processes
WindowsSecurity.list_hosts_matching_commandline
WindowsSecurity.list_matching_processes
WindowsSecurity.list_processes_in_session
qry_prov.LinuxSyslog.syslog_example('?')
Query:  syslog_example
Data source:  LogAnalytics
Example query

Parameters
----------
add_query_items: str (optional)
    Additional query clauses
end: datetime
    Query end time
host_name: str
    Hostname to query for
query_project: str (optional)
    Column project statement
    (default value is: | project TenantId, Computer, Facility, TimeGener...)
start: datetime
    Query start time
subscription_filter: str (optional)
    Optional subscription/tenant filter expression
    (default value is: true)
table: str (optional)
    Table name
    (default value is: Syslog)
Query:
{table}
| where {subscription_filter}
| where TimeGenerated >= datetime({start})
| where TimeGenerated <= datetime({end})
| where Computer == "{host_name}"
| take 5
qry_prov.LinuxSyslog.syslog_example(
    start='2019-07-21 23:43:18.274492',
    end='2019-07-27 23:43:18.274492',
    host_name='UbuntuDevEnv',
)

If a defined query is causing difficulties and not producing the expected results, it can be useful to see the raw query exactly as it is passed to the Data Environment. If you call a query with 'print' along with the parameters required by that query, it will construct and print out the query string that would be run.

qry_prov.LinuxSyslog.syslog_example(
    'print',
    start='2019-07-21 23:43:18.274492',
    end='2019-07-27 23:43:18.274492',
    host_name='UbuntuDevEnv',
)
' Syslog | where true | where TimeGenerated >= datetime(2019-07-21 23:43:18.274492) | where TimeGenerated <= datetime(2019-07-27 23:43:18.274492) | where Computer == "UbuntuDevEnv" | take 5'