GitHub Repository: Azure/Azure-Sentinel-Notebooks
Path: blob/master/tutorials-and-examples/feature-tutorials/IoCExtract.ipynb
Kernel: Python [conda env:condadev] *

Title: msticpy - IoC Extraction

Description:

This class allows you to extract IoC patterns from a string or a DataFrame. Several patterns are built in to the class and you can override these or supply new ones.

You must have msticpy installed to run this notebook:

%pip install --upgrade msticpy
# Imports
import sys
MIN_REQ_PYTHON = (3,6)
if sys.version_info < MIN_REQ_PYTHON:
    print('Check the Kernel->Change Kernel menu and ensure that Python 3.6')
    print('or later is selected as the active kernel.')
    sys.exit("Python %s.%s or later is required.\n" % MIN_REQ_PYTHON)

from IPython import get_ipython
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 100)

# Load test data
process_tree = pd.read_csv('data/process_tree.csv')
process_tree[['CommandLine']].head()

Looking for IoC in a String

Here we:

  • Get a command line from our data set.

  • Pass it to the IoC extractor.

  • View the results.

# get a commandline from our data set
cmdline = process_tree['CommandLine'].loc[78]
cmdline
'netsh start capture=yes IPv4.Address=1.2.3.4 tracefile=C:\\\\Users\\\\user\\\\AppData\\\\Local\\\\Temp\\\\bzzzzzz.txt'
# Instantiate an IoCExtract object
from msticpy.sectools import IoCExtract
ioc_extractor = IoCExtract()

# any IoCs in the string?
iocs_found = ioc_extractor.extract(cmdline)

if iocs_found:
    print('\nPotential IoCs found in alert process:')
    display(iocs_found)
Potential IoCs found in alert process:
defaultdict(set, {'ipv4': {'1.2.3.4'}, 'windows_path': {'C:\\\\Users\\\\user\\\\AppData\\\\Local\\\\Temp\\\\bzzzzzz.txt'}})
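To make the shape of this result concrete, here is a minimal sketch of regex-based extraction that returns a mapping of IoC type to a set of matches. The two patterns in `SIMPLE_PATTERNS` and the helper `simple_extract` are simplified stand-ins invented for this illustration; they are not the patterns or code that msticpy uses.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately simplified patterns (not msticpy's built-in ones)
SIMPLE_PATTERNS = {
    "ipv4": re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])"),
    "windows_path": re.compile(r"[A-Za-z]:\\\S+"),
}

def simple_extract(text):
    """Return a dict mapping each IoC type to the set of matches found in text."""
    found = defaultdict(set)
    for ioc_type, pattern in SIMPLE_PATTERNS.items():
        for match in pattern.finditer(text):
            found[ioc_type].add(match.group(0))
    return found

sample = r"netsh start capture=yes IPv4.Address=1.2.3.4 tracefile=C:\Users\user\AppData\Local\Temp\bzzzzzz.txt"
result = simple_extract(sample)
print(dict(result))
```

Using a set per type means that an observable repeated several times in the input is reported only once.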

If we have a DataFrame, look for IoCs in the whole data set

Pass a different DataFrame in the data parameter of ioc_extractor.extract() to search other data sets. Use the columns parameter to specify which column or columns you want to search.

ioc_extractor = IoCExtract()
ioc_df = ioc_extractor.extract(data=process_tree, columns=['CommandLine'])
if len(ioc_df):
    display(HTML("<h3>IoC patterns found in process tree.</h3>"))
    display(ioc_df)
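As a rough illustration of what a row-wise search produces, the sketch below scans a small set of invented command lines and emits one record per match, keeping the index of the source row. The pattern `IPV4` and the helper `extract_rows` are assumptions made for this example and are much simpler than the real extractor.

```python
import re

# Simplified, hypothetical IPv4 pattern (not the one msticpy ships)
IPV4 = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")

def extract_rows(rows):
    """rows maps a row index to its text; return one record per unique match in each row."""
    records = []
    for index, text in rows.items():
        for value in sorted(set(IPV4.findall(text))):
            records.append({"IoCType": "ipv4", "Observable": value, "SourceIndex": index})
    return records

# Invented sample data standing in for a CommandLine column
command_lines = {
    0: "ping 10.0.0.1",
    1: "notepad.exe",
    2: "tracert 10.0.0.1 192.168.1.5",
}
records = extract_rows(command_lines)
for record in records:
    print(record)
```

Rows without a match contribute nothing to the output, and a row with several matches contributes several records that share the same source index.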

IoCExtract API

# IoCExtract docstring
ioc_extractor.extract?

Predefined Regex Patterns

from html import escape
extractor = IoCExtract()

for ioc_type, pattern in extractor.ioc_types.items():
    esc_pattern = escape(pattern.comp_regex.pattern.strip())
    display(HTML(f'<b>{ioc_type}</b>'))
    display(HTML(f'<div style="margin-left:20px"><pre>{esc_pattern}</pre></div>'))

Adding your own pattern(s)

Docstring:

Add an IoC type and regular expression to use to the built-in set.

Parameters
----------
ioc_type : str
    A unique name for the IoC type
ioc_regex : str
    A regular expression used to search for the type
priority : int, optional
    Priority of the regex match vs. other ioc_patterns. 0 is
    the highest priority (the default is 0).
group : str, optional
    The regex group to match (the default is None,
    which will match on the whole expression)

Notes
-----
Pattern priorities.
    If two IocType patterns match on the same substring, the matched
    substring is assigned to the pattern/IocType with the highest
    priority. E.g. `foo.bar.com` will match types: `dns`, `windows_path`
    and `linux_path` but since `dns` has a higher priority, the expression
    is assigned to the `dns` matches.
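The priority rule in the notes can be sketched in a few lines. Everything here is assumed for illustration: the two patterns, their priority values and the helper `resolve_by_priority` are not taken from msticpy. The only point is that a substring matched by several patterns ends up under the type with the lowest priority number.

```python
import re

# Hypothetical (ioc_type, priority, pattern) triples; 0 would be the highest priority
PATTERNS = [
    ("loose_token", 2, re.compile(r"\b[\w.-]+\.[\w-]+\b")),
    ("dns", 1, re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b")),
]

def resolve_by_priority(text):
    """Assign each matched substring to the matching type with the lowest priority number."""
    best = {}  # substring -> (priority, ioc_type)
    for ioc_type, priority, pattern in PATTERNS:
        for match in pattern.finditer(text):
            value = match.group(0)
            if value not in best or priority < best[value][0]:
                best[value] = (priority, ioc_type)
    return {value: ioc_type for value, (_, ioc_type) in best.items()}

assigned = resolve_by_priority("lookup foo.bar.com")
print(assigned)
```

Both patterns match `foo.bar.com`, but only the `dns` assignment survives because its priority number is lower.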
import re
rcomp = re.compile(r'(?P<pipe>\\\\\.\\pipe\\[^\s\\]+)')
extractor.add_ioc_type(ioc_type='win_named_pipe', ioc_regex=r'(?P<pipe>\\\\\.\\pipe\\[^\s\\]+)')

# Check that it added ok
print(extractor.ioc_types['win_named_pipe'])

# Use it in our data set. Search with the same instance that the new type was added to.
extractor.extract(data=process_tree, columns=['CommandLine']).query("IoCType == 'win_named_pipe'")
IoCPattern(ioc_type='win_named_pipe', comp_regex=re.compile('(?P<pipe>\\\\\\\\\\.\\\\pipe\\\\[^\\s\\\\]+)', re.IGNORECASE|re.MULTILINE|re.VERBOSE), priority=0, group=None)
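Before registering a custom pattern it can help to try it against a sample string using the standard `re` module. The sketch below compiles the same expression with the flags shown in the printed `IoCPattern`; the sample command line is invented for the test.

```python
import re

PIPE_REGEX = r'(?P<pipe>\\\\\.\\pipe\\[^\s\\]+)'
pipe_pattern = re.compile(PIPE_REGEX, re.IGNORECASE | re.MULTILINE | re.VERBOSE)

# Invented sample text containing a Windows named pipe path
sample = r'cmd.exe /c echo test > \\.\pipe\example_pipe'

match = pipe_pattern.search(sample)
pipe_name = match.group('pipe') if match else None
print(pipe_name)
```

Because the pattern is compiled with `re.VERBOSE`, any literal whitespace or `#` character in a custom expression would need to be escaped.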

extract() method

Parameters
----------
src : str, optional
    source string in which to look for IoC patterns
    (the default is None)
data : pd.DataFrame, optional
    input DataFrame from which to read source strings
    (the default is None)
columns : list, optional
    The list of columns to use as source strings,
    if the `data` parameter is used. (the default is None)

Other Parameters
----------------
ioc_types : list, optional
    Restrict matching to just specified types.
    (default is all types)
include_paths : bool, optional
    Whether to include path matches (which can be noisy)
    (the default is False - excludes 'windows_path'
    and 'linux_path'). If `ioc_types` is specified
    this parameter is ignored.

Returns
-------
Any
    dict of found observables (if input is a string) or
    DataFrame of observables

Notes
-----
Extract takes either a string or a pandas DataFrame as input.
When using the string option as an input extract will
return a dictionary of results.
When using a DataFrame the results will be returned as a new
DataFrame with the following columns:
- IoCType: the mnemonic used to distinguish different IoC Types
- Observable: the actual value of the observable
- SourceIndex: the index of the row in the input DataFrame from
  which the source for the IoC observable was extracted.

IoCType Pattern selection
The default list is: ['ipv4', 'ipv6', 'dns', 'url',
'md5_hash', 'sha1_hash', 'sha256_hash'] plus any
user-defined types.
'windows_path', 'linux_path' are excluded unless `include_paths`
is True or explicitly included in `ioc_types`.
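The type-selection rules in the notes can be summarized as a small function. `select_ioc_types` and its `user_types` argument are hypothetical names used only for this sketch; it restates the documented behavior and is not the library's implementation.

```python
DEFAULT_TYPES = ['ipv4', 'ipv6', 'dns', 'url', 'md5_hash', 'sha1_hash', 'sha256_hash']
PATH_TYPES = ['windows_path', 'linux_path']

def select_ioc_types(ioc_types=None, include_paths=False, user_types=()):
    """Return the IoC types to search, following the rules described in the docstring."""
    if ioc_types:
        # An explicit list wins; include_paths is ignored in this case
        return list(ioc_types)
    selected = DEFAULT_TYPES + list(user_types)
    if include_paths:
        selected = selected + PATH_TYPES
    return selected

print(select_ioc_types())
print(select_ioc_types(include_paths=True))
print(select_ioc_types(ioc_types=['ipv4', 'linux_path']))
```

The last call shows how a path type can still be searched without `include_paths`, by naming it explicitly in `ioc_types`.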
# You can specify multiple columns
ioc_extractor.extract(data=process_tree, columns=['NewProcessName', 'CommandLine']).head(10)

extract_df()

extract_df behaves identically to extract called with the data parameter. It can be more convenient when you know that your input is a DataFrame.

ioc_extractor.extract_df(process_tree, columns=['NewProcessName', 'CommandLine']).head(10)

SourceIndex column allows you to merge the results with the input DataFrame

Where an input row has multiple IoC matches, the output of this merge contains duplicate rows from the input (one per IoC match). The previous index is preserved in the second column (and in the SourceIndex column).

Note: you will need to set the type of the SourceIndex column. In the example below we are matching with the default numeric index, so we force the type to be numeric. If your index has a different dtype, you will need to convert SourceIndex (dtype=object) to match the type of your index column.

input_df = process_tree.head(20)
output_df = ioc_extractor.extract(data=input_df, columns=['NewProcessName', 'CommandLine'])
# set the type of the SourceIndex column. In this case we are matching with the default numeric index.
output_df['SourceIndex'] = pd.to_numeric(output_df['SourceIndex'])
merged_df = pd.merge(left=input_df, right=output_df, how='outer', left_index=True, right_on='SourceIndex')
merged_df.head()
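The effect of the merge is easier to see on a tiny example. In the sketch below both DataFrames are invented: `output_df` is a hand-built stand-in for extractor output, with SourceIndex deliberately stored as strings in an object column so that the type conversion step is needed.

```python
import pandas as pd

# Invented input data with the default integer index
input_df = pd.DataFrame(
    {"CommandLine": ["ping 10.0.0.1", "notepad.exe", "curl 10.0.0.2 10.0.0.3"]}
)

# Hand-built stand-in for extractor output; SourceIndex has dtype=object here
output_df = pd.DataFrame({
    "IoCType": ["ipv4", "ipv4", "ipv4"],
    "Observable": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "SourceIndex": pd.Series(["0", "2", "2"], dtype=object),
})

# Convert SourceIndex so that it can be matched against the integer index
output_df["SourceIndex"] = pd.to_numeric(output_df["SourceIndex"])

merged_df = pd.merge(left=input_df, right=output_df, how="outer",
                     left_index=True, right_on="SourceIndex")
print(merged_df[["CommandLine", "Observable", "SourceIndex"]])
```

The input row with two matches appears twice in the result, and the row with no match is kept (because of `how="outer"`) with a missing Observable.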

IPython magic

You can use the line magic %ioc or the cell magic %%ioc to extract IoCs from text pasted directly into a cell.

The ioc magic supports the following options:

--out OUT, -o OUT
    The name of the variable (`OUT`) in which to return the results.
    Note: the output variable is a dictionary of IoCs grouped by IoC type.
--ioc_types IOC_TYPES, -i IOC_TYPES
    The types of IoC to search for (comma-separated string)
%%ioc --out ioc_capture
netsh start capture=yes IPv4.Address=1.2.3.4 tracefile=C:\Users\user\AppData\Local\Temp\bzzzzzz.txt
hostname customers-service.ddns.net Feb 5, 2020, 2:20:35 PM 7
URL https://two-step-checkup.site/securemail/secureLogin/challenge/url?ucode=d50a3eb1-9a6b-45a8-8389-d5203bbddaa1&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;service=mailservice&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;type=password Feb 5, 2020, 2:20:35 PM 1
hostname mobile.phonechallenges-submit.site Feb 5, 2020, 2:20:35 PM 8
hostname youtube.service-activity-checkup.site Feb 5, 2020, 2:20:35 PM 8
hostname www.drive-accounts.com Feb 5, 2020, 2:20:35 PM 7
hostname google.drive-accounts.com Feb 5, 2020, 2:20:35 PM 7
domain niaconucil.org Feb 5, 2020, 2:20:35 PM 11
domain isis-online.net Feb 5, 2020, 2:20:35 PM 11
domain bahaius.info Feb 5, 2020, 2:20:35 PM 11
domain w3-schools.org Feb 5, 2020, 2:20:35 PM 12
domain system-services.site Feb 5, 2020, 2:20:35 PM 11
domain accounts-drive.com Feb 5, 2020, 2:20:35 PM 8
domain drive-accounts.com Feb 5, 2020, 2:20:35 PM 10
domain service-issues.site Feb 5, 2020, 2:20:35 PM 8
domain two-step-checkup.site Feb 5, 2020, 2:20:35 PM 8
domain customers-activities.site Feb 5, 2020, 2:20:35 PM 11
domain seisolarpros.org Feb 5, 2020, 2:20:35 PM 11
domain yah00.site Feb 5, 2020, 2:20:35 PM 4
domain skynevvs.com Feb 5, 2020, 2:20:35 PM 11
domain recovery-options.site Feb 5, 2020, 2:20:35 PM 4
domain malcolmrifkind.site Feb 5, 2020, 2:20:35 PM 8
domain instagram-com.site Feb 5, 2020, 2:20:35 PM 8
domain leslettrespersanes.net Feb 5, 2020, 2:20:35 PM 11
domain software-updating-managers.site Feb 5, 2020, 2:20:35 PM 8
domain cpanel-services.site Feb 5, 2020, 2:20:35 PM 8
domain service-activity-checkup.site Feb 5, 2020, 2:20:35 PM 7
domain inztaqram.ga Feb 5, 2020, 2:20:35 PM 8
domain unirsd.com Feb 5, 2020, 2:20:35 PM 8
domain phonechallenges-submit.site Feb 5, 2020, 2:20:35 PM 7
domain acconut-verify.com Feb 5, 2020, 2:20:35 PM 11
domain finance-usbnc.info Feb 5, 2020, 2:20:35 PM 8
FileHash-MD5 542128ab98bda5ea139b169200a50bce Feb 5, 2020, 2:20:35 PM 3
FileHash-MD5 3d67ce57aab4f7f917cf87c724ed7dab Feb 5, 2020, 2:20:35 PM 3
hostname x09live-ix3b.account-profile-users.info Feb 6, 2020, 2:56:07 PM 0
hostname www.phonechallenges-submit.site Feb 6, 2020, 2:56:07 PM
[('ipv4', ['1.2.3.4']), ('dns', ['malcolmrifkind.site', 'w3-schools.org', 'niaconucil.org', 'software-updating-managers.site', 'isis-online.net', 'accounts-drive.com', 'cpanel-services.site', 'service-activity-checkup.site', 'service-issues.site', 'recovery-options.site', 'instagram-com.site', 'mobile.phonechallenges-submit.site', 'youtube.service-activity-checkup.site', 'google.drive-accounts.com', 'phonechallenges-submit.site', 'drive-accounts.com', 'www.phonechallenges-submit.site', 'yah00.site', 'seisolarpros.org', 'customers-activities.site', 'bahaius.info', 'system-services.site', 'two-step-checkup.site', 'x09live-ix3b.account-profile-users.info', 'customers-service.ddns.net', 'leslettrespersanes.net', 'www.drive-accounts.com', 'acconut-verify.com', 'finance-usbnc.info', 'unirsd.com', 'skynevvs.com', 'inztaqram.ga']), ('url', ['https://two-step-checkup.site/securemail/secureLogin/challenge/url?ucode=d50a3eb1-9a6b-45a8-8389-d5203bbddaa1&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;service=mailservice&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;type=password']), ('windows_path', ['C:\\Users\\user\\AppData\\Local\\Temp\\bzzzzzz.txt']), ('linux_path', ['//two-step-checkup.site/securemail/secureLogin/challenge/url?ucode=d50a3eb1-9a6b-45a8-8389-d5203bbddaa1&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;service=mailservice&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;type=password\t\tFeb']), ('md5_hash', ['3d67ce57aab4f7f917cf87c724ed7dab', '542128ab98bda5ea139b169200a50bce'])]
# Summarize captured types
print([(ioc, len(matches)) for ioc, matches in ioc_capture.items()])
[('ipv4', 1), ('dns', 32), ('url', 1), ('windows_path', 1), ('linux_path', 1), ('md5_hash', 2)]
%%ioc --ioc_types "ipv4, ipv6, linux_path, md5_hash"
netsh start capture=yes IPv4.Address=1.2.3.4 tracefile=C:\Users\user\AppData\Local\Temp\bzzzzzz.txt tracefile2=/usr/localbzzzzzz.sh
hostname customers-service.ddns.net Feb 5, 2020, 2:20:35 PM 7
URL https://two-step-checkup.site/securemail/secureLogin/challenge/url?ucode=d50a3eb1-9a6b-45a8-8389-d5203bbddaa1&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;service=mailservice&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;type=password Feb 5, 2020, 2:20:35 PM 1
hostname mobile.phonechallenges-submit.site Feb 5, 2020, 2:20:35 PM 8
hostname youtube.service-activity-checkup.site Feb 5, 2020, 2:20:35 PM 8
hostname www.drive-accounts.com Feb 5, 2020, 2:20:35 PM 7
hostname google.drive-accounts.com Feb 5, 2020, 2:20:35 PM 7
domain niaconucil.org Feb 5, 2020, 2:20:35 PM 11
domain isis-online.net Feb 5, 2020, 2:20:35 PM 11
domain bahaius.info Feb 5, 2020, 2:20:35 PM 11
domain w3-schools.org Feb 5, 2020, 2:20:35 PM 12
domain system-services.site Feb 5, 2020, 2:20:35 PM 11
domain accounts-drive.com Feb 5, 2020, 2:20:35 PM 8
domain drive-accounts.com Feb 5, 2020, 2:20:35 PM 10
domain service-issues.site Feb 5, 2020, 2:20:35 PM 8
domain two-step-checkup.site Feb 5, 2020, 2:20:35 PM 8
domain customers-activities.site Feb 5, 2020, 2:20:35 PM 11
domain seisolarpros.org Feb 5, 2020, 2:20:35 PM 11
domain yah00.site Feb 5, 2020, 2:20:35 PM 4
domain skynevvs.com Feb 5, 2020, 2:20:35 PM 11
domain recovery-options.site Feb 5, 2020, 2:20:35 PM 4
domain malcolmrifkind.site Feb 5, 2020, 2:20:35 PM 8
domain instagram-com.site Feb 5, 2020, 2:20:35 PM 8
domain leslettrespersanes.net Feb 5, 2020, 2:20:35 PM 11
domain software-updating-managers.site Feb 5, 2020, 2:20:35 PM 8
domain cpanel-services.site Feb 5, 2020, 2:20:35 PM 8
domain service-activity-checkup.site Feb 5, 2020, 2:20:35 PM 7
domain inztaqram.ga Feb 5, 2020, 2:20:35 PM 8
domain unirsd.com Feb 5, 2020, 2:20:35 PM 8
domain phonechallenges-submit.site Feb 5, 2020, 2:20:35 PM 7
domain acconut-verify.com Feb 5, 2020, 2:20:35 PM 11
domain finance-usbnc.info Feb 5, 2020, 2:20:35 PM 8
FileHash-MD5 542128ab98bda5ea139b169200a50bce Feb 5, 2020, 2:20:35 PM 3
FileHash-MD5 3d67ce57aab4f7f917cf87c724ed7dab Feb 5, 2020, 2:20:35 PM 3
hostname x09live-ix3b.account-profile-users.info Feb 6, 2020, 2:56:07 PM 0
hostname www.phonechallenges-submit.site Feb 6, 2020, 2:56:07 PM
[('ipv4', ['1.2.3.4']), ('linux_path', ['//two-step-checkup.site/securemail/secureLogin/challenge/url?ucode=d50a3eb1-9a6b-45a8-8389-d5203bbddaa1&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;service=mailservice&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;type=password\t\tFeb', '/usr/localbzzzzzz.sh']), ('md5_hash', ['3d67ce57aab4f7f917cf87c724ed7dab', '542128ab98bda5ea139b169200a50bce'])]

Pandas Extension

The IoC extraction functionality is also available in a pandas extension, mp_ioc. The extension has a single method, extract(), which supports the same syntax as extract (described earlier).

process_tree.mp_ioc.extract(columns=['CommandLine'])
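For readers curious how a DataFrame gains an attribute like this, the sketch below registers a toy accessor with pandas' `register_dataframe_accessor`. The accessor name `demo_ioc`, the class `DemoIocAccessor` and the simplified IPv4 pattern are all invented for illustration; this is not how mp_ioc itself is written.

```python
import re
import pandas as pd

# Simplified, hypothetical IPv4 pattern (not the one msticpy ships)
IPV4 = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")

@pd.api.extensions.register_dataframe_accessor("demo_ioc")  # hypothetical accessor name
class DemoIocAccessor:
    """Toy accessor that exposes an extract() method on every DataFrame."""

    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def extract(self, columns):
        """Search the given columns and return one row per match."""
        records = []
        for column in columns:
            for index, text in self._df[column].items():
                for value in IPV4.findall(str(text)):
                    records.append(
                        {"IoCType": "ipv4", "Observable": value, "SourceIndex": index}
                    )
        return pd.DataFrame(records, columns=["IoCType", "Observable", "SourceIndex"])

# Invented sample data
df = pd.DataFrame({"CommandLine": ["ping 10.0.0.1", "notepad.exe"]})
found = df.demo_ioc.extract(columns=["CommandLine"])
print(found)
```

An accessor keeps the call site short because the DataFrame being searched is implied rather than passed as a data argument.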