GitHub Repository: jupyter-naas/awesome-notebooks
Path: blob/master/Advertools/Advertools_Check_status_code_in_bulk.ipynb
Kernel: Python 3


Advertools - Check status code in bulk

Give Feedback | Bug report

Tags: #advertools #adviz #website #analyze #audit #seo #status_code #response_headers

Author: Elias Dabbas

Last update: 2023-07-31 (Created: 2023-07-20)

Description: This notebook runs an automated status code checker that also retrieves response headers, using the HTTP HEAD method.

  • Bulk and concurrent checking of status codes for a known list of URLs

  • Get all available response headers from all URLs

  • Set the crawl speed, the number of concurrent requests, and various other crawling options

  • Does NOT download the full HTML of a page, saving a lot of time, energy, and resources, and enabling an extremely fast and light process (a minimal single-URL illustration follows this list)
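
To make the HEAD-vs-GET distinction concrete, here is a minimal single-URL sketch using the requests library (not used by this notebook, which relies on advertools/Scrapy for the concurrent crawl):

import requests

# HEAD returns the status code and response headers only; no HTML body
# is downloaded, which is what makes bulk checking fast and light.
response = requests.head("https://www.naas.ai/", allow_redirects=True, timeout=10)
print(response.status_code)    # e.g. 200
print(dict(response.headers))  # all response headers for this URL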

Input

Install libraries

If you are running this on naas, run the cell below to first uninstall conflicting packages (a known bug) and then install the required libraries

# !pip uninstall -y scrapy attrs
# !pip install advertools adviz pandas==1.5.3 --user

Import libraries

try:
    import adviz
except ModuleNotFoundError:
    !pip install adviz --user
    import adviz
try:
    import advertools as adv
except ModuleNotFoundError:
    !pip install advertools --user
    import advertools as adv
from datetime import datetime
import plotly.express as px
import pandas as pd

pd.options.display.max_columns = None

Setup Variables

  • url_list: List of URLs whose status codes will be checked

  • CONCURRENT_REQUESTS_PER_DOMAIN: Defaults to 8. Most likely you will need to slow it down, because the crawling is extremely fast.

  • DEFAULT_REQUEST_HEADERS: A dictionary where you can set custom request headers.

  • USER_AGENT: Set a different/custom user agent if you need to. By default, we are using the one in advertools.

  • AUTOTHROTTLE_ENABLED: Whether or not to dynamically adjust the crawl pace to be as gentle on servers as possible. It is usually good to enable this extension, as this type of crawling is extremely fast, often leading to 429 (Too Many Requests) responses, or to being blocked. See the optional tuning sketch after this list.

  • LOG_FILE: Log file path. Please update it every time you check a new list.

  • output_file: The path of the file where the output is saved. It has to be in the '.jl' (JSON lines) format. Note that new lines are appended to the end of the file; the file is not overwritten while crawling.
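
The AUTOTHROTTLE_ENABLED flag above maps to Scrapy's AutoThrottle extension, and custom_settings (built in the Model section below) is passed through to Scrapy, so any standard Scrapy setting can be added. A sketch of a few common throttling knobs, with illustrative values only:

# Optional extra throttling settings (standard Scrapy settings;
# the values below are examples, not recommendations).
extra_throttle_settings = {
    'AUTOTHROTTLE_START_DELAY': 5,           # initial download delay in seconds
    'AUTOTHROTTLE_MAX_DELAY': 60,            # upper bound when servers respond slowly
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 1.0,  # average parallel requests per server
    'DOWNLOAD_DELAY': 0.5,                   # fixed minimum delay between requests
}
# These could be merged into custom_settings before crawling, e.g.:
# custom_settings.update(extra_throttle_settings)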

# Inputs
url_list = [
    "https://www.naas.ai/",
    "https://www.naas.ai/free-forever",
    "https://app.naas.ai/user/[email protected]",
    "https://www.naas.ai/test",
]
domain = 'naas.ai'
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DEFAULT_REQUEST_HEADERS = {}
USER_AGENT = adv.spider.user_agent
AUTOTHROTTLE_ENABLED = True
LOG_FILE = 'crawl_headers_output.log'

# Outputs
output_file = f'{datetime.now().strftime("%Y%m%d%H%M%S")}_crawl_headers_output.jl'

Model

Check status codes and retrieve response headers

# Create custom settings dict
custom_settings = {
    # optionally change request headers:
    'DEFAULT_REQUEST_HEADERS': DEFAULT_REQUEST_HEADERS,
    'AUTOTHROTTLE_ENABLED': AUTOTHROTTLE_ENABLED,
    'CONCURRENT_REQUESTS_PER_DOMAIN': CONCURRENT_REQUESTS_PER_DOMAIN,
    'USER_AGENT': USER_AGENT,
    'LOG_FILE': LOG_FILE,
}

# Crawl headers
adv.crawl_headers(
    url_list=url_list,
    output_file=output_file,
    custom_settings=custom_settings
)

# Open the crawl output file
headers_df = pd.read_json(output_file, lines=True)
headers_df
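
Because the '.jl' output file is appended to (see the output_file note above), re-running the crawl into the same file leaves one row per request per run. A small sketch to keep only the most recent row per URL, relying on rows being appended in crawl order:

# Keep only the latest row for each URL across repeated runs.
latest_df = headers_df.drop_duplicates(subset='url', keep='last')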

Output

Visualize Status codes OK and KO

adviz.status_codes(
    headers_df['status'],
    height=400
)
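
As a tabular companion to the chart, a quick sketch listing the URLs whose final status is not 200 (column names as in headers_df above):

# URLs that did not return 200 OK, for a closer look.
not_ok_df = headers_df.loc[headers_df['status'].ne(200), ['url', 'status']]
not_ok_df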

Visualize status codes count

adviz.value_counts_plus(
    headers_df['status'],
    name='Status codes',
    size=14
)

Visualize URL structure

adviz.url_structure(
    headers_df['url'],
    domain=domain,
    items_per_level=50,
    theme='plotly_dark'
)

Get redirects

if 'redirect_urls' in headers_df:
    redirect_df = headers_df.filter(regex='^url$|redirect_').dropna(thresh=4)
    # advertools joins multiple values in one cell with '@@'; split them back into lists
    redirect_df['redirect_urls'] = redirect_df['redirect_urls'].str.split('@@')
    redirect_df['redirect_reasons'] = redirect_df['redirect_reasons'].astype(int).astype(str).str.split('@@')
    redirect_df['redirect_chain'] = (redirect_df['redirect_urls']
                                     .str.join('@@')
                                     .add('@@')
                                     .add(redirect_df['url'])
                                     .str.split('@@'))
    from IPython.display import display
    display(redirect_df)
else:
    print('No redirects found in this dataset')
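
If redirects were found, a short follow-up sketch (reusing the redirect_df built above, so it only applies in that branch) counts how many hops each URL took before its final response:

# Distribution of redirect chain lengths (number of URLs per chain).
redirect_df['redirect_chain'].str.len().value_counts()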

Visualize download latency

The same chart can be used for any other float column in headers_df, if available.

fig = px.histogram(
    x=headers_df['download_latency'],
    template='none',
    labels={'x': 'Latency (seconds)<br>'},
    height=500,
    # experiment with different values for nbins:
    # nbins=10,
    title='<b>Download latency distribution</b>')
fig.add_annotation(
    yref='y domain',
    xref='paper',
    x=0,
    y=-0.25,
    showarrow=False,
    align='right',
    text='<b>Latency:</b> time elapsed between establishing the TCP connection and receiving the HTTP headers.')
fig
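
A numeric summary is a useful complement to the histogram; this one-liner uses pandas' built-in describe:

# Count, mean, quartiles, and extremes of the latency column.
headers_df['download_latency'].describe()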

Create helper visualizations

def ecdf(df, column, template='none', width=None, height=500):
    df = df.sort_values(column)
    df['count below'] = range(len(df))
    df['count above'] = range(len(df)-1, -1, -1)
    df['total count'] = len(df)
    fig = px.ecdf(
        df.dropna(subset=[column]),
        x=column,
        markers=True,
        lines=False,
        template=template,
        height=height,
        width=width,
        hover_data=['count below', 'count above', 'total count'],
        title=f"Cumulative distribution of <b>{column.replace('_', ' ')}</b>",
        ecdfnorm='percent')
    fig.data[0].hovertemplate = (
        f'<b>{column.replace("_", " ").title()}</b><br><br>{column}'
        ': %{x}<br>percent: %{y}<br><br>'
        'count below: %{customdata[0]:,}<br>'
        'count above: %{customdata[1]:,}<br>'
        'total count: %{customdata[2]:,}<extra></extra>')
    fig.layout.yaxis.ticksuffix = '%'
    fig.layout.yaxis.showspikes = True
    fig.layout.xaxis.showspikes = True
    return fig

ecdf(df=headers_df, column='download_latency')
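
Since the helper accepts any column name, a small sketch (assuming numeric dtypes survive the JSON-lines round trip) draws one ECDF per numeric column in headers_df:

# One cumulative-distribution chart for every numeric column.
for col in headers_df.select_dtypes('number').columns:
    ecdf(df=headers_df, column=col).show()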