GitHub Repository: jupyter-naas/awesome-notebooks
Path: blob/master/Advertools/Advertools_Check_status_code_in_bulk.ipynb
Kernel: Python 3


Advertools - Check status code in bulk

Give Feedback | Bug report

Tags: #advertools #adviz #website #analyze #audit #seo #status_code #response_headers

Author: Elias Dabbas

Last update: 2023-07-31 (Created: 2023-07-20)

Description: This notebook runs an automated status code checker that also retrieves response headers, using the HTTP HEAD method.

  • Bulk and concurrent checking of status codes for a known list of URLs

  • Get all available response headers from all URLs

  • Set the crawl speed, the number of concurrent requests, and various other crawling options

  • Does NOT download the full HTML of a page, saving a lot of time, energy, and resources, and enabling an extremely fast and light process (a minimal single-URL illustration follows this list)
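
To make the HEAD-vs-GET distinction concrete, here is a minimal single-URL sketch using the requests library (not used by this notebook, which relies on advertools/Scrapy for the concurrent crawl):

import requests

# HEAD returns the status code and response headers only; no HTML body
# is downloaded, which is what makes bulk checking fast and light.
response = requests.head("https://www.naas.ai/", allow_redirects=True, timeout=10)
print(response.status_code)    # e.g. 200
print(dict(response.headers))  # all response headers for this URL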

Input

Install libraries

If you are running this on naas, run the cell below to first uninstall conflicting packages (a known bug) and then install the required libraries

# !pip uninstall -y scrapy attrs
# !pip install advertools adviz pandas==1.5.3 --user

Import libraries

try:
    import adviz
except ModuleNotFoundError:
    !pip install adviz --user
    import adviz
try:
    import advertools as adv
except ModuleNotFoundError:
    !pip install advertools --user
    import advertools as adv
from datetime import datetime
import plotly.express as px
import pandas as pd

pd.options.display.max_columns = None

Setup Variables

  • url_list: List of URLs whose status codes will be checked

  • CONCURRENT_REQUESTS_PER_DOMAIN: Defaults to 8. Most likely you will need to slow it down, because the crawling is extremely fast.

  • DEFAULT_REQUEST_HEADERS: A dictionary where you can set custom request headers.

  • USER_AGENT: Set a different/custom user agent if you need to. By default, we are using the one in advertools.

  • AUTOTHROTTLE_ENABLED: Whether or not to dynamically adjust the crawl pace to be as gentle on servers as possible. It is usually good to enable this extension, as this type of crawling is extremely fast, often leading to 429 (Too Many Requests) responses, or to being blocked. See the optional tuning sketch after this list.

  • LOG_FILE: Log file path. Please update it every time you check a new list.

  • output_file: The path of the file where the output is saved. It has to be in the '.jl' (JSON lines) format. Note that new lines are appended to the end of the file; the file is not overwritten while crawling.
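
The AUTOTHROTTLE_ENABLED flag above maps to Scrapy's AutoThrottle extension, and custom_settings (built in the Model section below) is passed through to Scrapy, so any standard Scrapy setting can be added. A sketch of a few common throttling knobs, with illustrative values only:

# Optional extra throttling settings (standard Scrapy settings;
# the values below are examples, not recommendations).
extra_throttle_settings = {
    'AUTOTHROTTLE_START_DELAY': 5,           # initial download delay in seconds
    'AUTOTHROTTLE_MAX_DELAY': 60,            # upper bound when servers respond slowly
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 1.0,  # average parallel requests per server
    'DOWNLOAD_DELAY': 0.5,                   # fixed minimum delay between requests
}
# These could be merged into custom_settings before crawling, e.g.:
# custom_settings.update(extra_throttle_settings)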

# Inputs
url_list = [
    "https://www.naas.ai/",
    "https://www.naas.ai/free-forever",
    "https://app.naas.ai/user/[email protected]",
    "https://www.naas.ai/test",
]
domain = 'naas.ai'
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DEFAULT_REQUEST_HEADERS = {}
USER_AGENT = adv.spider.user_agent
AUTOTHROTTLE_ENABLED = True
LOG_FILE = 'crawl_headers_output.log'

# Outputs
output_file = f'{datetime.now().strftime("%Y%m%d%H%M%S")}_crawl_headers_output.jl'

Model

Check status codes and retrieve response headers

# Create custom settings dict
custom_settings = {
    # optionally change request headers:
    'DEFAULT_REQUEST_HEADERS': DEFAULT_REQUEST_HEADERS,
    'AUTOTHROTTLE_ENABLED': AUTOTHROTTLE_ENABLED,
    'CONCURRENT_REQUESTS_PER_DOMAIN': CONCURRENT_REQUESTS_PER_DOMAIN,
    'USER_AGENT': USER_AGENT,
    'LOG_FILE': LOG_FILE,
}

# Crawl headers
adv.crawl_headers(
    url_list=url_list,
    output_file=output_file,
    custom_settings=custom_settings
)

# Open the crawl output file
headers_df = pd.read_json(output_file, lines=True)
headers_df
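
Because the '.jl' output file is appended to (see the output_file note above), re-running the crawl into the same file leaves one row per request per run. A small sketch to keep only the most recent row per URL, relying on rows being appended in crawl order:

# Keep only the latest row for each URL across repeated runs.
latest_df = headers_df.drop_duplicates(subset='url', keep='last')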

Output

Visualize Status codes OK and KO

adviz.status_codes(
    headers_df['status'],
    height=400
)
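
As a tabular companion to the chart, a quick sketch listing the URLs whose final status is not 200 (column names as in headers_df above):

# URLs that did not return 200 OK, for a closer look.
not_ok_df = headers_df.loc[headers_df['status'].ne(200), ['url', 'status']]
not_ok_df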

Visualize status codes count

adviz.value_counts_plus(
    headers_df['status'],
    name='Status codes',
    size=14
)

Visualize URL structure

adviz.url_structure(
    headers_df['url'],
    domain=domain,
    items_per_level=50,
    theme='plotly_dark'
)

Get redirects

if 'redirect_urls' in headers_df:
    redirect_df = headers_df.filter(regex='^url$|redirect_').dropna(thresh=4)
    # advertools joins multiple values in one cell with '@@'; split them back into lists
    redirect_df['redirect_urls'] = redirect_df['redirect_urls'].str.split('@@')
    redirect_df['redirect_reasons'] = redirect_df['redirect_reasons'].astype(int).astype(str).str.split('@@')
    redirect_df['redirect_chain'] = (redirect_df['redirect_urls']
                                     .str.join('@@')
                                     .add('@@')
                                     .add(redirect_df['url'])
                                     .str.split('@@'))
    from IPython.display import display
    display(redirect_df)
else:
    print('No redirects found in this dataset')
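
If redirects were found, a short follow-up sketch (reusing the redirect_df built above, so it only applies in that branch) counts how many hops each URL took before its final response:

# Distribution of redirect chain lengths (number of URLs per chain).
redirect_df['redirect_chain'].str.len().value_counts()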

Visualize download latency

The same chart can be used for any other float column in headers_df, if available.

fig = px.histogram(
    x=headers_df['download_latency'],
    template='none',
    labels={'x': 'Latency (seconds)<br>'},
    height=500,
    # experiment with different values for nbins:
    # nbins=10,
    title='<b>Download latency distribution</b>')
fig.add_annotation(
    yref='y domain',
    xref='paper',
    x=0,
    y=-0.25,
    showarrow=False,
    align='right',
    text='<b>Latency:</b> time elapsed between establishing the TCP connection and receiving the HTTP headers.')
fig
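
A numeric summary is a useful complement to the histogram; this one-liner uses pandas' built-in describe:

# Count, mean, quartiles, and extremes of the latency column.
headers_df['download_latency'].describe()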

Create helper visualizations

def ecdf(df, column, template='none', width=None, height=500):
    df = df.sort_values(column)
    df['count below'] = range(len(df))
    df['count above'] = range(len(df)-1, -1, -1)
    df['total count'] = len(df)
    fig = px.ecdf(
        df.dropna(subset=[column]),
        x=column,
        markers=True,
        lines=False,
        template=template,
        height=height,
        width=width,
        hover_data=['count below', 'count above', 'total count'],
        title=f"Cumulative distribution of <b>{column.replace('_', ' ')}</b>",
        ecdfnorm='percent')
    fig.data[0].hovertemplate = (
        f'<b>{column.replace("_", " ").title()}</b><br><br>{column}'
        ': %{x}<br>percent: %{y}<br><br>'
        'count below: %{customdata[0]:,}<br>'
        'count above: %{customdata[1]:,}<br>'
        'total count: %{customdata[2]:,}<extra></extra>')
    fig.layout.yaxis.ticksuffix = '%'
    fig.layout.yaxis.showspikes = True
    fig.layout.xaxis.showspikes = True
    return fig

ecdf(df=headers_df, column='download_latency')
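
Since the helper accepts any column name, a small sketch (assuming numeric dtypes survive the JSON-lines round trip) draws one ECDF per numeric column in headers_df:

# One cumulative-distribution chart for every numeric column.
for col in headers_df.select_dtypes('number').columns:
    ecdf(df=headers_df, column=col).show()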