Advertools - Check status code in bulk
Tags: #advertools #adviz #website #analyze #audit #seo #status_code #response_headers
Author: Elias Dabbas
Last update: 2023-07-31 (Created: 2023-07-20)
Description: This notebook runs an automated status code checker with response headers using the HTTP HEAD method.
Bulk and concurrent checking of status codes for a known list of URLs
Get all available response headers from all URLs
Set the speed, the number of concurrent requests, and various other crawling options
Does NOT download the full HTML of a page, which saves a lot of time, energy, and resources and makes the process extremely fast and light
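To make the HEAD-method point concrete, here is a minimal standard-library sketch of a single status check. It is not how advertools does it internally (advertools crawls concurrently); the function names and the user agent string are hypothetical.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_head_request(url, user_agent="status-checker-sketch"):
    """Build a HEAD request: the server is asked for the status and headers only."""
    return Request(url, method="HEAD", headers={"User-Agent": user_agent})


def head_status(url, timeout=10.0):
    """Return (status_code, headers_dict) for one URL using a HEAD request."""
    try:
        # urlopen follows redirects, so this reports the final response.
        with urlopen(build_head_request(url), timeout=timeout) as response:
            return response.status, dict(response.headers.items())
    except HTTPError as error:
        # 4xx/5xx responses are raised as exceptions but still carry a code and headers.
        return error.code, dict(error.headers.items())
```

Calling `head_status("https://example.com/")` would return the status code and a dictionary of response headers for that one URL. Checking a long list this way is sequential and slow, which is the gap the bulk, concurrent checker in this notebook fills.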
Input
Install libraries
If running it on naas, run the code below to uninstall (bug) and install the libraries
Import libraries
Setup Variables
url_list: List of URLs whose status codes you want to check.
CONCURRENT_REQUESTS_PER_DOMAIN: Defaults to 8. You will most likely need to lower it, because the crawling is extremely fast.
DEFAULT_REQUEST_HEADERS: A dictionary where you can set custom request headers.
USER_AGENT: Set a different/custom user agent if you need to. By default, the advertools user agent is used.
AUTOTHROTTLE_ENABLED: Whether or not to dynamically adjust the crawling pace to be as gentle on servers as possible. It's usually good to enable this extension, because this type of crawling is extremely fast and often leads to 429 (Too Many Requests) responses or to being blocked.
LOG_FILE: Log file path. Update it every time you check a new list.
output_file: The path of the file where the output is saved. It must have the '.jl' extension. Note that new lines are appended to the end of the file; the file is not overwritten while crawling.
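As a sketch, the variables described above could be set up as follows. Every value here (URLs, file names, header values, user agent string) is an illustrative placeholder rather than the notebook's actual input, and `validate_output_path` is a hypothetical helper added only to make the '.jl' requirement explicit.

```python
# Placeholder inputs: replace with your own URLs and file paths.
url_list = [
    "https://example.com/",
    "https://example.com/missing-page",
]

output_file = "status_codes_output.jl"  # must use the .jl (JSON lines) extension

custom_settings = {
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # lowered from the default of 8 to slow down
    "DEFAULT_REQUEST_HEADERS": {"Accept-Language": "en"},  # example custom header
    "USER_AGENT": "my-status-checker/0.1",  # hypothetical; omit to keep the default
    "AUTOTHROTTLE_ENABLED": True,  # adapt the pace to how servers respond
    "LOG_FILE": "status_codes_check.log",  # use a new log file per URL list
}


def validate_output_path(path):
    """Fail early if the output path does not end in '.jl'."""
    if not path.endswith(".jl"):
        raise ValueError(f"output_file must end with '.jl', got: {path!r}")
    return path


output_file = validate_output_path(output_file)
```

Because results are appended rather than overwritten, reusing the same `output_file` for a second list mixes both runs in one file, so a fresh path per list is usually the simpler choice.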
Model
Check status codes and retrieve response headers
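The original code cell for this step is not shown here, so the following is a hedged sketch of what it might look like, assuming the `advertools` package is installed. `adv.crawl_headers` is the advertools function for HEAD-only crawling; the wrapper `check_status_codes` and the reader `load_json_lines` are hypothetical helpers.

```python
import json


def check_status_codes(url_list, output_file, custom_settings=None):
    """Run the HEAD-only crawl; results are appended to output_file."""
    # Third-party import kept inside the function so the reader below
    # can be used without advertools installed.
    import advertools as adv

    adv.crawl_headers(url_list, output_file=output_file, custom_settings=custom_settings)


def load_json_lines(path):
    """Read a .jl file into a list of dicts, one per non-empty line."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows
```

After calling `check_status_codes(url_list, output_file, custom_settings)`, the results can be read with `load_json_lines(output_file)`. With pandas installed, `headers_df = pd.read_json(output_file, lines=True)` loads the same file into a DataFrame.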
Output
Visualize Status codes OK and KO
Visualize status codes count
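Before charting, the counts behind both status code views can be computed with a few lines. This is a dependency-free sketch that assumes each result row is a dict with a `status` key, and that "OK" means a 200 response; both are assumptions, so adjust `ok_statuses` to your own definition.

```python
from collections import Counter


def count_status_codes(rows):
    """Count how many rows returned each status code."""
    return Counter(row.get("status") for row in rows)


def split_ok_ko(rows, ok_statuses=(200,)):
    """Split rows into OK and KO counts (assumption: only 200 counts as OK)."""
    ok = sum(1 for row in rows if row.get("status") in ok_statuses)
    return {"OK": ok, "KO": len(rows) - ok}
```

The resulting counts can be passed to any charting library as the values of a bar chart.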
Visualize URL structure
Get redirects
Visualize download latency
The same charts can be used for any other float columns in headers_df, if available.
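To find which other columns qualify, one option is to scan the results for keys whose values are all floats. The sketch below works on a list of row dicts and uses only the standard library; the helper names are hypothetical, and the `download_latency` key in the usage note is an assumption about the output schema.

```python
from statistics import median


def float_columns(rows):
    """Return the sorted names of keys whose non-missing values are all floats."""
    all_float = {}
    for row in rows:
        for key, value in row.items():
            if value is None:
                continue  # missing values do not disqualify a column
            all_float[key] = all_float.get(key, True) and isinstance(value, float)
    return sorted(key for key, ok in all_float.items() if ok)


def summarize(rows, column):
    """Basic summary of one float column, ignoring rows where it is missing."""
    values = [row[column] for row in rows if isinstance(row.get(column), float)]
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
    }
```

For example, `summarize(rows, "download_latency")` gives a quick numeric view of the latency distribution before plotting it. With a pandas DataFrame, `headers_df.select_dtypes(include="float")` selects the same kind of columns directly.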