GitHub Repository: jupyter-naas/awesome-notebooks
Path: blob/master/Advertools/Advertools_Check_website_pages_status_code.ipynb
Kernel: Python 3


Advertools - Check website pages status code

Give Feedback | Bug report

Tags: #advertools #website #status #code #check #pages

Last update: 2023-08-04 (Created: 2023-08-04)

Description: This notebook crawls your website and checks the status code of all pages. It starts from the home page and discovers URLs by following links within the website. It is a useful tool for quickly checking the status of your website and generating a report to take necessary actions.
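At its core, the workflow below boils down to two steps: crawl the site, then inspect the status column of the results. A minimal sketch of that idea, with illustrative file names:

import advertools as adv
import pandas as pd

# Crawl a site, following internal links; results are written as JSON lines
adv.crawl("https://example.com/", "example_crawl.jl", follow_links=True)

# Load the crawl and count pages per HTTP status code
crawl = pd.read_json("example_crawl.jl", lines=True)
print(crawl["status"].value_counts())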

Input

Install libraries

If you are running this notebook on naas, run the code below to uninstall conflicting packages (a workaround for a known bug) and install the required libraries.

# !pip uninstall -y scrapy attrs
# !pip install advertools adviz pandas==1.5.3 --user

Import libraries

try:
    import advertools as adv
except ModuleNotFoundError:
    !pip install advertools --user
    import advertools as adv
from datetime import datetime
import naas
from naas_drivers import emailbuilder, naasauth
import plotly.express as px
import pandas as pd
try:
    import adviz
except ModuleNotFoundError:
    !pip install adviz --user
    import adviz
import os

pd.options.display.max_columns = None

Setup variables

Mandatory

  • website_url: URL of the website page to check

  • cron: We use CRON tasks to schedule notebooks; find the syntax you need on https://crontab.guru/ (see the examples after this list).

  • email_to: Represents the recipient(s) of the email. By default, your email account on naas will be set.
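For reference, here are a few common CRON expressions. This is an illustrative sketch; verify any expression on https://crontab.guru/ before using it.

# Common CRON expressions, in "minute hour day-of-month month day-of-week" order
cron_examples = {
    "0 0 * * *": "every day at 00:00",
    "0 9 * * 1": "every Monday at 09:00",
    "*/30 * * * *": "every 30 minutes",
    "0 8 1 * *": "on the 1st of each month at 08:00",
}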

Optional

  • output_dir: Represents the output directory for the website crawl.

  • timestamp: Represents the timestamp when the code is executed.

  • output_website_crawl: Represents the output file name for the website crawl.

  • output_website_crawl_log: Represents the output file name for the log file of the website crawl.

  • output_status_code_ko: Represents the output file name for the status code report.

  • subject: Represents the subject line for the email.

# Mandatory
website_url = "https://example.com/"
cron = "0 0 * * *"  # This notebook will run every day at 0:00
email_to = [naasauth.connect().user.me().get("username")]

# Optional
output_dir = website_url.split("https://")[-1].split("/")[0]  # e.g. "example.com"
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
output_website_crawl = f"{timestamp}_website_crawl.jl"
output_website_crawl_log = f"{timestamp}_website_crawl.log"
output_status_code_ko = f"{timestamp}_status_code_ko.csv"
subject = f"Status code report: {website_url} as of {datetime.now().strftime('%Y-%m-%d')}"

Model

Define output paths

Create the output directory and define paths for the output files.

# Check if the output directory exists and create it if not
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Generate output file paths
output_website_crawl_path = os.path.join(output_dir, output_website_crawl)
output_website_crawl_log_path = os.path.join(output_dir, output_website_crawl_log)
output_status_code_ko_path = os.path.join(output_dir, output_status_code_ko)

Crawl website

crawl_params = dict(
    url_list=website_url,
    output_file=output_website_crawl_path,  # has to end with .jl
    follow_links=True,  # the default is False
    allowed_domains=None,
    exclude_url_params=None,
    include_url_params=None,
    exclude_url_regex=None,
    include_url_regex=None,
    css_selectors=None,
    xpath_selectors=None,
    custom_settings={
        "LOG_FILE": output_website_crawl_log_path,
        "CLOSESPIDER_PAGECOUNT": 0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DEFAULT_REQUEST_HEADERS": {},
        "DEPTH_LIMIT": 0,
        "USER_AGENT": adv.spider.user_agent,
    },
)
adv.crawl(**crawl_params)
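In the settings above, CLOSESPIDER_PAGECOUNT is 0, which means no page limit. For a first test run on a large site, you may want to cap the number of pages crawled. A minimal sketch reusing crawl_params; the 50-page limit and the commented-out call are illustrative:

# Hypothetical test run: stop the crawl after 50 pages
# (CLOSESPIDER_PAGECOUNT is a standard Scrapy setting; 0 means unlimited)
test_settings = dict(crawl_params["custom_settings"], CLOSESPIDER_PAGECOUNT=50)
test_params = dict(crawl_params, custom_settings=test_settings)
# adv.crawl(**test_params)  # note: writes to the same output file as the full crawl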

Read crawl DataFrame

crawl_df = pd.read_json(output_website_crawl_path, lines=True)
print("Rows fetched:", len(crawl_df))
crawl_df.head(1)
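If the row count looks lower than expected, the Scrapy log file written during the crawl can help diagnose blocked or failed requests. A minimal sketch, assuming your advertools version provides crawllogs_to_df:

# Parse the crawl log into a DataFrame to inspect errors and retries
# (crawllogs_to_df is assumed available in recent advertools releases)
logs_df = adv.crawllogs_to_df(output_website_crawl_log_path)
logs_df.head()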

Create DataFrame on status code KO

# Keep only pages whose status code is outside the 200-203 success range
df_ko = crawl_df[~crawl_df["status"].isin([200, 201, 202, 203])].reset_index(drop=True)
print("Status code KO:", len(df_ko))
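Since plotly.express is already imported, a quick look at the full status code distribution can put the KO count in context. A minimal sketch; the chart title is illustrative:

# Visualize how many crawled pages returned each HTTP status code
fig = px.histogram(crawl_df, x="status", title=f"Status codes: {website_url}")
fig.show()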

Create email content

total_urls = len(crawl_df)
status_ko = len(df_ko)
status_ok = total_urls - status_ko

email_content = {
    "text1": emailbuilder.text("Dear Team,"),
    "text2": emailbuilder.text(f"We are sharing the status code report for the recent analysis of {website_url}:"),
    "list": emailbuilder.list(
        [
            f"Total OK ✅: {status_ok}",
            f"Total KO ❌: {status_ko}",
        ]
    ),
    "text3": emailbuilder.text("Please find attached the detailed report of the failed status codes."),
    "text4": emailbuilder.text("Please take appropriate actions to address these issues."),
    "text5": emailbuilder.text("Thank you!"),
    "text6": emailbuilder.text("Best regards,"),
}

content = emailbuilder.generate(display="iframe", **email_content)

Output

Save report in CSV

df_ko.to_csv(output_status_code_ko_path, index=False)

Send notification

if len(email_to) > 0:
    naas.notification.send(email_to, subject, content, files=[output_status_code_ko_path])

Schedule notebook

naas.scheduler.add(cron=cron)

# To stop the scheduled run, uncomment the line below and execute the cell
# naas.scheduler.delete()