GitHub Repository: jupyter-naas/awesome-notebooks
Path: blob/master/Advertools/Advertools_Check_website_pages_status_code.ipynb
Kernel: Python 3


Advertools - Check website pages status code

Give Feedback | Bug report

Tags: #advertools #website #status #code #check #pages

Last update: 2023-08-04 (Created: 2023-08-04)

Description: This notebook crawls your website and checks the status code of all pages. It starts from the home page and discovers URLs by following links within the website. It is a useful tool for quickly checking the status of your website and generating a report to take necessary actions.
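At its core, the workflow below boils down to two steps: crawl the site, then inspect the status column of the results. A minimal sketch of that idea, with illustrative file names:

import advertools as adv
import pandas as pd

# Crawl a site, following internal links; results are written as JSON lines
adv.crawl("https://example.com/", "example_crawl.jl", follow_links=True)

# Load the crawl and count pages per HTTP status code
crawl = pd.read_json("example_crawl.jl", lines=True)
print(crawl["status"].value_counts())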

Input

Install libraries

If you are running this notebook on naas, run the code below to uninstall conflicting packages (a workaround for a known bug) and install the required libraries.

# !pip uninstall -y scrapy attrs
# !pip install advertools adviz pandas==1.5.3 --user

Import libraries

try:
    import advertools as adv
except ModuleNotFoundError:
    !pip install advertools --user
    import advertools as adv
from datetime import datetime
import naas
from naas_drivers import emailbuilder, naasauth
import plotly.express as px
import pandas as pd
try:
    import adviz
except ModuleNotFoundError:
    !pip install adviz --user
    import adviz
import os

pd.options.display.max_columns = None

Setup variables

Mandatory

  • website_url: URL of the website page to check

  • cron: We use CRON tasks to schedule notebooks; find the syntax you need on https://crontab.guru/ (see the examples after this list).

  • email_to: Represents the recipient(s) of the email. By default, your email account on naas will be set.
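For reference, here are a few common CRON expressions. This is an illustrative sketch; verify any expression on https://crontab.guru/ before using it.

# Common CRON expressions, in "minute hour day-of-month month day-of-week" order
cron_examples = {
    "0 0 * * *": "every day at 00:00",
    "0 9 * * 1": "every Monday at 09:00",
    "*/30 * * * *": "every 30 minutes",
    "0 8 1 * *": "on the 1st of each month at 08:00",
}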

Optional

  • output_dir: Represents the output directory for the website crawl.

  • timestamp: Represents the timestamp when the code is executed.

  • output_website_crawl: Represents the output file name for the website crawl.

  • output_website_crawl_log: Represents the output file name for the log file of the website crawl.

  • output_status_code_ko: Represents the output file name for the status code report.

  • subject: Represents the subject line for the email.

# Mandatory
website_url = "https://example.com/"
cron = "0 0 * * *"  # This notebook will run every day at 0:00
email_to = [naasauth.connect().user.me().get("username")]

# Optional
output_dir = website_url.split("https://")[-1].split("/")[0]  # e.g. "example.com"
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
output_website_crawl = f"{timestamp}_website_crawl.jl"
output_website_crawl_log = f"{timestamp}_website_crawl.log"
output_status_code_ko = f"{timestamp}_status_code_ko.csv"
subject = f"Status code report: {website_url} as of {datetime.now().strftime('%Y-%m-%d')}"

Model

Define output paths

Create the output directory and define paths for the output files.

# Check if the output directory exists and create it if not
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Generate output file paths
output_website_crawl_path = os.path.join(output_dir, output_website_crawl)
output_website_crawl_log_path = os.path.join(output_dir, output_website_crawl_log)
output_status_code_ko_path = os.path.join(output_dir, output_status_code_ko)

Crawl website

crawl_params = dict(
    url_list=website_url,
    output_file=output_website_crawl_path,  # has to end with .jl
    follow_links=True,  # the default is False
    allowed_domains=None,
    exclude_url_params=None,
    include_url_params=None,
    exclude_url_regex=None,
    include_url_regex=None,
    css_selectors=None,
    xpath_selectors=None,
    custom_settings={
        "LOG_FILE": output_website_crawl_log_path,
        "CLOSESPIDER_PAGECOUNT": 0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DEFAULT_REQUEST_HEADERS": {},
        "DEPTH_LIMIT": 0,
        "USER_AGENT": adv.spider.user_agent,
    },
)
adv.crawl(**crawl_params)
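In the settings above, CLOSESPIDER_PAGECOUNT is 0, which means no page limit. For a first test run on a large site, you may want to cap the number of pages crawled. A minimal sketch reusing crawl_params; the 50-page limit and the commented-out call are illustrative:

# Hypothetical test run: stop the crawl after 50 pages
# (CLOSESPIDER_PAGECOUNT is a standard Scrapy setting; 0 means unlimited)
test_settings = dict(crawl_params["custom_settings"], CLOSESPIDER_PAGECOUNT=50)
test_params = dict(crawl_params, custom_settings=test_settings)
# adv.crawl(**test_params)  # note: writes to the same output file as the full crawl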

Read crawl DataFrame

crawl_df = pd.read_json(output_website_crawl_path, lines=True)
print("Rows fetched:", len(crawl_df))
crawl_df.head(1)
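If the row count looks lower than expected, the Scrapy log file written during the crawl can help diagnose blocked or failed requests. A minimal sketch, assuming your advertools version provides crawllogs_to_df:

# Parse the crawl log into a DataFrame to inspect errors and retries
# (crawllogs_to_df is assumed available in recent advertools releases)
logs_df = adv.crawllogs_to_df(output_website_crawl_log_path)
logs_df.head()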

Create DataFrame on status code KO

# Keep only pages whose status code is outside the 200-203 success range
df_ko = crawl_df[~crawl_df["status"].isin([200, 201, 202, 203])].reset_index(drop=True)
print("Status code KO:", len(df_ko))
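Since plotly.express is already imported, a quick look at the full status code distribution can put the KO count in context. A minimal sketch; the chart title is illustrative:

# Visualize how many crawled pages returned each HTTP status code
fig = px.histogram(crawl_df, x="status", title=f"Status codes: {website_url}")
fig.show()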

Create email content

total_urls = len(crawl_df)
status_ko = len(df_ko)
status_ok = total_urls - status_ko

email_content = {
    "text1": emailbuilder.text("Dear Team,"),
    "text2": emailbuilder.text(f"We are sharing the status code report for the recent analysis of {website_url}:"),
    "list": emailbuilder.list(
        [
            f"Total OK ✅: {status_ok}",
            f"Total KO ❌: {status_ko}",
        ]
    ),
    "text3": emailbuilder.text("Please find attached the detailed report of the failed status codes."),
    "text4": emailbuilder.text("Please take appropriate actions to address these issues."),
    "text5": emailbuilder.text("Thank you!"),
    "text6": emailbuilder.text("Best regards,"),
}

content = emailbuilder.generate(display="iframe", **email_content)

Output

Save report in CSV

df_ko.to_csv(output_status_code_ko_path, index=False)

Send notification

if len(email_to) > 0:
    naas.notification.send(email_to, subject, content, files=[output_status_code_ko_path])

Schedule notebook

naas.scheduler.add(cron=cron)

# To stop the scheduled run, uncomment the line below and execute the cell
# naas.scheduler.delete()