jupyter-naas
GitHub Repository: jupyter-naas/awesome-notebooks
Path: blob/master/Advertools/Advertools_Check_status_code_and_Send_notifications.ipynb
Kernel: Python 3


Advertools - Check status code and Send report by email


Tags: #advertools #website #analyze #audit #seo #status_code #response_headers #naas #notification #scheduler

Last update: 2023-07-31 (Created: 2023-07-20)

Description: This notebook runs an automated status code checker that collects response headers using the HTTP HEAD method and sends a report by email.

NB:

  • Bulk and concurrent checking of status codes for a known list of URLs

  • Get all available response headers from all URLs

  • Set the speed, the number of concurrent requests, and various other crawling options

  • Does NOT download the full HTML of a page, which saves time, energy, and resources and makes the process extremely fast and light
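To make the last point concrete, here is a minimal standard-library sketch of a single HEAD request. It is not how advertools implements the check (advertools runs on Scrapy and handles concurrency itself); the helper name `head_status` is hypothetical and exists only to show that a HEAD request yields a status code and headers without a response body.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10.0):
    """Send one HEAD request and return (status_code, response_headers)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers.items())
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx, but the error still carries the
        # status code and the response headers.
        return error.code, dict(error.headers.items())


# Example call (performs a real network request, so it is left commented out):
# status, headers = head_status("https://www.naas.ai/")
```

Checking a long list of URLs one at a time like this would be slow, which is the gap the bulk, concurrent crawl in this notebook is meant to fill.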

Input

Install libraries

If you run this notebook on naas, uncomment and run the code below to uninstall (bug workaround) and then install the libraries

# !pip uninstall -y scrapy attrs
# !pip install advertools adviz pandas==1.5.3 --user

Import libraries

try:
    import advertools as adv
except ModuleNotFoundError:
    !pip install advertools --user
    import advertools as adv
from datetime import datetime
import naas
from naas_drivers import emailbuilder
import plotly.express as px
import pandas as pd

pd.options.display.max_columns = None

Setup Variables

  • url_list: List of URLs whose status codes will be checked

  • email_to: List of email addresses to send the report to

  • cron: We use CRON tasks to schedule notebooks; find the syntax you need at https://crontab.guru/

  • subject: Email subject

  • output_file: The path to the file for saving the output. It has to be in the '.jl' format. Note that new lines are appended to the end of the file; the file is not overwritten while crawling.

  • output_ko: Path of the CSV file listing the KO (failed) status codes, attached to the report

# Inputs
url_list = [
    "https://www.naas.ai/",
    "https://www.naas.ai/free-forever",
    "https://www.naas.ai/test",
]
email_to = []
cron = "0 0 * * *"  # This notebook will run every day at 0:00
subject = "Status code checker - Report 🚨"

# Outputs
output_file = f'{datetime.now().strftime("%Y%m%d%H%M%S")}_crawl_headers_output.jl'
output_ko = "status_code_ko.csv"

Model

Check status codes and retrieve response headers

# Create custom settings dict
custom_settings = {
    # optionally change request headers:
    'DEFAULT_REQUEST_HEADERS': {},
    'AUTOTHROTTLE_ENABLED': True,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
    'USER_AGENT': adv.spider.user_agent,
    'LOG_FILE': 'crawl_headers_output.log',
}

# Crawl headers
adv.crawl_headers(
    url_list=url_list,
    output_file=output_file,
    custom_settings=custom_settings
)

# Open the crawl output file
headers_df = pd.read_json(output_file, lines=True)
headers_df
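The '.jl' output is a JSON lines file: one JSON object per line, which is why it is read with lines=True. The sketch below uses only the standard library and two made-up records (real crawl output has many more fields) to show the format and the append behaviour. That behaviour is presumably why the file name in this notebook carries a timestamp, so each run starts with a fresh file.

```python
import json
import os
import tempfile

# Hypothetical records; real crawl_headers output has many more fields.
first_run = {"url": "https://example.com/", "status": 200}
second_run = {"url": "https://example.com/missing", "status": 404}

path = os.path.join(tempfile.mkdtemp(), "demo_output.jl")

# Two separate "runs" writing to the same path: mode "a" appends,
# so the second run does not overwrite the first.
for record in (first_run, second_run):
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")

# Read back: one JSON object per non-empty line.
with open(path, encoding="utf-8") as handle:
    rows = [json.loads(line) for line in handle if line.strip()]

statuses = [row["status"] for row in rows]
print(statuses)  # prints [200, 404]
```

Reusing a fixed file name across scheduled runs would therefore mix results from different days in one DataFrame.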

Filter on status code KO and create list

df_ko = headers_df[~headers_df["status"].isin([200, 201, 202, 203])].reset_index(drop=True)
urls = []
for row in df_ko.itertuples():
    url = row.url
    status = row.status
    text = f"{url} - {status}"
    urls.append(text)
urls
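The same filtering logic can be written without pandas, which may help when reading or testing it in isolation. This is a sketch under assumptions: the helper name `ko_lines` and the sample records are hypothetical, and the set of OK codes simply mirrors the four values used in the cell above.

```python
OK_STATUSES = {200, 201, 202, 203}  # same codes the notebook treats as OK


def ko_lines(records, ok_statuses=OK_STATUSES):
    """Return 'url - status' strings for records whose status is not OK."""
    return [
        f"{record['url']} - {record['status']}"
        for record in records
        if record["status"] not in ok_statuses
    ]


# Hypothetical crawl results for illustration only.
sample = [
    {"url": "https://example.com/", "status": 200},
    {"url": "https://example.com/missing", "status": 404},
    {"url": "https://example.com/error", "status": 500},
]
print(ko_lines(sample))
```

With the notebook's DataFrame, the records could be obtained with headers_df.to_dict("records"). Widening or narrowing what counts as OK is then a matter of passing a different ok_statuses set.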

Create email content

email_content = {
    "text1": emailbuilder.text("Dear Team,"),
    "text2": emailbuilder.text("We are sharing the report of KO (Failed) status codes for the recent analysis:"),
    "list": emailbuilder.list(urls),
    "text3": emailbuilder.text("Please take appropriate actions to address these issues."),
    "text4": emailbuilder.text("Thank you!"),
    "text5": emailbuilder.text("Best regards,"),
}
content = emailbuilder.generate(
    display="iframe",
    **email_content
)

Output

Save report in CSV

df_ko.to_csv(output_ko, index=False)

Send notification

naas.notification.send(email_to, subject, content, files=[output_ko])
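As written, the notification is sent on every run, even when no URL failed or when email_to is still the empty default. One possible guard is sketched below. The helper `send_if_needed` is hypothetical and not part of naas; it takes the actual send call as a function so that the decision logic can be checked on its own.

```python
def send_if_needed(ko_urls, recipients, send):
    """Call send() only when there are KO URLs and at least one recipient.

    Returns True if send() was called, False otherwise.
    """
    if not ko_urls or not recipients:
        return False
    send()
    return True


# Illustration with a stand-in for the real send call.
calls = []
sent = send_if_needed(
    ["https://example.com/missing - 404"],
    ["team@example.com"],
    lambda: calls.append("sent"),
)
print(sent, calls)
```

In the notebook, the third argument could be lambda: naas.notification.send(email_to, subject, content, files=[output_ko]).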

Schedule notebook

naas.scheduler.add(cron=cron)
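For readers unfamiliar with CRON syntax, the five fields of the expression can be labelled as in the sketch below. The helper `split_cron` is hypothetical and only splits a standard five-field expression; it does not validate ranges or compute run times.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression):
    """Map each of the five cron fields to its value in the expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


# "0 0 * * *": minute 0 of hour 0, every day of every month.
print(split_cron("0 0 * * *"))
```

Changing the hour field to 8, for example, would give "0 8 * * *", a daily run at 8:00.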