GitHub Repository: jupyter-naas/awesome-notebooks
Path: blob/master/Advertools/Advertools_Audit_robots_txt_and_xml_sitemap_issues.ipynb
Kernel: Python 3


Advertools - Audit robots.txt and XML sitemap issues

Give Feedback | Bug report

Tags: #advertools #xml #sitemap #website #audit #seo #robots.txt #google

Author: Elias Dabbas

Last update: 2023-05-30 (Created: 2023-05-29)

Description: This notebook helps you check if there are any conflicts between robots.txt rules and your XML sitemap.

  • Are you disallowing URLs that you shouldn't?

  • Test and make sure you don't publish new pages with such conflicts.

  • Do this in bulk: run all the tests for every URL/rule/user-agent combination with one command.

Input

Install libraries

If running this notebook on naas, run the cell below to install the libraries.

#!pip install advertools adviz pandas==1.5.3 --user

Import libraries

import advertools as adv

Setup Variables

  • robotstxt_url: URL of the robots.txt file to convert to a DataFrame

robotstxt_url = "https://www.youtube.com/robots.txt"

Model

Analyze potential robots.txt and XML sitemap conflicts

Get the robots.txt file and convert it to a DataFrame.

robots_df = adv.robotstxt_to_df(robotstxt_url=robotstxt_url)
robots_df
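
As an optional check, you can count how often each directive appears before testing; this sketch only uses the 'directive' column, which is also used further down in this notebook.

# Optional: overview of the directives in the robots.txt file
robots_df['directive'].value_counts()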

Get XML sitemap(s) and convert to a DataFrame.

# The function will extract and combine all available sitemaps
# in the robots.txt file
sitemap = adv.sitemap_to_df(
    robotstxt_url,
    max_workers=8,
    recursive=True
)
sitemap
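
Optionally, you can count the collected URLs and check for duplicates before running the tests; this sketch only relies on the 'loc' column, which the test below also uses.

# Optional: how many URLs were collected, and are any duplicated?
print("URLs in sitemap:", sitemap['loc'].notna().sum())
print("Duplicated URLs:", sitemap['loc'].duplicated().sum())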

Testing robots.txt

For all URL/user-agent combinations, check whether the URL is blocked.

user_agents = robots_df[robots_df['directive'].str.contains('user-agent', case=False)]['content']
user_agents
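
A robots.txt file can declare the same user agent in several rule blocks. To see only the distinct ones, you can drop duplicates; this is an optional step using a standard pandas method.

# Optional: show each declared user agent only once
user_agents.drop_duplicates()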

Generate the robots.txt test report:

# Get user agents
user_agents = robots_df[robots_df['directive'].str.contains('user-agent', case=False)]['content']
print(user_agents)

# Test robots.txt rules against all sitemap URLs
robots_report = adv.robotstxt_test(
    robotstxt_url=robotstxt_url,
    user_agents=user_agents,
    urls=sitemap['loc'].dropna()
)
print("Rows in report:", len(robots_report))
robots_report.head(5)

Does the website have URLs listed in the XML sitemap that are also disallowed by its robots.txt?

This is not necessarily a problem, because those URLs might be disallowed for some user agents only, but it is worth checking.
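
To get a quick per-user-agent view of the conflicts, you can summarize the report; this is a sketch that assumes the report contains a 'user_agent' column alongside the 'can_fetch' column used in the filtering step below (check robots_report.columns if unsure).

# Sketch: count blocked vs. allowed URLs per user agent
# (assumes 'user_agent' and 'can_fetch' columns in the report)
robots_report.groupby('user_agent')['can_fetch'].value_counts()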

Output

Get the URLs that cannot be fetched

Filter results

df_report = robots_report[~robots_report['can_fetch']].reset_index(drop=True)
print("Blocked URL/user-agent combinations:", len(df_report))
df_report.head(5)
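
If you want to keep the list of conflicts for later review, you can export the filtered report; the file name below is just an example.

# Optional: save the blocked URL/user-agent combinations to a CSV file
df_report.to_csv("robots_sitemap_conflicts.csv", index=False)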