Path: blob/master/Advertools/Advertools_Audit_robots_txt_and_xml_sitemap_issues.ipynb
Advertools - Audit robots txt and xml sitemap issues
Tags: #advertools #xml #sitemap #website #audit #seo #robots.txt #google
Author: Elias Dabbas
Last update: 2023-05-30 (Created: 2023-05-29)
Description: This notebook helps you check if there are any conflicts between robots.txt rules and your XML sitemap.
Are you disallowing URLs that you shouldn't?
Test and make sure you don't publish new pages with such conflicts.
Do this in bulk: run all tests for every URL/rule/user-agent combination with a single command.
Input
Install libraries
If you are running this notebook on naas, run the cell below to install the required libraries.
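A minimal install cell, assuming the standard advertools and pandas packages are all that is needed:

```shell
pip install advertools pandas
```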
Import libraries
Setup Variables
robotstxt_url
: URL of the robots.txt file to convert to a DataFrame
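A minimal setup cell; the example.com URL is a placeholder to replace with your own site's robots.txt:

```python
# URL of the robots.txt file to audit (placeholder value; replace with your own)
robotstxt_url = "https://www.example.com/robots.txt"
```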
Model
Analyze potential conflicts between robots.txt and the XML sitemap
Get the robots.txt file and convert it to a DataFrame
Get XML sitemap(s) and convert to a DataFrame
Testing robots.txt
For all URL/user-agent combinations check if the URL is blocked.
Generate the robots.txt test report:
Does the website have URLs listed in the XML sitemap that are also disallowed by its robots.txt?
(This is not necessarily a problem, because the URLs might be disallowed for some user agents only, but it is good to check.)
Output
Get the URLs that cannot be fetched
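This amounts to filtering the test report to rows where `can_fetch` is False; a toy DataFrame stands in here for the real `adv.robotstxt_test` output:

```python
import pandas as pd

# Toy stand-in for the robots.txt test report produced above
test_df = pd.DataFrame({
    "user_agent": ["Googlebot", "Googlebot", "*"],
    "url_path": ["/", "/search", "/"],
    "can_fetch": [True, False, True],
})

# Keep only the combinations where fetching is blocked
blocked = test_df[~test_df["can_fetch"]]
print(blocked)
```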