GitHub Repository: jupyter-naas/awesome-notebooks
Path: blob/master/Advertools/Advertools_Audit_robots_txt_and_xml_sitemap_issues.ipynb
Kernel: Python 3


Advertools - Audit robots.txt and XML sitemap issues

Give Feedback | Bug report

Tags: #advertools #xml #sitemap #website #audit #seo #robots.txt #google

Author: Elias Dabbas

Last update: 2023-05-30 (Created: 2023-05-29)

Description: This notebook helps you check if there are any conflicts between robots.txt rules and your XML sitemap.

  • Are you disallowing URLs that you shouldn't?

  • Test and make sure you don't publish new pages with such conflicts.

  • Do this in bulk: run all the tests for every URL/rule/user-agent combination with one command.

Input

Install libraries

If running this notebook on naas, run the cell below to install the libraries.

#!pip install advertools adviz pandas==1.5.3 --user

Import libraries

import advertools as adv

Setup Variables

  • robotstxt_url: URL of the robots.txt file to convert to a DataFrame

robotstxt_url = "https://www.youtube.com/robots.txt"

Model

Analyze potential robots.txt and XML sitemap conflicts

Get the robots.txt file and convert it to a DataFrame.

robots_df = adv.robotstxt_to_df(robotstxt_url=robotstxt_url)
robots_df
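
As an optional check, you can count how often each directive appears before testing; this sketch only uses the 'directive' column, which is also used further down in this notebook.

# Optional: overview of the directives in the robots.txt file
robots_df['directive'].value_counts()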

Get XML sitemap(s) and convert to a DataFrame.

# The function will extract and combine all available sitemaps
# in the robots.txt file
sitemap = adv.sitemap_to_df(
    robotstxt_url,
    max_workers=8,
    recursive=True
)
sitemap
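
Optionally, you can count the collected URLs and check for duplicates before running the tests; this sketch only relies on the 'loc' column, which the test below also uses.

# Optional: how many URLs were collected, and are any duplicated?
print("URLs in sitemap:", sitemap['loc'].notna().sum())
print("Duplicated URLs:", sitemap['loc'].duplicated().sum())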

Testing robots.txt

For all URL/user-agent combinations, check whether the URL is blocked.

user_agents = robots_df[robots_df['directive'].str.contains('user-agent', case=False)]['content']
user_agents
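
A robots.txt file can declare the same user agent in several rule blocks. To see only the distinct ones, you can drop duplicates; this is an optional step using a standard pandas method.

# Optional: show each declared user agent only once
user_agents.drop_duplicates()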

Generate the robots.txt test report:

# Get user agents
user_agents = robots_df[robots_df['directive'].str.contains('user-agent', case=False)]['content']
print(user_agents)

# Test robots.txt rules against all sitemap URLs
robots_report = adv.robotstxt_test(
    robotstxt_url=robotstxt_url,
    user_agents=user_agents,
    urls=sitemap['loc'].dropna()
)
print("Rows in report:", len(robots_report))
robots_report.head(5)

Does the website have URLs listed in the XML sitemap that are also disallowed by its robots.txt?

This is not necessarily a problem, because those URLs might be disallowed for some user agents only, but it is worth checking.
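
To get a quick per-user-agent view of the conflicts, you can summarize the report; this is a sketch that assumes the report contains a 'user_agent' column alongside the 'can_fetch' column used in the filtering step below (check robots_report.columns if unsure).

# Sketch: count blocked vs. allowed URLs per user agent
# (assumes 'user_agent' and 'can_fetch' columns in the report)
robots_report.groupby('user_agent')['can_fetch'].value_counts()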

Output

Get the URLs that cannot be fetched

Filter results

df_report = robots_report[~robots_report['can_fetch']].reset_index(drop=True)
print("Blocked URL/user-agent combinations:", len(df_report))
df_report.head(5)
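
If you want to keep the list of conflicts for later review, you can export the filtered report; the file name below is just an example.

# Optional: save the blocked URL/user-agent combinations to a CSV file
df_report.to_csv("robots_sitemap_conflicts.csv", index=False)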