jupyter-naas
GitHub Repository: jupyter-naas/awesome-notebooks
Path: blob/master/Advertools/Advertools_Analyze_website_content_using_XML_sitemap.ipynb
Kernel: Python 3


Advertools - Analyze website content using XML sitemap

Give Feedback | Bug report

Tags: #advertools #xml #sitemap #website #analyze #seo

Author: Elias Dabbas

Last update: 2023-05-23 (Created: 2023-05-09)

Description: This notebook helps you get an overview of a website's content by analyzing and visualizing its XML sitemap. It is also a useful SEO audit step that can uncover potential issues affecting the website.

Input

Install libraries

If running on naas, uncomment and run the code below to install the libraries

#!pip install advertools adviz pandas==1.5.3 --user

Import libraries

import advertools as adv
import adviz
from urllib.parse import urlsplit

Setup Variables

  • sitemap_url: URL of the sitemap to analyze, which can be

    • The URL of an XML sitemap

    • The URL of an XML sitemapindex

    • The URL of a robots.txt file

    • Normal and zipped formats are supported

  • recursive: If the URL points to a sitemapindex, should all the sub-sitemaps also be downloaded, parsed, and combined into one DataFrame?

  • max_workers: Number of concurrent workers to fetch the sitemaps.

sitemap_url = "https://blog.sriniketh.design/sitemap.xml"
recursive = True
max_workers = 8

Model

Analyze website content using XML sitemap

Getting the sitemap(s)

sitemap = adv.sitemap_to_df(
    sitemap_url=sitemap_url,
    max_workers=max_workers,
    recursive=recursive
)
sitemap
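Conceptually, sitemap_to_df fetches the XML and flattens each <url> entry into one row of a DataFrame. A minimal sketch of that parsing step on an inline sample sitemap (the sample data and variable names are illustrative, not advertools internals):

```python
import pandas as pd
import xml.etree.ElementTree as ET

# Inline sample standing in for a fetched sitemap.xml
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2023-05-01</lastmod></url>
  <url><loc>https://example.com/blog/post-1</loc><lastmod>2023-05-09</lastmod></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SAMPLE_SITEMAP)
# One row per <url> element, keeping the tags sitemaps commonly provide
rows = [
    {
        "loc": url.findtext("sm:loc", namespaces=NS),
        "lastmod": url.findtext("sm:lastmod", namespaces=NS),
    }
    for url in root.findall("sm:url", NS)
]
sitemap_sketch = pd.DataFrame(rows)
```

The real function adds more (recursion into sitemapindex files, gzip handling, error columns), but the row-per-URL shape is the same.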

Split URLs into their components for further analysis/understanding

urldf = adv.url_to_df(sitemap['loc'])
urldf
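For a single URL, the split that url_to_df performs per row looks roughly like this with the standard library (the dictionary here is an illustrative stand-in for one DataFrame row):

```python
from urllib.parse import urlsplit

url = "https://blog.sriniketh.design/getting-started-with-figma"
parts = urlsplit(url)

# Top-level components, as returned by urlsplit
row = {
    "scheme": parts.scheme,  # e.g. 'https'
    "netloc": parts.netloc,  # the domain
    "path": parts.path,      # everything after the domain
}

# url_to_df additionally splits the path into dir_1, dir_2, ... columns
dirs = [d for d in parts.path.split("/") if d]
```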

Output

Display results

Errors

if 'errors' in sitemap:
    from IPython.display import display
    display(sitemap[sitemap['errors'].notnull()])
else:
    print('No errors found')

Duplicated URLs

duplicated = sitemap[sitemap['loc'].duplicated()]
if not duplicated.empty:
    display(duplicated)
else:
    print('No duplicated URLs found')

URL counts per sitemap and sitemap sizes

Each sitemap should contain a maximum of 50,000 URLs, and its uncompressed size should not exceed 50 MB

URL counts:

adviz.value_counts_plus(sitemap['sitemap'], name='Sitemap URLs')

Sitemap sizes (MB):

sitemap['sitemap_size_mb'].describe().to_frame().T.style.format('{:,.2f}')
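To flag sitemaps that exceed the limits mentioned above, one option is a quick check on the combined DataFrame. The sketch below runs on a hypothetical sample; in the notebook, the sitemap and sitemap_size_mb columns come from adv.sitemap_to_df:

```python
import pandas as pd

# Hypothetical sample standing in for the DataFrame returned by adv.sitemap_to_df
sitemap = pd.DataFrame({
    "sitemap": ["https://example.com/sitemap.xml"] * 3,
    "loc": [f"https://example.com/page-{i}" for i in range(3)],
    "sitemap_size_mb": [0.02] * 3,
})

# Sitemaps with more than 50,000 URLs
url_counts = sitemap.groupby("sitemap")["loc"].count()
over_url_limit = url_counts[url_counts > 50_000]

# Sitemaps larger than 50 MB
over_size_limit = sitemap.loc[sitemap["sitemap_size_mb"] > 50, "sitemap"].unique()
```

Both results are empty for a healthy site; any entries they contain point at sitemaps that should be split.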

Count unique values of URL components

for col in ['scheme', 'netloc', 'dir_1', 'dir_2', 'dir_3']:
    try:
        display(adviz.value_counts_plus(urldf[col], name=col))
    except Exception:
        continue
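If adviz is unavailable, a rough stand-in for this step can be built with plain pandas. The sketch below assumes value_counts_plus essentially summarizes counts with percentages; the function name and column labels here are illustrative, not the adviz API:

```python
import pandas as pd

def value_counts_sketch(series, name="data"):
    # Rough stand-in for adviz.value_counts_plus: counts plus share of total
    counts = series.value_counts()
    return pd.DataFrame({
        name: counts.index,
        "count": counts.values,
        "perc": (counts / counts.sum()).round(3).values,
    })

dirs = pd.Series(["blog", "blog", "about", "blog", None])
value_counts_sketch(dirs, name="dir_1")
```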

Visualize the structure of the URLs

domain = urlsplit(sitemap_url).netloc
try:
    adviz.url_structure(
        urldf['url'].fillna(''),
        items_per_level=30,
        domain=domain,
        height=750,
        title=f'URL Structure: {domain} XML sitemap'
    )
except Exception as e:
    print(str(e))