jupyter-naas
GitHub Repository: jupyter-naas/awesome-notebooks
Path: blob/master/Advertools/Advertools_Analyze_website_content_using_XML_sitemap.ipynb
Kernel: Python 3


Advertools - Analyze website content using XML sitemap

Give Feedback | Bug report

Tags: #advertools #xml #sitemap #website #analyze #seo

Author: Elias Dabbas

Last update: 2023-05-23 (Created: 2023-05-09)

Description: This notebook helps you get an overview of a website's content by analyzing and visualizing its XML sitemap. It is also a useful SEO audit step that can uncover potential issues affecting the website.

Input

Install libraries

If running on naas, uncomment and run the code below to install the libraries

#!pip install advertools adviz pandas==1.5.3 --user

Import libraries

import advertools as adv
import adviz
from urllib.parse import urlsplit

Setup Variables

  • sitemap_url: URL of the sitemap to analyze, which can be

    • The URL of an XML sitemap

    • The URL of an XML sitemapindex

    • The URL of a robots.txt file

    • Normal and zipped formats are supported

  • recursive: If the URL points to a sitemapindex, should all the sub-sitemaps also be downloaded, parsed, and combined into one DataFrame?

  • max_workers: Number of concurrent workers to fetch the sitemaps.

sitemap_url = "https://blog.sriniketh.design/sitemap.xml"
recursive = True
max_workers = 8

Model

Analyze website content using XML sitemap

Getting the sitemap(s)

sitemap = adv.sitemap_to_df(
    sitemap_url=sitemap_url,
    max_workers=max_workers,
    recursive=recursive
)
sitemap
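Conceptually, sitemap_to_df fetches the XML and flattens each <url> entry into one row of a DataFrame. A minimal sketch of that parsing step on an inline sample sitemap (the sample data and variable names are illustrative, not advertools internals):

```python
import pandas as pd
import xml.etree.ElementTree as ET

# Inline sample standing in for a fetched sitemap.xml
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2023-05-01</lastmod></url>
  <url><loc>https://example.com/blog/post-1</loc><lastmod>2023-05-09</lastmod></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SAMPLE_SITEMAP)
# One row per <url> element, keeping the tags sitemaps commonly provide
rows = [
    {
        "loc": url.findtext("sm:loc", namespaces=NS),
        "lastmod": url.findtext("sm:lastmod", namespaces=NS),
    }
    for url in root.findall("sm:url", NS)
]
sitemap_sketch = pd.DataFrame(rows)
```

The real function adds more (recursion into sitemapindex files, gzip handling, error columns), but the row-per-URL shape is the same.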

Split URLs into their components for further analysis/understanding

urldf = adv.url_to_df(sitemap['loc'])
urldf
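For a single URL, the split that url_to_df performs per row looks roughly like this with the standard library (the dictionary here is an illustrative stand-in for one DataFrame row):

```python
from urllib.parse import urlsplit

url = "https://blog.sriniketh.design/getting-started-with-figma"
parts = urlsplit(url)

# Top-level components, as returned by urlsplit
row = {
    "scheme": parts.scheme,  # e.g. 'https'
    "netloc": parts.netloc,  # the domain
    "path": parts.path,      # everything after the domain
}

# url_to_df additionally splits the path into dir_1, dir_2, ... columns
dirs = [d for d in parts.path.split("/") if d]
```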

Output

Display results

Errors

if 'errors' in sitemap:
    from IPython.display import display
    display(sitemap[sitemap['errors'].notnull()])
else:
    print('No errors found')

Duplicated URLs

duplicated = sitemap[sitemap['loc'].duplicated()]
if not duplicated.empty:
    display(duplicated)
else:
    print('No duplicated URLs found')

URL counts per sitemap and sitemap sizes

Each sitemap should contain a maximum of 50,000 URLs, and its uncompressed size should not exceed 50 MB

URL counts:

adviz.value_counts_plus(sitemap['sitemap'], name='Sitemap URLs')

Sitemap sizes (MB):

sitemap['sitemap_size_mb'].describe().to_frame().T.style.format('{:,.2f}')
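To flag sitemaps that exceed the limits mentioned above, one option is a quick check on the combined DataFrame. The sketch below runs on a hypothetical sample; in the notebook, the sitemap and sitemap_size_mb columns come from adv.sitemap_to_df:

```python
import pandas as pd

# Hypothetical sample standing in for the DataFrame returned by adv.sitemap_to_df
sitemap = pd.DataFrame({
    "sitemap": ["https://example.com/sitemap.xml"] * 3,
    "loc": [f"https://example.com/page-{i}" for i in range(3)],
    "sitemap_size_mb": [0.02] * 3,
})

# Sitemaps with more than 50,000 URLs
url_counts = sitemap.groupby("sitemap")["loc"].count()
over_url_limit = url_counts[url_counts > 50_000]

# Sitemaps larger than 50 MB
over_size_limit = sitemap.loc[sitemap["sitemap_size_mb"] > 50, "sitemap"].unique()
```

Both results are empty for a healthy site; any entries they contain point at sitemaps that should be split.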

Count unique values of URL components

for col in ['scheme', 'netloc', 'dir_1', 'dir_2', 'dir_3']:
    try:
        display(adviz.value_counts_plus(urldf[col], name=col))
    except Exception:
        continue
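If adviz is unavailable, a rough stand-in for this step can be built with plain pandas. The sketch below assumes value_counts_plus essentially summarizes counts with percentages; the function name and column labels here are illustrative, not the adviz API:

```python
import pandas as pd

def value_counts_sketch(series, name="data"):
    # Rough stand-in for adviz.value_counts_plus: counts plus share of total
    counts = series.value_counts()
    return pd.DataFrame({
        name: counts.index,
        "count": counts.values,
        "perc": (counts / counts.sum()).round(3).values,
    })

dirs = pd.Series(["blog", "blog", "about", "blog", None])
value_counts_sketch(dirs, name="dir_1")
```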

Visualize the structure of the URLs

domain = urlsplit(sitemap_url).netloc
try:
    adviz.url_structure(
        urldf['url'].fillna(''),
        items_per_level=30,
        domain=domain,
        height=750,
        title=f'URL Structure: {domain} XML sitemap'
    )
except Exception as e:
    print(str(e))