
Advertools - Crawling a website

Give Feedback | Bug report

Tags: #advertools #adviz #crawling #website #analyze #seo #URL #audit #scraping #scrapy

Author: Elias Dabbas

Last update: 2023-07-20 (Created: 2023-07-20)

Description: This notebook demonstrates how to crawl a website, starting from one of its pages, and how to discover and follow the links found along the way.

  • Convert a website to a CSV file (a minimal sketch of this conversion follows this list)

  • Follow links with certain conditions:

    • Whether or not a link matches a certain regex

    • Whether or not a link contains certain query parameters

  • Extract special elements from pages using CSS/XPath selectors

  • Manage your crawling process with advanced settings (number of concurrent requests, when to stop crawling, proxies, and much more)
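
As a minimal sketch of the first point above (converting a crawl to a CSV file), assuming a crawl has already produced the JSON-lines file website_crawl_YYYY_MM_DD.jl used later in this notebook:

import pandas as pd

# Read the JSON-lines (.jl) crawl output and export it as a CSV file
crawldf = pd.read_json('website_crawl_YYYY_MM_DD.jl', lines=True)
crawldf.to_csv('website_crawl_YYYY_MM_DD.csv', index=False)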

Input

Install libraries

If running on naas, run the code below to uninstall scrapy and attrs (to work around a known bug) and install the required libraries

# !pip uninstall -y scrapy attrs
# !pip install advertools adviz pandas==1.5.3 --user

Import libraries

try:
    import advertools as adv
except ModuleNotFoundError:
    !pip install advertools
    import advertools as adv
import pandas as pd

pd.options.display.max_columns = None

Setup Variables

  • url_list: One or more URLs to start crawling from (typically the home page, but not necessarily)

  • output_file: The path to the file where you want to save your crawl data

  • follow_links: Whether or not to follow links on each page that you crawl

  • allowed_domains: A list of domains to include in the crawl. By default the URLs in url_list, and all sub-domains under them, will be included, but you can customize/restrict this further

  • exclude_url_params: If a link contains any of these parameters, don't follow it

  • include_url_params: If a link contains any of these parameters, DO follow it

  • exclude_url_regex: If a link matches this regex, don't follow it

  • include_url_regex: If a link matches this regex, DO follow it

  • css_selectors: A dictionary of CSS selectors for special data to be extracted from crawled pages

  • xpath_selectors: A dictionary of XPath selectors for special data to be extracted from crawled pages

  • custom_settings: Many options are available; some of the important ones can be found below

crawl_params = dict(
    url_list=['https://example.com'],
    output_file='website_crawl_YYYY_MM_DD.jl',  # has to end with .jl
    follow_links=True,  # the default is False
    allowed_domains=None,
    exclude_url_params=None,
    include_url_params=None,
    exclude_url_regex=None,
    include_url_regex=None,
    css_selectors=None,
    xpath_selectors=None,
    custom_settings={
        'LOG_FILE': 'website_crawl_YYYY_MM_DD.log',
        'CLOSESPIDER_PAGECOUNT': 0,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
        'DEFAULT_REQUEST_HEADERS': {},
        'DEPTH_LIMIT': 0,
        'USER_AGENT': adv.spider.user_agent,
    }
)
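
For illustration only, here is a sketch of how the filtering and extraction parameters might be filled in for a real crawl. The URL patterns, parameter names, and selectors below are assumptions for a hypothetical site and should be adapted to yours:

# Illustrative values only (hypothetical site); adapt the regexes, parameters, and selectors
example_params = dict(
    url_list=['https://example.com'],
    output_file='website_crawl_YYYY_MM_DD.jl',
    follow_links=True,
    include_url_regex='/blog/|/products/',  # only follow links matching this pattern
    exclude_url_params=['utm_source', 'utm_medium'],  # don't follow links carrying these parameters
    css_selectors={
        'product_name': 'h1.product-title::text',  # extracted into a column named product_name
        'product_price': '.price::text',
    },
    xpath_selectors={
        'breadcrumbs': '//nav[@class="breadcrumb"]//a/text()',
    },
)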

Model

Crawl the website using the chosen options

adv.crawl(**crawl_params)

Output

Read and analyze the crawl DataFrame

crawldf = pd.read_json(crawl_params['output_file'], lines=True)
crawldf
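
As an optional follow-up, here is a sketch of a few quick checks on the crawl DataFrame. The status and title columns are standard in advertools crawl output, but the exact set of available columns depends on the pages crawled:

# Quick optional checks (column availability depends on the crawled pages)
print(crawldf['status'].value_counts())  # distribution of HTTP status codes
print(crawldf['title'].str.len().describe())  # summary of page title lengths
crawldf[['url', 'title', 'status']].head(10)  # preview a few crawled pages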