
Advertools - Crawling a website

Give Feedback | Bug report

Tags: #advertools #adviz #crawling #website #analyze #seo #URL #audit #scraping #scrapy

Author: Elias Dabbas

Last update: 2023-07-20 (Created: 2023-07-20)

Description: This notebook demonstrates how to crawl a website, starting from one of its pages, and how to discover and follow the links found along the way.

  • Convert a website to a CSV file (a minimal sketch of this conversion follows this list)

  • Follow links with certain conditions:

    • Whether or not a link matches a certain regex

    • Whether or not a link contains certain query parameters

  • Extract special elements from pages using CSS/XPath selectors

  • Manage your crawling process with advanced settings (number of concurrent requests, when to stop crawling, proxies, and much more)
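
As a minimal sketch of the first point above (converting a crawl to a CSV file), assuming a crawl has already produced the JSON-lines file website_crawl_YYYY_MM_DD.jl used later in this notebook:

import pandas as pd

# Read the JSON-lines (.jl) crawl output and export it as a CSV file
crawldf = pd.read_json('website_crawl_YYYY_MM_DD.jl', lines=True)
crawldf.to_csv('website_crawl_YYYY_MM_DD.csv', index=False)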

Input

Install libraries

If running on naas, run the code below to uninstall scrapy and attrs (to work around a known bug) and install the required libraries

# !pip uninstall -y scrapy attrs
# !pip install advertools adviz pandas==1.5.3 --user

Import libraries

try:
    import advertools as adv
except ModuleNotFoundError:
    !pip install advertools
    import advertools as adv
import pandas as pd

pd.options.display.max_columns = None

Setup Variables

  • url_list: One or more URLs to start crawling from (typically the home page, but not necessarily)

  • output_file: The path to the file where you want to save your crawl data

  • follow_links: Whether or not to follow links on each page that you crawl

  • allowed_domains: A list of domains to include in the crawl. By default the URLs in url_list, and all sub-domains under them, will be included, but you can customize/restrict this further

  • exclude_url_params: If a link contains any of these parameters, don't follow it

  • include_url_params: If a link contains any of these parameters, DO follow it

  • exclude_url_regex: If a link matches this regex, don't follow it

  • include_url_regex: If a link matches this regex, DO follow it

  • css_selectors: A dictionary of CSS selectors for special data to be extracted from crawled pages

  • xpath_selectors: A dictionary of XPath selectors for special data to be extracted from crawled pages

  • custom_settings: Many options are available; some of the important ones can be found below

crawl_params = dict(
    url_list=['https://example.com'],
    output_file='website_crawl_YYYY_MM_DD.jl',  # has to end with .jl
    follow_links=True,  # the default is False
    allowed_domains=None,
    exclude_url_params=None,
    include_url_params=None,
    exclude_url_regex=None,
    include_url_regex=None,
    css_selectors=None,
    xpath_selectors=None,
    custom_settings={
        'LOG_FILE': 'website_crawl_YYYY_MM_DD.log',
        'CLOSESPIDER_PAGECOUNT': 0,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
        'DEFAULT_REQUEST_HEADERS': {},
        'DEPTH_LIMIT': 0,
        'USER_AGENT': adv.spider.user_agent,
    }
)
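
For illustration only, here is a sketch of how the filtering and extraction parameters might be filled in for a real crawl. The URL patterns, parameter names, and selectors below are assumptions for a hypothetical site and should be adapted to yours:

# Illustrative values only (hypothetical site); adapt the regexes, parameters, and selectors
example_params = dict(
    url_list=['https://example.com'],
    output_file='website_crawl_YYYY_MM_DD.jl',
    follow_links=True,
    include_url_regex='/blog/|/products/',  # only follow links matching this pattern
    exclude_url_params=['utm_source', 'utm_medium'],  # don't follow links carrying these parameters
    css_selectors={
        'product_name': 'h1.product-title::text',  # extracted into a column named product_name
        'product_price': '.price::text',
    },
    xpath_selectors={
        'breadcrumbs': '//nav[@class="breadcrumb"]//a/text()',
    },
)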

Model

Crawl the website using the chosen options

adv.crawl(**crawl_params)

Output

Read and analyze the crawl DataFrame

crawldf = pd.read_json(crawl_params['output_file'], lines=True)
crawldf
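
As an optional follow-up, here is a sketch of a few quick checks on the crawl DataFrame. The status and title columns are standard in advertools crawl output, but the exact set of available columns depends on the pages crawled:

# Quick optional checks (column availability depends on the crawled pages)
print(crawldf['status'].value_counts())  # distribution of HTTP status codes
print(crawldf['title'].str.len().describe())  # summary of page title lengths
crawldf[['url', 'title', 'status']].head(10)  # preview a few crawled pages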