Path: blob/master/Advertools/Advertools_Crawl_a_website.ipynb
Advertools - Crawling a website
Tags: #advertools #adviz #crawling #website #analyze #seo #URL #audit #scraping #scrapy
Author: Elias Dabbas
Last update: 2023-07-20 (Created: 2023-07-20)
Description: This notebook demonstrates how to crawl a website, starting from one of its pages, discovering and following all links along the way.
Convert a website to a CSV file
Follow links with certain conditions:
Whether or not a link matches a certain regex
Whether or not a link contains certain query parameters
Extract special elements from pages using CSS/XPath selectors
Manage your crawling process with advanced settings (number of concurrent requests, when to stop crawling, proxies, and much more)
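To illustrate the "convert a website to a CSV file" step: the crawler writes its output as jsonlines (one JSON object per crawled page), which can then be flattened into CSV. The file names below are hypothetical, and in practice pandas' read_json(..., lines=True) is the more common route; this is a minimal stdlib sketch:

```python
import csv
import json

# Simulate two crawled pages in jsonlines format (the format the crawler writes).
sample = [
    {"url": "https://example.com/", "title": "Home", "status": 200},
    {"url": "https://example.com/about", "title": "About", "status": 200},
]
with open("crawl_sample.jl", "w", encoding="utf-8") as f:
    for row in sample:
        f.write(json.dumps(row) + "\n")

# Read the jsonlines file back and flatten it into a CSV,
# one row per crawled page.
with open("crawl_sample.jl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

with open("crawl_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```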
Input
Install libraries
If running this on naas, run the code below to first uninstall the libraries (to work around a bug) and then reinstall them
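The original install cell is not reproduced here; a typical uninstall-then-reinstall pattern (package names assumed from the tags above) looks like this:

```shell
# Uninstall first to work around the packaging bug, then reinstall cleanly.
pip uninstall -y advertools adviz
pip install advertools adviz
```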
Import libraries
Setup Variables
url_list: One or more URLs to start crawling from (typically the home page, but not necessarily)
output_file: The path to the file where you want to save your crawl data
follow_links: Whether or not to follow links on each page that you crawl
allowed_domains: A list of domains to include in the crawl. By default the URLs in url_list, and all sub-domains under them, are included, but you can customize/restrict this further
exclude_url_params: If a link contains any of these parameters, don't follow it
include_url_params: If a link contains any of these parameters, DO follow it
exclude_url_regex: If a link matches this regex, don't follow it
include_url_regex: If a link matches this regex, DO follow it
css_selectors: A dictionary of CSS selectors for special data to be extracted from crawled pages
xpath_selectors: A dictionary of XPath selectors for special data to be extracted from crawled pages
custom_settings: Many options are available; some of the important ones can be found below
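Putting the variables above together, a hypothetical configuration might look like the sketch below. The URLs, selectors, regexes, and settings values are illustrative placeholders, not from the notebook; the parameter names mirror advertools' adv.crawl:

```python
# Hypothetical crawl configuration -- adapt every value to your own site.
url_list = ["https://example.com"]          # page(s) to start crawling from
output_file = "example_crawl.jl"            # adv.crawl writes jsonlines (.jl)
follow_links = True                         # discover and follow links on each page
allowed_domains = ["example.com"]           # restrict the crawl to these domains
exclude_url_params = ["utm_source", "utm_medium"]  # skip links carrying these params
include_url_params = None                   # or a list of params a link must have
exclude_url_regex = r"/tag/|/author/"       # skip links matching this regex
include_url_regex = None                    # or a regex a link must match
css_selectors = {"price": ".product-price::text"}   # extra per-page data to extract
xpath_selectors = {"h2_headings": "//h2/text()"}
custom_settings = {"CONCURRENT_REQUESTS": 8, "CLOSESPIDER_PAGECOUNT": 100}

def run_crawl():
    """Run the crawl with the settings above (requires advertools installed)."""
    import advertools as adv
    adv.crawl(
        url_list,
        output_file,
        follow_links=follow_links,
        allowed_domains=allowed_domains,
        exclude_url_params=exclude_url_params,
        include_url_params=include_url_params,
        exclude_url_regex=exclude_url_regex,
        include_url_regex=include_url_regex,
        css_selectors=css_selectors,
        xpath_selectors=xpath_selectors,
        custom_settings=custom_settings,
    )
```

The crawl call is wrapped in a function here so the configuration can be reviewed before launching what may be a long-running job; calling run_crawl() starts the actual crawl.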