GitHub Repository: jupyter-naas/awesome-notebooks
Path: blob/master/BeautifulSoup/BeautifulSoup_Scrape_emails_from_URL.ipynb
Kernel: Python 3


BeautifulSoup - Scrape emails from URL

Give Feedback | Bug report

Tags: #beautifulsoup #python #scraping #emails #url #webscraping #html

Last update: 2023-04-12 (Created: 2023-02-16)

Description: This notebook shows how to scrape email addresses from an HTML webpage using BeautifulSoup.

Input

Import libraries

import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd

Setup Variables

  • url: URL of the webpage to scrape

  • limit: maximum number of emails to collect before the crawl stops

url = "https://www.naas.ai/"
limit = 3

Model

Scrape emails from URL

We will use the requests library to fetch each page's HTML and the BeautifulSoup library to parse it. A regular expression extracts email addresses from both the URL and the page content, and links found on each page are queued in a deque so the crawl continues until limit emails have been collected.
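Before running the full crawler, here is a minimal sketch of the extraction step in isolation, applied to a hypothetical HTML snippet (sample_html and pattern are illustrative names, not part of the notebook):

import re

# Hypothetical snippet standing in for a fetched page
sample_html = '<p>Contact us at hello@naas.ai or support@naas.ai</p>'

# Same idea as the crawler's pattern: local part, "@", domain, dot, TLD
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]{2,}"
print(set(re.findall(pattern, sample_html, re.I)))
# -> {'hello@naas.ai', 'support@naas.ai'}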

unscraped = deque([url])  # URLs still to crawl
scraped = set()           # URLs already visited
emails = set()            # unique email addresses found

# Domains to ignore when collecting emails
exclude = ["google.com", "gmail.com", "example.com"]

# Email pattern: local part, "@", domain, dot, TLD of 2+ letters
email_regex = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]{2,}"

while len(unscraped):
    url = unscraped.popleft()
    scraped.add(url)
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if '/' in parts.path:
        path = url[:url.rfind('/') + 1]
    else:
        path = url
    print("Crawling URL: %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        continue

    # Get emails from the URL itself and from the page content
    new_emails = set(re.findall(email_regex, url, re.I))
    new_emails |= set(re.findall(email_regex, response.text, re.I))
    for email in new_emails:
        # Keep the email only if its domain is not on the exclude list
        if not any(email.endswith(e) for e in exclude):
            emails.add(email)

    # Stop crawling once enough emails have been collected
    if len(emails) >= limit:
        break

    # Queue every new link found on the page
    soup = BeautifulSoup(response.text, 'lxml')
    for anchor in soup.find_all("a"):
        link = anchor.attrs.get("href", "")
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        if not link.endswith(".gz"):
            if link not in unscraped and link not in scraped:
                unscraped.append(link)

print(emails)
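Note that the manual base_url/path handling above can mangle hrefs such as "../page" or "#section". As an alternative (not the notebook's original approach), urllib.parse.urljoin resolves any relative href against the current page URL in one call; the loop below is a sketch of a drop-in replacement for the link-queuing step:

from urllib.parse import urljoin

# Resolve each href against the current page URL; urljoin handles absolute,
# root-relative, and relative links uniformly.
for anchor in soup.find_all("a", href=True):
    link = urljoin(url, anchor["href"])
    if link not in unscraped and link not in scraped:
        unscraped.append(link)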

Output

Display result

print(f"🚀 {len(emails)} emails found on {url}")
print(emails)
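pandas is imported at the top but never used in the scraping cell; a natural follow-up is to load the results into a DataFrame for export. This is a sketch, and the "email" column name and emails.csv path are arbitrary choices, not from the notebook:

# Put the scraped emails into a one-column DataFrame and save it to CSV
df = pd.DataFrame(sorted(emails), columns=["email"])
df.to_csv("emails.csv", index=False)
df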