GitHub Repository: jupyter-naas/awesome-notebooks
Path: blob/master/BeautifulSoup/BeautifulSoup_Scrape_emails_from_URL.ipynb
Kernel: Python 3


BeautifulSoup - Scrape emails from URL

Give Feedback | Bug report

Tags: #beautifulsoup #python #scraping #emails #url #webscraping #html

Last update: 2023-04-12 (Created: 2023-02-16)

Description: This notebook shows how to scrape email addresses from an HTML webpage using BeautifulSoup.

Input

Import libraries

import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd

Setup Variables

  • url: URL of the webpage to scrape

  • limit: maximum number of emails to collect before the crawl stops

url = "https://www.naas.ai/"
limit = 3

Model

Scrape emails from URL

We will use the requests library to fetch each page's HTML and the BeautifulSoup library to parse it. A regular expression extracts email addresses from both the URL and the page content, and links found on each page are queued in a deque so the crawl continues until limit emails have been collected.
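Before running the full crawler, here is a minimal sketch of the extraction step in isolation, applied to a hypothetical HTML snippet (sample_html and pattern are illustrative names, not part of the notebook):

import re

# Hypothetical snippet standing in for a fetched page
sample_html = '<p>Contact us at hello@naas.ai or support@naas.ai</p>'

# Same idea as the crawler's pattern: local part, "@", domain, dot, TLD
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]{2,}"
print(set(re.findall(pattern, sample_html, re.I)))
# -> {'hello@naas.ai', 'support@naas.ai'}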

unscraped = deque([url])  # URLs still to crawl
scraped = set()           # URLs already visited
emails = set()            # unique email addresses found

# Domains to ignore when collecting emails
exclude = ["google.com", "gmail.com", "example.com"]

# Email pattern: local part, "@", domain, dot, TLD of 2+ letters
email_regex = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]{2,}"

while len(unscraped):
    url = unscraped.popleft()
    scraped.add(url)
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if '/' in parts.path:
        path = url[:url.rfind('/') + 1]
    else:
        path = url
    print("Crawling URL: %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        continue

    # Get emails from the URL itself and from the page content
    new_emails = set(re.findall(email_regex, url, re.I))
    new_emails |= set(re.findall(email_regex, response.text, re.I))
    for email in new_emails:
        # Keep the email only if its domain is not on the exclude list
        if not any(email.endswith(e) for e in exclude):
            emails.add(email)

    # Stop crawling once enough emails have been collected
    if len(emails) >= limit:
        break

    # Queue every new link found on the page
    soup = BeautifulSoup(response.text, 'lxml')
    for anchor in soup.find_all("a"):
        link = anchor.attrs.get("href", "")
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        if not link.endswith(".gz"):
            if link not in unscraped and link not in scraped:
                unscraped.append(link)

print(emails)
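Note that the manual base_url/path handling above can mangle hrefs such as "../page" or "#section". As an alternative (not the notebook's original approach), urllib.parse.urljoin resolves any relative href against the current page URL in one call; the loop below is a sketch of a drop-in replacement for the link-queuing step:

from urllib.parse import urljoin

# Resolve each href against the current page URL; urljoin handles absolute,
# root-relative, and relative links uniformly.
for anchor in soup.find_all("a", href=True):
    link = urljoin(url, anchor["href"])
    if link not in unscraped and link not in scraped:
        unscraped.append(link)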

Output

Display result

print(f"🚀 {len(emails)} emails found on {url}")
print(emails)
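pandas is imported at the top but never used in the scraping cell; a natural follow-up is to load the results into a DataFrame for export. This is a sketch, and the "email" column name and emails.csv path are arbitrary choices, not from the notebook:

# Put the scraped emails into a one-column DataFrame and save it to CSV
df = pd.DataFrame(sorted(emails), columns=["email"])
df.to_csv("emails.csv", index=False)
df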