How to Scrape Email and Phone Number from Any Website with Python

Do you have a large number of website URLs and want to scrape all the email addresses and phone numbers from them?

This tutorial shows how to scrape email addresses and phone numbers with a Python script built on the requests library and regular expressions.

Scraping contact details from a single website can be done manually without much difficulty. However, when there are many websites to scrape, the process becomes time-consuming. A Python script solves this problem.

In particular, using a Python script allows you to scrape data from thousands of URLs in a matter of minutes. This saves a significant amount of time and effort compared to manual scraping.

The script will use regular expressions to identify patterns that match the structure of email addresses and phone numbers. If the pattern is matched, the script will extract the relevant information from the website. By the end of this tutorial, you will have a working script that can quickly and efficiently scrape email and phone numbers from a website.

So, let’s begin!

Table of Contents

  1. Create a Python file
  2. Build Regular Expressions for Phone Numbers
  3. Build Regular Expression for Email ID
  4. Source code to read the URL from the CSV file
  5. Source Code to find all phone numbers and emails
  6. Running the program

Step 1: Create a Python file

First, create a new Python file called email_phone_scrap.py. Then import the libraries the program needs. Note that requests is a third-party package; if it is not installed yet, run pip install requests. Your program should look like the code block below:

# email_phone_scrap.py - Scrape emails and phone numbers from given websites.

import csv  # for reading/writing in CSV file
import re   # for regular expressions
import requests  # for opening web page

Step 2: Build Regular Expressions for Phone Numbers

To search for phone numbers, you will need to create a regular expression.

First, let’s understand the structure of a typical phone number, which consists of three parts: the area code (usually three digits), the first three digits, and the last four digits. These parts are typically separated by a symbol, such as a hyphen or a space. For example, 122-456-7890. A number may also end with an extension, such as ext 1234.

To create a regular expression for this pattern, break it down into the following parts:

Phone number part                       Regular expression
Area code (optional)                    (\d{3}|\(\d{3}\))?
Separator: space, - or . (optional)     (\s|-|\.)?
First 3 digits                          (\d{3})
Separator: space, - or .                (\s|-|\.)
Last 4 digits                           (\d{4})
Extension (optional)                    (\s*(ext|x|ext\.)\s*(\d{2,5}))?

Let’s put it together to create a regex for the phone number.

# email_phone_scrap.py - Scrape emails and phone numbers from given websites.

import csv  # for reading/writing in CSV file
--snip--

# Create phone number regular expression
phone_regex = re.compile(r'''(
                        (\d{3}|\(\d{3}\))?
                        (\s|-|\.)?
                        (\d{3})
                        (\s|-|\.)
                        (\d{4})
                        (\s*(ext|x|ext\.)\s*(\d{2,5}))?)''', re.VERBOSE)

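Note: The re.VERBOSE flag makes the regex engine ignore whitespace and # comments inside the pattern, which is why the regular expression can be spread over several lines.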
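
Before wiring the pattern into the scraper, it can help to try it on a short string. The following is a minimal sketch: the sample text and its numbers are made up for illustration, and the dot in ext\. is escaped so that it matches a literal period. With findall, each result is a tuple of groups, and index 0 holds the whole match.

```python
import re

# Same parts as in the table above, with a comment on each line
# (comments are allowed because of re.VERBOSE).
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?               # area code, optional
    (\s|-|\.)?                       # separator, optional
    (\d{3})                          # first 3 digits
    (\s|-|\.)                        # separator
    (\d{4})                          # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # extension, optional
    )''', re.VERBOSE)

# Made-up sample text, not taken from any real page.
sample = "Call 415-555-0100 or (800) 555.0199 ext 42 today."

# groups[0] is the full text matched by the outermost parentheses.
found = [groups[0] for groups in phone_regex.findall(sample)]
print(found)

# A number written without any separator is not matched,
# because the separator before the last 4 digits is required.
no_separator = phone_regex.findall("4155550100")
print(no_separator)
```

The second call returns an empty list, which shows a limit of this pattern: it needs a space, hyphen, or dot before the last four digits.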

If this code is hard to follow, review the basics of Python and its regular expressions before continuing.

Step 3: Build Regular Expression for Email ID

Next, let’s move on to creating a regular expression to match the Email ID pattern.

When creating a regular expression for email addresses, you will need to consider the different parts of an email address. Typically, an email address consists of four main components: the username, an @ symbol, the domain name, and a suffix (such as .com or .edu). For example, contact@kushalstudy.com.

Email part              Regular expression
User name               [a-zA-Z0-9._%+-]+
@ symbol                @
Domain name             [a-zA-Z0-9.-]+
Dot and suffix          (\.[a-zA-Z]{2,4})

Let’s put it together to create a regex for email addresses.

# email_phone_scrap.py - Scrape emails and phone numbers from given websites.

import csv  # for reading/writing in CSV file
--snip--

# Create phone regular expression
phone_regex = re.compile(r'''(
--snip--

# Create email id regular expression
email_regex = re.compile(r'''(
                        [a-zA-Z0-9._%+-]+
                        @
                        [a-zA-Z0-9.-]+
                        (\.[a-zA-Z]{2,4}))''', re.VERBOSE)
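
As a quick check, the email pattern can be run against a short string as well. This is a minimal sketch; apart from the example address used earlier in this step, the addresses are made up for illustration.

```python
import re

email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+     # user name
    @                     # @ symbol
    [a-zA-Z0-9.-]+        # domain name
    (\.[a-zA-Z]{2,4})     # dot and suffix
    )''', re.VERBOSE)

sample = "Write to contact@kushalstudy.com or sales.team@example.org."

# groups[0] is the full address; the sentence-ending dot is left out.
emails = [groups[0] for groups in email_regex.findall(sample)]
print(emails)

# The suffix is limited to 2-4 letters, so a longer suffix is cut short.
truncated = email_regex.findall("info@example.technology")[0][0]
print(truncated)
```

The last line prints info@example.tech, so if you expect suffixes longer than four letters, consider widening the {2,4} range.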

Step 4: Source code to read the URL from the CSV file

Create a new CSV file named website_urls.csv and put all the website URLs in column A, one URL per row. Store this CSV file in the same directory where email_phone_scrap.py is saved.
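
If you would rather generate the CSV file from Python than from a spreadsheet, something like the following could work. It is only a sketch: the two URLs are placeholders, and the file is written to a temporary folder here, whereas the tutorial expects it next to email_phone_scrap.py.

```python
import csv
import os
import tempfile

# Placeholder URLs for illustration only.
urls = ["https://example.com/contact", "https://example.org/about"]

# Temporary folder used for this demo; in practice, save the file
# in the same directory as email_phone_scrap.py.
csv_path = os.path.join(tempfile.mkdtemp(), "website_urls.csv")

with open(csv_path, "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for url in urls:
        writer.writerow([url])  # one URL per row, in the first column

# Read the file back the same way the scraper does, skipping empty rows.
with open(csv_path, newline="") as csv_file:
    read_back = [row[0] for row in csv.reader(csv_file) if row]

print(read_back)
```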

Next, write the Python code that reads each URL from the CSV file. Add the following code to email_phone_scrap.py.

# email_phone_scrap.py - Scrape emails and phone numbers from given websites.

import csv  # for reading/writing in CSV file
--snip--

# Create phone number regular expression
phone_regex = re.compile(r'''(
--snip--

# Create email id regular expression
email_regex = re.compile(r'''(
--snip--

# Open URLs from the CSV file and fetch each page
with open("website_urls.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')

    for row in csv_reader:
        page_url = row[0]
        print("Opening URL:", page_url)
        page_data = requests.get(page_url)
        page_html = page_data.text  # response body decoded as a string

Step 5: Source code to find all phone numbers and emails

The script opens the webpages listed in the CSV file one by one. To find all phone numbers and emails on each page, apply the regular expressions created above to the page HTML.

# email_phone_scrap.py - Scrape emails and phone numbers from given websites.

import csv  # for reading/writing in CSV file
--snip--

# Create phone number regular expression
phone_regex = re.compile(r'''(
--snip--

# Create email id regular expression
email_regex = re.compile(r'''(
--snip--

# Open URLs from the CSV file and fetch each page
with open("website_urls.csv") as csv_file:
--snip--
        matches = []
        for groups in phone_regex.findall(page_html):
            phone_numbers = '-'.join([groups[1], groups[3], groups[5]])
            if groups[8] != '':
                phone_numbers += ' x' + groups[8]            
            matches.append(phone_numbers)    
        
        for groups in email_regex.findall(page_html):
            matches.append(groups[0])
        
        print('\n'.join(matches))

The complete source code should look like this.

# email_phone_scrap.py - Scrape emails and phone numbers from given websites.

import csv  # for reading/writing in CSV file
import re   # for regular expressions
import requests  # for opening web page

# Create phone number regular expression
phone_regex = re.compile(r'''(
                        (\d{3}|\(\d{3}\))?
                        (\s|-|\.)?
                        (\d{3})
                        (\s|-|\.)
                        (\d{4})
                        (\s*(ext|x|ext\.)\s*(\d{2,5}))?)''', re.VERBOSE)

# Create email id regular expression
email_regex = re.compile(r'''(
                        [a-zA-Z0-9._%+-]+
                        @
                        [a-zA-Z0-9.-]+
                        (\.[a-zA-Z]{2,4}))''', re.VERBOSE)

# Open URLs from the CSV file and scrape each page
with open("website_urls.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')

    for row in csv_reader:
        page_url = row[0]
        print("Opening URL:", page_url)
        page_data = requests.get(page_url)

        # Get the response body decoded as a string
        page_html = page_data.text

        matches = []
        for groups in phone_regex.findall(page_html):
            phone_numbers = '-'.join([groups[1], groups[3], groups[5]])
            if groups[8] != '':
                phone_numbers += ' x' + groups[8]
            matches.append(phone_numbers)

        for groups in email_regex.findall(page_html):
            matches.append(groups[0])

        print('\n'.join(matches))

Step 6: Running the program

To make things simpler, we will only include one URL in the CSV file for this example. When you run the program, the output should look similar to the following:

Opening URL: https://nostarch.com/contactus/
-010-4093
800-420-7240
415-863-9900
415-863-9950
support@nostarch.com
academic@nostarch.com
sales@nostarch.com
conferences@nostarch.com
errata@nostarch.com
support@nostarch.com
academic@nostarch.com
sales@nostarch.com
conferences@nostarch.com
errata@nostarch.com
info@nostarch.com
media@nostarch.com
editors@nostarch.com
rights@nostarch.com
support@nostarch.com
academic@nostarch.com

This output shows the phone numbers and email addresses that were found on the webpage. If you had multiple URLs in the CSV file, the program would scrape all of them and display the results in the same way.
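
Two details in the sample output are worth a closer look: the first line starts with a stray hyphen (-010-4093), which happens when a number without an area code is joined with hyphens, and several addresses appear more than once. One possible way to tidy this up is sketched below. The helper names format_phone and unique_in_order are hypothetical and not part of the original script.

```python
def format_phone(area, first3, last4, ext=""):
    """Join the non-empty parts with hyphens and append the extension, if any."""
    number = "-".join(part for part in (area, first3, last4) if part)
    if ext:
        number += " x" + ext
    return number


def unique_in_order(items):
    """Drop repeats while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


# An empty area code no longer produces a leading hyphen.
print(format_phone("", "010", "4093"))          # 010-4093
print(format_phone("415", "863", "9900", "12")) # 415-863-9900 x12

# Repeated matches are reported once, in the order they were found.
print(unique_in_order(["a@example.com", "b@example.com", "a@example.com"]))
```

In the script, format_phone(groups[1], groups[3], groups[5], groups[8]) could replace the join inside the phone loop, and unique_in_order(matches) could be applied just before printing.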

Related: See our guide on How to Build a Multi-Threaded Web Scraper in Python if you want to create a multi-threaded scraper in Python.

Conclusion

This script extracts phone numbers and email addresses from every URL listed in the CSV file, and you can include as many URLs as you like. The phone number regex accepts a hyphen, a space, or a dot as the separator, and the area code and extension are optional. It does not match numbers written without a separator before the last four digits (for example, 4158639900); to support other formats, adjust the regular expression.

To keep things simple, the extracted data is printed to the console. The script can be modified to save the scraped data to a CSV file or another file format instead.
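
As a starting point for saving the results instead of printing them, the sketch below writes (URL, match) pairs with the csv module. The function name write_results, the column headers, and the sample rows are assumptions made for illustration. An in-memory buffer stands in for a real file so the example runs without touching the disk.

```python
import csv
import io


def write_results(rows, out_file):
    """Write a header line followed by one (url, match) pair per row."""
    writer = csv.writer(out_file)
    writer.writerow(["url", "match"])
    for url, match in rows:
        writer.writerow([url, match])


# Made-up results, shaped like what the scraper collects per page.
rows = [
    ("https://example.com/contact", "415-555-0100"),
    ("https://example.com/contact", "info@example.com"),
]

buffer = io.StringIO()
write_results(rows, buffer)
print(buffer.getvalue())
```

To write to disk, pass a file opened with open("results.csv", "w", newline="") instead of the buffer; results.csv is just an example name.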
