Email Extractor in Python - Web Scraping

This article is a step-by-step guide to extracting email addresses from the web using Python. The same Python email extractor can be used to scrape emails from various websites. In this tutorial we use it to extract emails from Google Search results.

Prerequisites:

  1. Python 3.0 installed on your system
  2. Modules
    • selenium
    • pyautogui
    • pyperclip
    • time
    • sys
    • random
    • re
I have listed quite a few modules, but you can manage with fewer (time, sys, random and re are part of the standard library). It depends on what you want your program to do.

The script has two parts, and the command line argument decides which one is executed:
  1. The first extracts emails from Google Search.
  2. The second extracts emails from the clipboard content, i.e. anything copied to the clipboard from a file or a web page.

First, list out the steps:
  1. For both parts we need to recognize emails. For this we can use a regular expression.
  2. Open the browser with the search query in the URL.
  3. Select all text in the current window (Ctrl + A) and copy it to the clipboard (Ctrl + C).
  4. Extract the emails and save them to a file.
  5. Click next on the search page.
  6. Repeat steps 3-5 while a next click is possible, otherwise stop.
  7. Extracting emails from the clipboard works the same way: you only have to extract and save to the file.
Use if-elif-else to check the command line argument and execute the corresponding part, like below:
# G - Google Search
# M - Manual
import sys

if len(sys.argv) > 1:  # check if there are command line arguments besides the file name
    if sys.argv[1] == 'G':
        pass  # code for extracting emails from Google
    elif sys.argv[1] == 'M':
        pass  # code for extracting emails from clipboard
else:
    print("Requires a command line argument")

1. Extract Emails From Google Search

1.1 Defining Regex Object

An email address follows a pattern:
First, any combination of the characters a-z, A-Z, 0-9 and ._%+- (no commas),
followed by the @ symbol,
followed by another combination of the characters a-z, A-Z, 0-9 and .- (again no commas),
followed by a dot and 2-4 letters.

Because this is a pattern, you can create an email address regex object to recognize it in your Python script.
emailRegex = re.compile(r'''(
 [a-zA-Z0-9._%+-]+ #username
 @    # @ symbol
 [a-zA-Z0-9.-]+    #domain
 (\.[a-zA-Z]{2,4}) #Dot and something
 )''',re.VERBOSE)

# Also open a file here 
File1=open("your_file_path_here","a+")
You can use triple quotes (''') to split your regex across multiple lines. The second argument, re.VERBOSE, makes the regex ignore that extra whitespace and allows comments inside it.
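As a quick check of the pattern, the sketch below runs the same regex on a made-up sample string (the addresses are invented for illustration). Because the pattern contains two capturing groups, findall() returns a tuple for each match, and the whole address is the first element of that tuple. This is why the later steps read groups[0].

```python
import re

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+  # username
    @                  # @ symbol
    [a-zA-Z0-9.-]+     # domain
    (\.[a-zA-Z]{2,4})  # dot and something
    )''', re.VERBOSE)

# Hypothetical sample text, not real addresses
sample = "Contact alice@example.com, or bob.smith@mail.example.org for details."

# Each match is a tuple (whole address, dot-and-something); keep the whole address
found = [groups[0] for groups in emailRegex.findall(sample)]
print(found)
```

Keep in mind that {2,4} limits the last part to four letters, so an address ending in a longer top-level domain would be cut short by this pattern.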

1.2 Open the Google URL with search query

If you observe the search URL every time you search on Google, you will find a pattern.
# this pattern is 
# https://www.google.com/search?q=Your_Search_Query
# so we can open the URL as
# Assuming that you have already created an instance of your browser as Browser

Browser.get("https://www.google.com/search?q=" + "@gmail.com"+ " " + "@yahoo.com" + " " +  "@rediffmail.com")

#maximize the window
Browser.maximize_window()

# you can add different domains to get more emails
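
The URL above relies on the browser to encode the spaces and @ signs. A minimal sketch of building the same query explicitly with the standard library's urllib.parse.quote_plus follows. The helper name build_search_url is hypothetical and not part of the original script.

```python
from urllib.parse import quote_plus

def build_search_url(domains):
    """Return a Google search URL for the given list of email domains."""
    # Join the domains into one query such as "@gmail.com @yahoo.com"
    query = " ".join("@" + d for d in domains)
    # quote_plus percent-encodes the @ signs and turns spaces into +
    return "https://www.google.com/search?q=" + quote_plus(query)

url = build_search_url(["gmail.com", "yahoo.com", "rediffmail.com"])
print(url)
```

The result can then be passed to the browser instance, for example Browser.get(url).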

1.3 Select and copy all content on search result page

To automate the mouse and keyboard we need the pyautogui module.
# bring the browser into focus, then select and copy
# the co-ordinates given to click() suit a 1600x900 resolution
# pyautogui key names are lowercase ('ctrl', 'a', 'c')
pyautogui.click(76,820);pyautogui.hotkey('ctrl','a');pyautogui.hotkey('ctrl','c')

# written on one line so that the copy executes immediately after 'Ctrl + A'

Note: Choose the co-ordinates of an empty area so that the click brings the browser window into focus.
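
The hard-coded co-ordinates only suit one resolution. One possible way to adapt them is sketched below, assuming that scaling the point proportionally still lands on an empty area of the page, which is not guaranteed. The helper scale_point is hypothetical and is not part of the original script.

```python
def scale_point(x, y, base=(1600, 900), current=(1600, 900)):
    """Scale a point chosen for the `base` resolution to the `current` one."""
    scaled_x = round(x * current[0] / base[0])
    scaled_y = round(y * current[1] / base[1])
    return scaled_x, scaled_y

# Example: the tutorial's point (76, 820) on a 1920x1080 screen
print(scale_point(76, 820, current=(1920, 1080)))
```

On a real desktop the current resolution can be read with pyautogui.size(), for example pyautogui.click(*scale_point(76, 820, current=pyautogui.size())). Check by hand that the scaled point is still an empty area before relying on it.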

1.4 Convert the copied text into plain text with str().

text = str(pyperclip.paste()) 

1.5 Create an empty list to save emails.

matches = []

1.6 Find the emails and append them to the matches list you created earlier.

for groups in emailRegex.findall(text):
	matches.append(groups[0])

1.7 Write all emails from matches list to a file.

for x in matches:
	File1.write(" %s \n" % str(x))
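
Search result pages often repeat the same address, so the file can fill up with duplicates. A small sketch of removing duplicates while keeping the first-seen order before writing is shown below. The function name unique_in_order and the sample addresses are assumptions for illustration and are not in the original script.

```python
def unique_in_order(items):
    """Return the items without duplicates, keeping first-seen order."""
    # A dict preserves insertion order (Python 3.7+), so its keys are the unique items
    return list(dict.fromkeys(items))

# Hypothetical matches list with one repeated address
matches = ["a@example.com", "b@example.com", "a@example.com"]
print(unique_in_order(matches))
```

This only removes duplicates within one matches list. Addresses that repeat across different result pages would still be written more than once unless you keep a set of everything seen so far.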

Note: Wait some time before clicking next button so that Google does not block your IP address.
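
The full script further down pauses for a random 60 to 100 seconds between pages. A minimal sketch of wrapping that pause in a function follows. The name polite_pause and the injectable sleep parameter are assumptions added here so the function can be tried out without actually waiting.

```python
import random
import time

def polite_pause(min_seconds=60, max_seconds=100, sleep=time.sleep):
    """Sleep for a random whole number of seconds in [min_seconds, max_seconds]."""
    delay = random.randint(min_seconds, max_seconds)
    sleep(delay)  # time.sleep by default; replaceable for a dry run
    return delay

# Dry run: record the chosen delay instead of sleeping
recorded = []
chosen = polite_pause(1, 3, sleep=recorded.append)
print(chosen, recorded)
```

In the script itself you would simply call polite_pause() before clicking the next button. The right range depends on your own requirements.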

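1.8 Click the next page button until it is no longer possible

Here we try to click the next page button. If that is not possible, we exit the program.
from selenium.webdriver.common.by import By

try:
    # find_element_by_id() was removed in Selenium 4.3; find_element(By.ID, ...) replaces it
    Browser.find_element(By.ID, "pnnext").click() # change the id if it is changed
except Exception:
    # no next button was found (or it could not be clicked), so stop
    sys.exit()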

1.9 Make it loop infinitely

So far we have seen how to extract emails from one SERP (search engine results page). Our goal is to extract them from all available pages, so we need to repeat the process until there are no pages left. Put the code from steps 1.3 to 1.8 in a while loop, like below:
while True:
 # Add code from steps 1.3 to 1.8 

2. Extracting emails from clipboard content

For this part, reuse the code from steps 1.4 to 1.7.

Source code of python script to extract email addresses

import pyautogui
from selenium import webdriver
from selenium.webdriver.common.by import By
import pyperclip, re, random
import time
import sys

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+ #username
    @     # @ symbol
    [a-zA-Z0-9.-]+ #domain
    (\.[a-zA-Z]{2,4}) #Dot and something
    )''',re.VERBOSE)

# Change this to the path of the file where the emails should be saved
File1=open("/home/AbhijitKokane/PYTHON/FOR BLOG/emails.txt","a+")

if len(sys.argv)>1:
    if sys.argv[1]=='G':
        B=webdriver.Firefox()
        B.get("https://www.google.com/search?q=" + "@gmail.com"+ " " + "@yahoo.com" + " " +  "@rediffmail.com") #Add more domains here and you will get more emails
        time.sleep(2)
        B.maximize_window()

        while True:
            # focus the browser, select all, copy
            pyautogui.click(76,820);pyautogui.hotkey('ctrl','a');pyautogui.hotkey('ctrl','c')
            text = str(pyperclip.paste())

            matches = []

            for groups in emailRegex.findall(text):
                matches.append(groups[0])

            for x in matches:
                File1.write(" %s \n" % str(x))

            time.sleep(random.randint(60,100)) # change pause time as per your requirement
            try:
                # find_element_by_id() was removed in Selenium 4.3
                B.find_element(By.ID, "pnnext").click()
            except Exception:
                # no next button left, so stop
                sys.exit()

    elif sys.argv[1]=='M':
        text = str(pyperclip.paste())

        matches = []

        for groups in emailRegex.findall(text):
            matches.append(groups[0])

        for x in matches:
            File1.write(" %s \n" % str(x))
else:
    print("Requires a command line argument")
To run this email extractor script, use the following syntax:
python filename.py argument, where argument can be 'G' or 'M'.

Note: You may face the "geckodriver not in PATH" issue:

Traceback (most recent call last):
  File "emailext.py", line 19, in <module>
    B=webdriver.Firefox()
  File "/usr/lib/python2.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 164, in __init__
    self.service.start()
  File "/usr/lib/python2.7/site-packages/selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

To resolve the geckodriver not in PATH problem, follow the steps below:
  1. Download the latest geckodriver from GitHub.
  2. Extract it and move the geckodriver file to the /usr/local/bin/ directory.
  3. Now you are ready to run your program.
Instead of moving it to bin, you can also add it to PATH temporarily with export PATH=$PATH:/path/to/geckodriver, where /path/to/geckodriver is the directory that contains the geckodriver file.
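
Before launching Firefox you can check from Python whether the driver is reachable. The sketch below uses the standard library's shutil.which, which searches PATH the same way the shell does. The helper name find_driver is hypothetical.

```python
import shutil

def find_driver(name="geckodriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)

path = find_driver()
if path is None:
    print("geckodriver was not found on PATH")
else:
    print("geckodriver found at", path)
```

If the check reports that the driver is missing, repeat the steps above before running the script with the 'G' argument.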

Challenges:

  1. For the first part ('G'), the program stops when there are no pages left. On the last page Google says something like "In order to show you the most relevant results, we have omitted some entries very similar to the 165 already displayed. If you wish, you will be able to repeat the search with the omitted results enclosed."

    Can you repeat the search to get more emails?
    Hint: use find_element(By.LINK_TEXT, ...), or find_element_by_link_text() on Selenium versions older than 4.3.

  2. You can see that the code for the second part ('M') repeats code from the first part. Can you optimize the program to avoid that?