How to Extract Specific Text from PDF in Python?

How to Extract Specific Text from PDF file in Python?

To extract specific text from a PDF file, you can use regular expressions. Python's built-in re module lets you build pattern-matching regular expressions.

For example, suppose our PDF file contains student information as follows:
samplepdf.pdf

Name: Rahul
Age: 28
Address: Pune

Name: John
Age: 27
Address: New York

Name: Rina
Age: 28
Address: Mumbai

Name: Scarlett
Age: 29
Address: New York
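If you want to extract only the names, you can create a regex object as shown below:

# Import Python's built-in regular expression module
import re

# Match the literal "Name:", one whitespace character, then one or more word characters
nameRegex = re.compile(r'Name:\s\w+')

# findall() returns a list of every non-overlapping match in the string
mo = nameRegex.findall("Name: Rahul Name: John")

print(mo)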

Output:

['Name: Rahul', 'Name: John']
Similarly, you can obtain the specific text that you want to extract from a PDF file using Python.
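
If only the name itself is wanted, without the "Name:" label, a capture group is one option. The following is a minimal sketch that reuses the two-student sample string from the example above; the variable name names is illustrative and not part of the original code.

```python
import re

# Parentheses form a capture group. When a pattern has exactly one group,
# findall() returns the text of that group instead of the whole match.
nameRegex = re.compile(r'Name:\s(\w+)')

names = nameRegex.findall("Name: Rahul Name: John")

print(names)  # ['Rahul', 'John']
```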

Shorthand Codes for Common Character Classes
Shorthand character class    Represents
\d                           Any numeric digit from 0 to 9.
\D                           Any character that is not a numeric digit from 0 to 9.
\w                           Any letter, numeric digit, or the underscore character. (Think of this as matching "word" characters.)
\W                           Any character that is not a letter, numeric digit, or the underscore character.
\s                           Any space, tab, or newline character. (Think of this as matching "space" characters.)
\S                           Any character that is not a space, tab, or newline.
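
As a short illustration of these shorthand classes, the sketch below pulls the ages out of a small sample string and splits the same string into non-whitespace tokens. The sample string and the variable names sample, ages, and tokens are assumptions made for this example.

```python
import re

# Hypothetical sample text in the same layout as the student records
sample = "Name: Rahul\nAge: 28\nName: John\nAge: 27"

# \s matches the space after "Age:" and \d+ matches one or more digits
ages = re.findall(r'Age:\s(\d+)', sample)

# \S+ matches a run of non-whitespace characters, so the text splits into tokens
tokens = re.findall(r'\S+', sample)

print(ages)    # ['28', '27']
print(tokens)  # ['Name:', 'Rahul', 'Age:', '28', 'Name:', 'John', 'Age:', '27']
```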

That covers extracting a specific part of a text. To apply it to a PDF, we first need to convert the PDF file into text, i.e. extract the text from the PDF file.

Python Code to Extract Text from PDF file

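import PyPDF2

# Note: PdfFileReader, numPages, getPage and extractText are the names
# used by PyPDF2 releases before 3.0.

# Open the PDF in binary mode; the with-block closes the file afterwards
with open('samplepdf.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    no_of_pages = pdfReader.numPages

    text_string = ""

    # Pages are indexed from 0 to no_of_pages - 1
    for i in range(no_of_pages):
        pageObj = pdfReader.getPage(i)

        text_string += pageObj.extractText()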

Now that you have all the text from the PDF, you can extract just the information you need.
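
PyPDF2 3.0 and later no longer accept the PdfFileReader, numPages, getPage and extractText names used in the listing above. The following is a hedged sketch of the same extraction step, assuming PyPDF2 3.0 or later is installed; the helper name extract_pdf_text is hypothetical and introduced only for this example.

```python
def extract_pdf_text(path):
    """Return the concatenated text of every page in the PDF at `path`."""
    # Imported inside the function so that PyPDF2 is only required
    # when the function is actually called.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)

    # reader.pages behaves like a list of page objects; extract_text()
    # returns the text of one page ("or ''" guards against an empty result).
    return "".join(page.extract_text() or "" for page in reader.pages)
```

With the sample file in the working directory, calling extract_pdf_text('samplepdf.pdf') would produce the same kind of string as text_string above.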

Python Code to Extract Specific Text from a PDF

import re

import PyPDF2

# Step 1: read the text of every page into one string
with open('samplepdf.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    no_of_pages = pdfReader.numPages

    text_string = ""

    # Pages are indexed from 0 to no_of_pages - 1
    for i in range(no_of_pages):
        pageObj = pdfReader.getPage(i)

        text_string += pageObj.extractText()

# Step 2: find every "Name: <word>" occurrence in the extracted text
nameRegex = re.compile(r'Name:\s\w+')

mo = nameRegex.findall(text_string)

print(mo)

Output:

['Name: Rahul', 'Name: John', 'Name: Rina', 'Name: Scarlett']
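
The same approach extends to pulling several fields at once. The sketch below uses named groups to turn each student into a small dictionary. It assumes the extracted text keeps each Name line directly before its Age line, which may not hold for every PDF; the inline text_string is a stand-in for the real extracted text, and recordRegex and records are illustrative names.

```python
import re

# Stand-in for the extracted text; real PDF output may be laid out differently.
text_string = "Name: Rahul\nAge: 28\n\nName: John\nAge: 27\n"

# Named groups (?P<name>...) label each captured field.
# \s+ allows the line break between the Name and Age lines.
recordRegex = re.compile(r'Name:\s(?P<name>\w+)\s+Age:\s(?P<age>\d+)')

records = [
    {"name": m.group("name"), "age": int(m.group("age"))}
    for m in recordRegex.finditer(text_string)
]

print(records)
# [{'name': 'Rahul', 'age': 28}, {'name': 'John', 'age': 27}]
```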