How to Extract Specific Text from PDF in Python?
To extract specific text from a PDF file, you can use regular expressions. Python's re module helps you build pattern-matching regular expressions. For example, suppose our PDF file contains student information as follows:
samplepdf.pdf
Name: Rahul Age: 28 Address: Pune
Name: John Age: 27 Address: New York
Name: Rina Age: 28 Address: Mumbai
Name: Scarlett Age: 29 Address: New York
If you want to extract only the names, you can create a regex object as given below:
# import regex module
import re
nameRegex = re.compile(r'Name:\s\w+')
mo = nameRegex.findall("Name: Rahul Name: John")
print(mo)
Output:
['Name: Rahul', 'Name: John']
Similarly, you can obtain the specific text that you want to extract from a PDF file using Python.
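If you only need the names themselves, without the `Name:` label, a capture group is one option. The following is a minimal sketch that assumes the same label format as above; the variable names are illustrative and not part of the original example.

```python
import re

# Parentheses create a capture group. With exactly one group,
# findall() returns the captured part instead of the whole match.
name_only_regex = re.compile(r'Name:\s(\w+)')

sample_text = "Name: Rahul Age: 28 Name: John Age: 27"
names = name_only_regex.findall(sample_text)
print(names)  # ['Rahul', 'John']
```

Note that `\w+` stops at the first space, so a two-word name would be cut after its first word.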
Shorthand Codes for Common Character Classes
Shorthand character class | Represents |
---|---|
\d | Any numeric digit from 0 to 9. |
\D | Any character that is not a numeric digit from 0 to 9. |
\w | Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.) |
\W | Any character that is not a letter, numeric digit, or the underscore character. |
\s | Any space, tab, or newline character. (Think of this as matching “space” characters.) |
\S | Any character that is not a space, tab, or newline. |
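To see a few of these shorthand classes in action, here is a small sketch run on a made-up sample string (the string and variable names are assumptions for illustration only).

```python
import re

sample_text = "Name: Rina Age: 28"

# \d+ matches one or more digits, here the age value
ages = re.findall(r'Age:\s(\d+)', sample_text)

# \S+ matches each run of non-space characters
tokens = re.findall(r'\S+', sample_text)

# \W matches anything that is not a letter, digit, or underscore
non_word = re.findall(r'\W', "a_b-c d")

print(ages)      # ['28']
print(tokens)    # ['Name:', 'Rina', 'Age:', '28']
print(non_word)  # ['-', ' ']
```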
So far we have looked at extracting a specific part of a text string. To apply this to a PDF, we first need to convert the PDF file into text, i.e. extract the text from the PDF file.
Python Code to Extract Text from PDF file
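import PyPDF2

# Open the PDF in binary mode and collect the text of every page.
# Uses the PyPDF2 3.x API; older releases used PdfFileReader,
# numPages, getPage(i) and extractText() instead.
with open('samplepdf.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    no_of_pages = len(pdfReader.pages)
    text_string = ""
    # Page indices run from 0 to no_of_pages - 1
    for i in range(no_of_pages):
        pageObj = pdfReader.pages[i]
        text_string += pageObj.extract_text()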
Now that you have all the text from the PDF in text_string, you can extract just the information you need.
Python Code to Extract Specific Text from a PDF
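import re
import PyPDF2

# Step 1: extract the text of every page.
# Uses the PyPDF2 3.x API; older releases used PdfFileReader,
# numPages, getPage(i) and extractText() instead.
with open('samplepdf.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    no_of_pages = len(pdfReader.pages)
    text_string = ""
    # Page indices run from 0 to no_of_pages - 1
    for i in range(no_of_pages):
        pageObj = pdfReader.pages[i]
        text_string += pageObj.extract_text()

# Step 2: find every "Name: <word>" occurrence in the extracted text
nameRegex = re.compile(r'Name:\s\w+')
mo = nameRegex.findall(text_string)
print(mo)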
Output:
['Name: Rahul', 'Name: John', 'Name: Rina', 'Name: Scarlett']
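The same idea extends to pulling several fields per student at once. The sketch below uses named groups and `finditer()`. It runs on a short stand-in string rather than a real PDF, and the `record_regex` and `students` names are hypothetical, introduced only for this illustration.

```python
import re

# Stand-in for the text_string built from the PDF (trimmed to two fields)
text_string = "Name: Rahul Age: 28\nName: John Age: 27\nName: Rina Age: 28"

# Named groups label each captured field; \s+ also tolerates line breaks
record_regex = re.compile(r'Name:\s+(?P<name>\w+)\s+Age:\s+(?P<age>\d+)')

# Build one dictionary per matched record, converting the age to int
students = [
    {"name": m.group("name"), "age": int(m.group("age"))}
    for m in record_regex.finditer(text_string)
]
print(students)
```

Text extracted from a real PDF may contain unexpected spacing or line breaks, so the pattern may need adjusting for a given file.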