Thursday, December 20, 2018

Work with PDF Documents Using Python

Python can work with PDF documents through the third-party PyPDF2 module. With this module you can:
  • Extract document information
  • Split documents page by page
  • Merge documents page by page
  • Crop pages
  • Merge multiple pages into a single page
  • Encrypt and decrypt PDF files

To use the PyPDF2 module, first install it from the command line as shown below:

pip install PyPDF2

Note that the examples in this post use the PyPDF2 1.x names (PdfFileReader, PdfFileWriter, getPage() and so on). PyPDF2 3.0 removed these names in favor of PdfReader and PdfWriter, so the code as written needs a release older than 3.0.

After the installation is done, you are ready to use Python with your PDF documents. Let's start with a program that reads a PDF file. See the code below:

import PyPDF2

myfile = open('Algorithms For Dummies.pdf','rb')

readfile = PyPDF2.PdfFileReader(myfile)

First, import the PyPDF2 module. Then open 'Algorithms For Dummies.pdf' in read binary mode and store the file object in myfile. To get a PdfFileReader object that represents this PDF, call PyPDF2.PdfFileReader() and pass it myfile. Store this PdfFileReader object in readfile. Once we have the PdfFileReader object, we can use it to extract elements from the PDF document. Some of the common operations are:
1. Checking the number of pages 

The total number of pages in the document is stored in the numPages attribute of a PdfFileReader object. We can use this attribute as shown in the following program:

import PyPDF2

myfile = open('Algorithms For Dummies.pdf','rb')

readfile = PyPDF2.PdfFileReader(myfile)

print(readfile.numPages)

We can also use the getNumPages() method of the PdfFileReader object to get the number of pages of the PDF document, as shown in the code:

print(readfile.getNumPages())

Use both approaches in the same program and run it. The output should be as shown below:

435
435

------------------
(program exited with code: 0)

Press any key to continue . . . 

2. Extracting text from page

To extract text from a PDF page, we first get a Page object, which represents a single page of a PDF, from a PdfFileReader object. This is done by calling the getPage() method on the PdfFileReader object with the page number as its argument. A point to remember is that PyPDF2 uses a zero-based index for getting pages: the first page is page 0, the second is page 1, and so on. This is always the case, even if pages are numbered differently within the document.

After we have the Page object we call its extractText() method to return a string of the page’s text. The following example implements this logic:

import PyPDF2

myfile = open('Algorithms For Dummies.pdf','rb')

readfile = PyPDF2.PdfFileReader(myfile)

page_obj = readfile.getPage(18)

print(page_obj.extractText())

 
The line page_obj = readfile.getPage(18) returns an instance of PyPDF2's PageObject class. The PdfFileReader object's getPage() method takes a page number (starting from index 0) as its argument and returns the corresponding page object. We passed 18 to get a reference to page 19 of our document.
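
Because the index is always one less than the page number a reader sees, a small conversion helper can guard against off-by-one mistakes. The sketch below is only an illustration: the name page_index is hypothetical and not part of PyPDF2, and the 435-page count is taken from the earlier output.

```python
def page_index(page_number, num_pages):
    """Convert a 1-based page number to the 0-based index getPage() expects."""
    if not 1 <= page_number <= num_pages:
        raise ValueError(
            "page %d is outside the range 1..%d" % (page_number, num_pages)
        )
    return page_number - 1


# Page 19 of a 435-page document is index 18, as in the example above.
print(page_index(19, 435))  # prints 18
```

In the program above this could be used as readfile.getPage(page_index(19, readfile.numPages)).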

The page object has an extractText() method that returns the text of the PDF page. The line print(page_obj.extractText()) extracts the text and prints it in one step. Run the program to see the extracted text as shown below:

Introduction Where to Go from Here

pace that allows you to absorb as much of the material as possible. Make sure to

read about Python because the book uses this language as needed for the
examples.

-ples wonTt work with the 2.
x version of Python because this version doesnTt sup
-
port some of the packages we use.Readers who have some exposure to Python, and h
ave the appropriate language

always go back to earlier chapters as necessary when you have questions. However
,
you do need to understand how each technique works before moving to the next
one. Every technique, coding example, and procedure has important lessons for
you, and you could miss vital content if you start skipping too much information.

------------------
(program exited with code: 0)

Press any key to continue . . .

As you may have noticed, the text extraction isn’t perfect. PDF files are great for laying out text in a way that’s easy for people to print and read, but they’re not straightforward for software to parse into plaintext. As such, PyPDF2 might make mistakes when extracting text from a PDF and may even be unable to open some PDFs at all. There isn’t much you can do about this, unfortunately. PyPDF2 may simply be unable to work with some of your particular PDF files.

To practice your file handling skills, save the extracted text to a file instead of printing it.
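
One possible approach to this exercise is sketched below. The helper name save_text and the output file name page19.txt are hypothetical, and a short stand-in string is used where the real program would pass the result of page_obj.extractText().

```python
import os
import tempfile


def save_text(text, path):
    """Write extracted page text to a UTF-8 encoded text file."""
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(text)


# Stand-in string; in the real program this would be page_obj.extractText().
sample_text = "Introduction Where to Go from Here\n"
out_path = os.path.join(tempfile.gettempdir(), "page19.txt")
save_text(sample_text, out_path)

# Read the file back to confirm the text survived the round trip.
with open(out_path, encoding="utf-8") as in_file:
    print(in_file.read() == sample_text)
```

Opening the file with an explicit encoding avoids surprises when the extracted text contains characters outside the platform's default code page.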

3. Decrypting PDFs

Decryption is required because some PDF documents have an encryption feature that keeps them from being read until whoever is opening the document provides a password. All PdfFileReader objects have an isEncrypted attribute that is True if the PDF is encrypted and False if it isn’t. See the code below:

import PyPDF2

readfile = PyPDF2.PdfFileReader(open('Algorithms For Dummies.pdf','rb'))

print(readfile.isEncrypted)

Run this program and the output is False, because this document is not encrypted.

False
------------------
(program exited with code: 0)

Press any key to continue . . .

Now try the same program using an encrypted document as shown below:

import PyPDF2

readfile = PyPDF2.PdfFileReader(open('Exceptions in Python.pdf','rb'))

print(readfile.isEncrypted)

Run this program and the output is True, because this document is encrypted.

True 
------------------
(program exited with code: 0)

Press any key to continue . . .

Any attempt to call a function that reads the file before it has been decrypted with the correct password will result in an error. Let's call the getPage() method on an encrypted document and see the result:

import PyPDF2

readfile = PyPDF2.PdfFileReader(open('Exceptions in Python.pdf','rb'))

page_obj = readfile.getPage(0)

When we run this program we get the following output:

Traceback (most recent call last):
  File "readingpdf.py", line 7, in <module>
    page_obj = readfile.getPage(0)
  File "C:\Users\Python\AppData\Local\Programs\Python\Python36\lib\site-packages
\PyPDF2\pdf.py", line 1176, in getPage
    self._flatten()
  File "C:\Users\Python\AppData\Local\Programs\Python\Python36\lib\site-packages
\PyPDF2\pdf.py", line 1505, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "C:\Users\Python\AppData\Local\Programs\Python\Python36\lib\site-packages
\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Users\Python\AppData\Local\Programs\Python\Python36\lib\site-packages
\PyPDF2\generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Users\Python\AppData\Local\Programs\Python\Python36\lib\site-packages
\PyPDF2\pdf.py", line 1617, in getObject
    raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted
------------------
(program exited with code: 1)

Press any key to continue . . .

 
The PdfFileReader object has a decrypt() method which can be used to read encrypted documents. See the code below:

import PyPDF2

readfile = PyPDF2.PdfFileReader(open('Exceptions in Python.pdf','rb'))

readfile.decrypt('abc@123')

page_obj = readfile.getPage(0)

Now run the program. It worked, right?

After we call decrypt() with the correct password, calling getPage() no longer causes an error. If given the wrong password, the decrypt() method returns 0 and getPage() continues to fail.
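
Since a wrong password does not raise an error at the decrypt() call itself, it can help to check the return value before reading pages. The sketch below assumes that decrypt() returns a non-zero value on success. The helper name unlock and the FakeReader class are hypothetical: FakeReader only mimics the two members the helper touches, so the example runs without a real encrypted file.

```python
def unlock(reader, password):
    """Decrypt reader if needed; return True when its pages can be read."""
    if not reader.isEncrypted:
        return True
    # Assumption: decrypt() returns 0 for a wrong password, non-zero otherwise.
    return reader.decrypt(password) != 0


class FakeReader:
    """Hypothetical stand-in for a PdfFileReader on an encrypted file."""
    isEncrypted = True

    def decrypt(self, password):
        return 1 if password == "abc@123" else 0


print(unlock(FakeReader(), "abc@123"))  # True
print(unlock(FakeReader(), "wrong"))    # False
```

With a real document, the same helper would be called as unlock(readfile, 'abc@123') before the first getPage() call.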

A key point to remember is that the decrypt() method decrypts only the PdfFileReader object, not the actual PDF file. After your program terminates, the file on your hard drive remains encrypted. Your program will have to call decrypt() again the next time it is run.

PyPDF2 may not understand some encryption schemes. In that case it raises NotImplementedError: only algorithm code 1 and 2 are supported.

4. Creating PDFs

I was wondering if it's possible to create new PDF documents from scratch, and found that PdfFileWriter objects let us create new PDF files. But PyPDF2 cannot write arbitrary text to a PDF the way Python can with plaintext files. Instead, PyPDF2’s PDF-writing capabilities are limited to copying pages from other PDFs, rotating pages, overlaying pages, and encrypting files. Also, PyPDF2 doesn’t allow us to directly edit a PDF. Instead, we have to create a new PDF and then copy content over from an existing document.

See the example below:

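import PyPDF2

# Open both source PDFs in read binary mode; they must stay open until write().
pdf1File = open('The write.pdf', 'rb')
pdf2File = open('The write1.pdf', 'rb')

pdf1Reader = PyPDF2.PdfFileReader(pdf1File)
pdf2Reader = PyPDF2.PdfFileReader(pdf2File)

# A new PdfFileWriter represents a blank PDF document.
pdfWriter = PyPDF2.PdfFileWriter()

# Copy every page of the first PDF, then every page of the second.
for pageNum in range(pdf1Reader.numPages):
    pageObj = pdf1Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

for pageNum in range(pdf2Reader.numPages):
    pageObj = pdf2Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

# Write the combined document, then close all three files.
pdfOutputFile = open('combineddocs.pdf', 'wb')
pdfWriter.write(pdfOutputFile)
pdfOutputFile.close()
pdf1File.close()
pdf2File.close()
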


I have two separate PDF documents, The write.pdf and The write1.pdf, which contain some text. I am merging these two documents and storing the combined PDF as combineddocs.pdf. I open both PDF files in read binary mode and store the two resulting File objects in pdf1File and pdf2File.

Then I call PyPDF2.PdfFileReader() and pass it pdf1File to get a PdfFileReader object for The write.pdf, and call PyPDF2.PdfFileReader() again with pdf2File to get a PdfFileReader object for The write1.pdf.

Next I create a new PdfFileWriter object, which represents a blank PDF document, and copy all the pages from the two source PDFs into it. For each page of the first document, I get the Page object by calling getPage() on pdf1Reader and pass that Page object to the PdfFileWriter’s addPage() method.

I then do the same with pdf2Reader: I get each Page object by calling getPage() and pass it to the PdfFileWriter’s addPage() method.

Finally I write a new PDF called combineddocs.pdf by passing a File object to the PdfFileWriter’s write() method. Run this program and check your program directory to verify that combineddocs.pdf has been created there and contains the merged content of The write.pdf and The write1.pdf.
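
The two loops in the program do the same job for different readers, so they could be folded into one helper. The sketch below is one way to do that, not part of PyPDF2: copy_pages, StubReader and StubWriter are hypothetical names, and the stubs only mimic numPages, getPage() and addPage() so the example runs without any PDF files.

```python
def copy_pages(reader, writer, start=0):
    """Append pages start..last of reader to writer; return the number copied."""
    copied = 0
    for page_num in range(start, reader.numPages):
        writer.addPage(reader.getPage(page_num))
        copied += 1
    return copied


class StubReader:
    """Hypothetical stand-in exposing numPages and getPage()."""
    def __init__(self, pages):
        self._pages = pages
        self.numPages = len(pages)

    def getPage(self, page_num):
        return self._pages[page_num]


class StubWriter:
    """Hypothetical stand-in exposing addPage()."""
    def __init__(self):
        self.pages = []

    def addPage(self, page):
        self.pages.append(page)


writer = StubWriter()
print(copy_pages(StubReader(["a1", "a2"]), writer))  # 2
print(copy_pages(StubReader(["b1"]), writer))        # 1
print(writer.pages)                                  # ['a1', 'a2', 'b1']
```

With real objects, the merge becomes copy_pages(pdf1Reader, pdfWriter) followed by copy_pages(pdf2Reader, pdfWriter). The start parameter allows skipping leading pages, for example a cover page that has already been added separately.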


5. Rotating pages in PDFs
 
The pages of a PDF can also be rotated in 90-degree increments with the rotateClockwise() and rotateCounterClockwise() methods. Pass one of the integers 90, 180, or 270 to these methods. See the code below:

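import PyPDF2

pdf1File = open('The write.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdf1File)

# Rotate the first page by 90 degrees clockwise.
page = pdfReader.getPage(0)
page.rotateClockwise(90)

pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(page)

# Write the rotated page to a new file, then close both files.
rotatedPdfFile = open('rotatedTheWritePage.pdf', 'wb')
pdfWriter.write(rotatedPdfFile)
rotatedPdfFile.close()
pdf1File.close()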

When we run this program, the resulting PDF, stored in your program directory, will have one page, rotated 90 degrees clockwise.
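
Because only multiples of 90 degrees are accepted, it can be convenient to validate and normalize an angle before passing it on. The helper below is a small sketch with a hypothetical name, clockwise_angle. It relies only on the fact that a counterclockwise turn of 90 degrees equals a clockwise turn of 270.

```python
def clockwise_angle(angle):
    """Normalize a rotation to 0, 90, 180 or 270 degrees clockwise."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360


print(clockwise_angle(90))   # 90
print(clockwise_angle(-90))  # 270, the same as 90 degrees counterclockwise
print(clockwise_angle(450))  # 90
```

In the program above, the result could be passed on as page.rotateClockwise(clockwise_angle(-90)).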


 
6. Overlaying pages in PDFs

It is possible to overlay the contents of one page over another, which is useful for adding a logo, timestamp, or watermark to a page. With Python, it’s easy to add watermarks to multiple files and only to pages your program specifies. See the example below:

import PyPDF2

pdf1File = open('The write.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdf1File)

# Overlay the first page of watermark.pdf on the first page of the document.
first_page = pdfReader.getPage(0)
watermarkFile = open('watermark.pdf', 'rb')
pdfWatermarkReader = PyPDF2.PdfFileReader(watermarkFile)
first_page.mergePage(pdfWatermarkReader.getPage(0))

pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(first_page)

# Copy the remaining pages unchanged.
for pageNum in range(1, pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

watermarkedPdfFile = open('watermarkedCover.pdf', 'wb')
pdfWriter.write(watermarkedPdfFile)
watermarkedPdfFile.close()
pdf1File.close()
watermarkFile.close()

I used a watermark file, 'watermark.pdf', and overlaid its first page on the first page of my 'The write.pdf' document. When you run this program you'll find a new document, watermarkedCover.pdf, in your program directory. Its first page is watermarked, and it contains all the contents of the 'The write.pdf' document.


7. Encrypting PDFs

It is also possible to add encryption to a PDF document using a PdfFileWriter object. See the code below:

import PyPDF2

pdf1File = open('watermarkedCover1.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdf1File)

pdfWriter = PyPDF2.PdfFileWriter()

# Copy every page into the writer.
for pageNum in range(pdfReader.numPages):
    pdfWriter.addPage(pdfReader.getPage(pageNum))

# encrypt() is called before write() so the saved file is password protected.
pdfWriter.encrypt('abc@123')
encryptedPdfFile = open('encryptedwatermarkedCover1.pdf', 'wb')
pdfWriter.write(encryptedPdfFile)
encryptedPdfFile.close()
pdf1File.close()


We call the encrypt() method, passing it a password string, before calling the write() method to save to a file. The user password and owner password are the first and second arguments to encrypt(), respectively. If only one string argument is passed to encrypt(), it will be used for both passwords.

Here we copied the pages of watermarkedCover1.pdf to a PdfFileWriter object, encrypted the PdfFileWriter with the password abc@123, opened a new PDF called encryptedwatermarkedCover1.pdf, and wrote the contents of the PdfFileWriter to the new PDF. Before anyone can view encryptedwatermarkedCover1.pdf, they’ll have to enter this password.

This brings us to the end of our discussion. Till we meet next, keep learning Python, as Python is easy to learn!