PyPDF4 - Exported PDF file size too big

2024/4/15 0:49:39

I have a PDF file of around 7000 pages and 479 MB. I have create a python script using PyPDF4 to extract only specific pages if the pages contain specific words. The script works but the new PDF file, even though it has only 650 pages from the original 7000, now has more MB that the original file (498 MB to be exactly).

Is there any way to lower the filesize of the new PDF?

The script I used:

from PyPDF4 import PdfFileWriter, PdfFileReader
import os
import reoutput = PdfFileWriter()input = PdfFileReader(open('Binder.pdf', 'rb')) # open inputfor i in range(0, input.getNumPages()):content = ""content += input.getPage(i).extractText() + "\n"#Format 1RS ='FIGURE', content)RS1 = #... Only one search given as example. I have more, but are irrelevant for the question.#....# Format 2RS20 ='FIG.', content)RS21 = #... Only one search given as example. I have more, but are irrelevant for the question.#....if (all(v is not None for v in [RS, RS1, RS2, RS3, RS4, RS5, RS6, RS7, RS8, RS9]) or all(v is not None for v in [RS20, RS21, RS22, RS23, RS24, RS25, RS26, RS27, RS28, RS29, RS30, RS30])):p = input.getPage(i)output.addPage(p)#Save pages to new PDF file
with open('ExtractedPages.pdf', 'wb') as f:output.write(f)

After a lot of searching found some solutions. The only problem with the exported PDF file was that it was uncompressed. So I needed a solution to compress a PDF file:

  1. PyPDF2 and/or PyPDF4 do not have an option to compress PDFs. PyPDF2 had the compressContentStreams() method, which doesn't work.

  2. Found a few other solutions that claim to compress PDFs, but none worked for me (adding them here just in case they work for others): pylovepdf ; pdfsizeopt ; pdfc

  3. The first solution that worked for me was Adobe Acrobat professional. It reduced the size from 498 MB to 2.99 MB.

  4. [Best Solution] As an alternative, open source solution that works, I found coherentpdf. For Windows you can download the pre-built PDF squeezer tool. Then in cmd:

    cpdfsqueeze.exe input.pdf output.pdf

This actually compressed the PDF even more than Adobe Acrobat. From 498 MB to 2.48 MB. Compressed to 0.5% from original. I think this is the best solution as it can be added to your Python Code.

  1. Edit: Found another Free solution that has a GUI also. PDFsam. You can use the Merge feature on one PDF file, and in the advanced Settings make sure you have the Compress Output checked. This compressed from 498 to 3.2 MB. enter image description here

Related Q&A

Jupyter install fails on Mac

Im trying to install Jupyter on my Mac (OS X El Capitan) and Im getting an error in response to:sudo pip install -U jupyterAt first the download/install starts fine, but then I run into this:Installing…

Python Error Codes are upshifted

Consider a python script error.pyimport sys sys.exit(3)Invokingpython; echo $?yields the expected "3". However, consider runner.pyimport os result = os.system("python…

Running dozens of Scrapy spiders in a controlled manner

Im trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (e.g. this one and this oth…

How to merge two DataFrame columns and apply pandas.to_datetime to it?

Im learning to use pandas, to use it for some data analysis. The data is supplied as a csv file, with several columns, of which i only need to use 4 (date, time, o, c). Ill like to create a new DataFr…

Breaking a parent function from within a child function (PHP Preferrably)

I was challenged how to break or end execution of a parent function without modifying the code of the parent, using PHPI cannot figure out any solution, other than die(); in the child, which would end …

Get absolute path of caller file

Say I have two files in different directories: (say, in C:/FIRST_FOLDER/ and (say, in C:/SECOND_FOLDER/ file imports (using sys.path.insert(0, followed,…

Pandas dataframe to excel gives file is not UTF-8 encoded

Im working on lists that I want to export into an Excel file.I found a lot of people advising to use pandas.dataframe so thats what I did. I could create the dataframe but when I try to export it to Ex…

Finding complex roots from set of non-linear equations in python

I have been testing an algorithm that has been published in literature that involves solving a set of m non-linear equations in both Matlab and Python. The set of non-linear equations involves input v…

Python on Raspberry Pi user input inside infinite loop misses inputs when hit with many

I have a very basic parrot script written in Python that simply prompts for a user input and prints it back inside an infinite loop. The Raspberry Pi has a USB barcode scanner attached for the input.wh…

How to append two bytes in python?

Say you have b\x04 and b\x00 how can you combine them as b\x0400?