I have a PDF file of around 7000 pages and 479 MB. I have created a Python script using PyPDF4 that extracts only the pages containing specific words. The script works, but the new PDF file, even though it has only 650 of the original 7000 pages, is now larger than the original file (498 MB, to be exact).
Is there any way to lower the filesize of the new PDF?
The script I used:
from PyPDF4 import PdfFileWriter, PdfFileReader
import os
import re

output = PdfFileWriter()
input = PdfFileReader(open('Binder.pdf', 'rb')) # open input

for i in range(0, input.getNumPages()):
    content = ""
    content += input.getPage(i).extractText() + "\n"

    #Format 1
    RS = re.search('FIGURE', content)
    RS1 = #... Only one search given as example. I have more, but are irrelevant for the question.
    #....

    # Format 2
    RS20 = re.search('FIG.', content)
    RS21 = #... Only one search given as example. I have more, but are irrelevant for the question.
    #....

    if (all(v is not None for v in [RS, RS1, RS2, RS3, RS4, RS5, RS6, RS7, RS8, RS9]) or all(v is not None for v in [RS20, RS21, RS22, RS23, RS24, RS25, RS26, RS27, RS28, RS29, RS30, RS30])):
        p = input.getPage(i)
        output.addPage(p)

#Save pages to new PDF file
with open('ExtractedPages.pdf', 'wb') as f:
    output.write(f)
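
To make the selection rule easier to follow: a page is kept when every search in the first group matches, or every search in the second group matches. A minimal sketch of that condition as a standalone function is below. The names `FORMAT_1`, `FORMAT_2` and `page_matches` are hypothetical, and each list holds only the one pattern shown in the script. One deliberate difference from the script: the patterns are passed through `re.escape`, because in a regular expression an unescaped `.` matches any character, so `'FIG.'` would also match text such as `FIGS`.

```python
import re

# Hypothetical pattern groups; only 'FIGURE' and 'FIG.' come from the script,
# the remaining searches are omitted there and are not guessed here.
FORMAT_1 = ['FIGURE']
FORMAT_2 = ['FIG.']


def page_matches(text, groups=(FORMAT_1, FORMAT_2)):
    """Return True if every pattern of at least one group occurs in text."""
    return any(
        # re.escape treats each pattern as literal text (assumption: the
        # searches are meant as plain words, not regular expressions).
        all(re.search(re.escape(pattern), text) for pattern in group)
        for group in groups
    )
```

Inside the page loop, this would presumably replace the long `if` with something like `if page_matches(content):`. The matching is case-sensitive, as in the script.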