What is the best way to extract data from a PDF?

2024/10/5 9:27:14

I have thousands of PDF files that I need to extract data from. This is an example PDF. I want to extract this information from the example PDF.


I am open to Node.js, Python, or any other effective method. I have little experience with Python and Node.js. I attempted this in Python with the following code:

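import PyPDF2

# Note: PdfFileReader, numPages, getPage and extractText are the legacy
# PyPDF2 names; they no longer work in PyPDF2 3.0 and later.
try:
    # Open the file in binary mode and close it automatically afterwards
    with open('test.pdf', 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageNumber = pdfReader.numPages      # total number of pages
        page = pdfReader.getPage(0)          # first page only
        print(pageNumber)
        pagecontent = page.extractText()
        print(pagecontent)
except Exception as e:
    print(e)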

but I got stuck on how to find the procurement history. What is the best way to extract the procurement history from the pdf?

Answer

pdfplumber is a good option for this: it extracts text page by page and can also extract tables.

Installation

pip install pdfplumber

Extract all the text

import pdfplumber

path = 'path_to_pdf.pdf'
with pdfplumber.open(path) as pdf:
    for page in pdf.pages:
        print(page.extract_text())
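Printing every page still leaves the question of how to isolate the procurement history. One possible approach, as a rough sketch only, is to collect the text of all pages and keep the lines between the section heading and the heading that follows it. The function names below are hypothetical, and so are the headings: "Procurement History" is taken from the question, while "Next Section" is a placeholder that has to be replaced with whatever heading actually follows in your PDFs.

```python
def extract_section(text, start_heading, end_heading=None):
    """Return the lines found after start_heading and before end_heading.

    Matching is case-insensitive and substring-based. If end_heading is
    None (or never found), everything after start_heading is returned.
    """
    lines = text.splitlines()
    start = None
    for i, line in enumerate(lines):
        if start is None:
            if start_heading.lower() in line.lower():
                start = i + 1  # section content begins on the next line
        elif end_heading and end_heading.lower() in line.lower():
            return lines[start:i]
    return lines[start:] if start is not None else []


def pdf_text(path):
    """Join the text of every page of the PDF at `path`."""
    import pdfplumber  # imported here so extract_section works without it

    with pdfplumber.open(path) as pdf:
        # extract_text() may return None for a page with no text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def procurement_history(path):
    # ASSUMPTION: both heading strings are placeholders; adjust them
    # to match the wording used in the real documents.
    return extract_section(pdf_text(path), "Procurement History", "Next Section")
```

For thousands of files, the same function could be called in a loop over something like pathlib.Path('pdfs').glob('*.pdf'). If the procurement history is laid out as a table, page.extract_tables() in pdfplumber may give cleaner rows than slicing plain text.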