How to extract a table column from a PDF and store it in a variable in Python

2024/10/5 15:00:33

I have 3 tables (image pasted). All 3 tables look the same and have the same columns, and I want the data from the address column (highlighted in yellow) of all 3 tables stored in a variable.

Answer

There are different ways to extract tables from a PDF. The right solution depends primarily on the individual PDF you need to read. Some variables to think about when choosing a solution are:

  • Is the PDF just an image saved as a PDF (a rasterized image of a scanned document)?
  • What is the quality of the PDF?
  • Is there any noise in the PDF files (e.g. spots caused by a printer) that you need to get rid of?
  • Is the table in the PDF skewed?
  • How many pages does the PDF have?
  • How many pages does a table span?
  • How many documents do you need to scan?
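The first questions on this list can be checked programmatically. The following is a minimal sketch, assuming the pypdf library is installed; the helper names and the `min_chars` threshold are hypothetical and chosen only for illustration. It counts the pages and guesses whether the file has a usable text layer or is likely a scanned image that needs OCR first.

```python
def looks_scanned(page_texts, min_chars=20):
    """Guess whether a PDF is image-only from the text extracted per page."""
    # Assumption: fewer than `min_chars` characters per page on average
    # means there is no usable text layer, so OCR would be needed first.
    if not page_texts:
        return True
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) < min_chars


def inspect_pdf(pdf_path):
    """Return (page_count, looks_scanned) for a PDF on disk."""
    from pypdf import PdfReader  # lazy import; only needed for real files

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for image-only pages.
    texts = [page.extract_text() or "" for page in reader.pages]
    return len(texts), looks_scanned(texts)
```

If `inspect_pdf` reports a likely scan, text-based tools such as Camelot will probably find nothing, and an OCR step is the more promising route.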

There are many ways to extract tables from a PDF, ranging from table-specialized OCR services to Python utility libraries that help you build your own extraction program.

An example of a powerful tool for converting tables from PDF to Excel is Camelot, which you have included in your question's tags. It abstracts away much of the complexity involved in the task at hand. You just install it and use it, for example, like this:

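import camelot

# Camelot accepts a local path or a URL; by default it reads only the first page.
file = 'https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf'
tables = camelot.read_pdf(file)
# Export the first detected table to an Excel file.
tables[0].to_excel('table.xlsx')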
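To keep one column in a Python variable rather than exporting to Excel, each Camelot table exposes its contents as a pandas DataFrame through `table.df`. The following is a rough sketch, not a definitive implementation. It assumes that the column header is literally "Address" and that the header sits in the first row of each extracted table; both are guesses based on the question. The helper names `column_values` and `extract_column` are made up for this example.

```python
def column_values(rows, header):
    """Return the cells below `header` from one table given as a list of rows."""
    if not rows:
        return []
    # Assumption: the first row of the extracted table holds the column names.
    names = [str(cell).strip().lower() for cell in rows[0]]
    target = header.strip().lower()
    if target not in names:
        return []
    idx = names.index(target)
    return [str(row[idx]).strip() for row in rows[1:]]


def extract_column(pdf_path, header="Address", pages="all"):
    """Collect one column from every table Camelot finds in the PDF."""
    import camelot  # lazy import so column_values works without Camelot

    values = []
    for table in camelot.read_pdf(pdf_path, pages=pages):
        # table.df is a pandas DataFrame; convert it to plain rows.
        values.extend(column_values(table.df.values.tolist(), header))
    return values


# Small in-memory demonstration with made-up rows (no PDF needed).
sample = [
    ["Name", "Address", "Phone"],
    ["Ann", "1 Main St", "555-0100"],
    ["Bob", "2 Oak Ave", "555-0101"],
]
addresses = column_values(sample, "Address")
print(addresses)
```

With a real file, `addresses = extract_column("your_file.pdf")` would gather the column from all three tables into one list, provided Camelot detects the tables and the header text matches.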

As I mentioned, the devil is in the details of the individual table and PDF file.
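One concrete case of this is Camelot's two parsing flavors: `lattice` for tables with ruled lines between cells and `stream` for tables separated by whitespace. A possible way to choose between them, sketched below under the assumption that the `accuracy` value in each table's `parsing_report` is a usable quality signal, is to try both and keep the better-scoring result. The helper names are hypothetical.

```python
def best_flavor(accuracy_by_flavor):
    """Return the flavor whose tables have the highest mean accuracy."""
    def mean(values):
        # A flavor that found no tables scores 0.
        return sum(values) / len(values) if values else 0.0

    return max(accuracy_by_flavor, key=lambda f: mean(accuracy_by_flavor[f]))


def read_with_best_flavor(pdf_path, pages="1"):
    """Parse the PDF with both flavors and return the better TableList."""
    import camelot  # lazy import; only needed for real files

    results = {
        flavor: camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
        for flavor in ("lattice", "stream")
    }
    scores = {
        flavor: [table.parsing_report["accuracy"] for table in tables]
        for flavor, tables in results.items()
    }
    return results[best_flavor(scores)]
```

The accuracy score is only a heuristic, so it is still worth inspecting the extracted tables by eye before relying on them.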

https://en.xdnf.cn/q/119920.html
