How to extract tables from a PDF with PDFMiner?

2024/9/24 19:18:35

I am trying to extract information from some tables in a pdf document.
Consider the input:

Title 1
some text some text some text some text some text
some text some text some text some text some text

Table Title
| Col1          | Col2    | Col3    |
|---------------|---------|---------|
| val11         | val12   | val13   |
| val21         | val22   | val23   |
| val31         | val32   | val33   |

Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text

I can get the outlines/titles as such:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
# Print the level and title of each outline entry.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print (level, title)

This gives me:

(1, u'Title 1')
(2, u'Table Title')
(1, u'Title 2')

Which is perfect, as the levels are aligned with the text hierarchy. Now I can extract the text as follows:

from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
text_from_pdf = open('textFromPdf.txt', 'w')
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()
    for element in layout:
        if isinstance(element, LTTextBox):
            text_from_pdf.write(''.join([i if ord(i) < 128 else ' ' for i in element.get_text()]))

Which gives me:

Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Table Title
Col1
val11
val12
val13
Col2
val21
val22
val23
Col3
val31
val32
val33
Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text

This is a bit odd, as the table is extracted column by column. Is it possible to get the table row by row? Moreover, how can I identify where a table begins and ends?
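
For the row-by-row part, one possible approach (not taken from the original post) is to use the coordinates that PDFMiner layout objects carry (x0, y0, x1, y1) instead of the order in which the text boxes are emitted. The sketch below is only an illustration under assumptions: it works on plain (x0, y0, text) tuples rather than real layout objects, the function name group_into_rows and the y_tolerance value are hypothetical, and the sample coordinates are made up. Collecting such tuples from the layout, for example from the text lines inside each LTTextBox, is not shown.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x0, y0, text) fragments into rows, top to bottom, left to right.

    Hypothetical helper: y_tolerance is an assumed threshold for treating
    two fragments as lying on the same text line.
    """
    rows = []  # each entry is (row_y, [(x0, text), ...])
    # PDF y coordinates grow upward, so sort by descending y to read top-down.
    for x0, y0, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append((y0, [(x0, text)]))
    # Within each row, order the cells from left to right by x0.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Made-up fragments, listed column by column as in the extracted text.
fragments = [
    (50, 700, "Col1"), (50, 680, "val11"), (50, 660, "val21"),
    (150, 700, "Col2"), (150, 680, "val12"), (150, 660, "val22"),
]
rows = group_into_rows(fragments)
for row in rows:
    print(row)
```

This only reorders text that has already been extracted. It does not detect where a table begins or ends, and it may need a different tolerance for documents with tight line spacing.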

Answer

If you only want to extract tables from PDF documents, then look at this answer: How to extract table as text from the PDF using Python?

Following that answer, I tried tabula-py, which worked for me on tables of figures spread across a multi-page PDF. It correctly skipped all the headers and footers. I had previously tried PDFMiner on the same type of document and ran into the same problem you describe, and sometimes worse.
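
A minimal sketch of what the tabula-py route might look like, assuming tabula-py and a Java runtime are installed. The wrapper name read_tables is hypothetical, and the file name is the one from the question. The import sits inside the function only so that the definition loads even where tabula-py is missing.

```python
def read_tables(pdf_path):
    """Return the tables tabula-py detects in pdf_path as pandas DataFrames."""
    # Imported lazily: tabula-py is a third-party package that also needs Java.
    import tabula

    # pages="all" scans every page; multiple_tables=True returns a list
    # with one DataFrame per detected table.
    return tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
```

Calling read_tables('myFile.pdf') should then give one DataFrame per detected table, each of which can be printed or iterated row by row. How well the table boundaries are detected depends on the document.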

