Extracting text from a PDF using Python and PyPDF2

2024/10/9 12:30:35

I want to extract text from a PDF file using Python and the PyPDF2 package. This is my PDF file and this is my code:

import PyPDF2
opened_pdf = PyPDF2.PdfFileReader('test.pdf', 'rb')
p = opened_pdf.getPage(0)
p_text = p.extractText()
# extract data line by line
P_lines = p_text.splitlines()
print P_lines

My problem is that P_lines does not hold the text line by line: splitlines() returns a list containing one giant string. I want to extract the text line by line so that I can analyze it. Any suggestions on how to improve this? Thanks! This is the string that the code returns:

[u'Ingredient information for chemicals subject to 29 CFR 1910.1200(i)and Appendix D are obtained from suppliers Material Safety Data Sheets(MSDS)** Information is based on the maximum potential forconcentration and thus the total may be over 100%* Total Water Volumesources may include fresh water, produced water, and/or recycledwater0.01271%72.00%7732-18-5Water0.00071%4.00%1310-73-2SodiumHydroxide0.00424%24.00%533-74-4DazomatBiocidePumpcoPlexcide24L0.00828%75.00%Organic phosphonic acidsalts0.00276%25.00%67-56-1Methyl AlcoholScale InhibitorPumpcoPlexaid6730.00807%30.00%7732-18-5Water0.00188%7.00%Polyethoxylated alcohol surfactants0.00753%28.00%9003-06-9AmmoniumSalts0.00941%35.00%64742-47-8Petroleum DistillateFrictionReducerPumpcoPlexslick9210.05029%60.00%7732-18-5Water0.03353%40.00%7647-01-0Hydrogen ChlorideHydrochloric AcidPumpcoHCL9.84261%100.00%14808-60-7CrystalineSilicaProppantPumpcoSand90.01799%100.00%7732-18-5WaterCommentsMaximumIngredientConcentrationin HF Fluid(% by mass)**MaximumIngredientConcentrationin Additive(% bymass)**Chemical AbstractService Number(CAS#)IngredientsPurposeSupplierTrade NameHydraulic Fracturing Fluid Composition:2,608,032Total Water Volume (gal)*:7,595True VerticalDepth (TVD):GasProduction Type:NAD27Long/LatProjection:32.558525Latitude:-97.215242Longitude:Ole Gieser Unit D6HWell Name and Number:XTO EnergyOperator Name:42-439-35084APINumber:TarrantCounty:TexasState:12/10/2010Fracture DateHydraulicFracturing Fluid Product Component Information Disclosure']


Answer
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')  # open() replaces file(), which exists only in Python 2
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt('test.pdf').strip().split('\n\n'))

Output

Hydraulic Fracturing Fluid Product Component Information Disclosure

Fracture Date State: County: API Number: Operator Name: Well Name andNumber: Longitude: Latitude: Long/Lat Projection: Production Type:True Vertical Depth (TVD): Total Water Volume (gal)*:

12/10/2010 Texas Tarrant 42-439-35084 XTO Energy Ole Gieser Unit D 6H-97.21524232.558525 NAD27 Gas 7,595 2,608,032

Hydraulic Fracturing Fluid Composition:

Trade Name

Supplier

Purpose

Ingredients

Chemical Abstract Service Number

(CAS #)

Maximum Ingredient

Concentration

in Additive (% by mass)**

Comments

Maximum Ingredient

Concentration

in HF Fluid (% by mass)**

Water Sand HCL

Pumpco Pumpco

Proppant Hydrochloric Acid

Plexslick 921

Pumpco

Friction Reducer

Plexaid 673

Pumpco

Scale Inhibitor

Plexcide 24L

Pumpco

Biocide

Crystaline Silica

Hydrogen Chloride Water

Petroleum Distillate Ammonium Salts Polyethoxylated alcoholsurfactants Water

Methyl Alcohol Organic phosphonic acid salts

Dazomat Sodium Hydroxide Water

7732-18-5 14808-60-7

7647-01-0 7732-18-5

64742-47-8 9003-06-9

7732-18-5

67-56-1

533-74-4 1310-73-2 7732-18-5

100.00100.00

90.017999.84261

40.0060.00

35.0028.007.0030.00

25.0075.00

24.004.0072.00

0.033530.05029

0.009410.007530.001880.00807

0.002760.00828

0.004240.000710.01271

* Total Water Volume sources may include fresh water, produced water, and/or recycled water** Information is based on the maximum potential for concentration and thus the total may be over 100%

Ingredient information for chemicals subject to 29 CFR 1910.1200(i)and Appendix D are obtained from suppliers Material Safety Data Sheets(MSDS)
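
The question's list held a single element, which suggests the string it was built from had no line breaks for splitlines() to split on. The pdfminer text above does contain them, so it can be broken into blocks and lines. A minimal sketch of that step, assuming the blank-line layout shown in the output; split_blocks_and_lines and the sample strings are hypothetical names made up for illustration and are not part of pdfminer or PyPDF2:

```python
def split_blocks_and_lines(text):
    """Split extracted text into blocks (separated by blank lines),
    each block being a list of its non-empty, stripped lines."""
    blocks = []
    for raw_block in text.strip().split("\n\n"):
        lines = [line.strip() for line in raw_block.splitlines() if line.strip()]
        if lines:
            blocks.append(lines)
    return blocks


# Hypothetical sample standing in for the text convert_pdf_to_txt() returns.
sample = "Trade Name\n\nSupplier\n\nWater\nSand\nHCL\n\n"
blocks = split_blocks_and_lines(sample)
print(blocks)

# A string without newline characters stays in one piece.
flat = "Trade NameSupplierWaterSandHCL"
print(flat.splitlines())
```

In practice the argument would be the return value of convert_pdf_to_txt('test.pdf'). Note that the order of the blocks follows pdfminer's layout analysis, so table cells may still need to be matched up by position afterwards.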

