PyPDF2 wont extract all text from PDF

2024/10/12 5:48:10

I'm trying to extract text from a PDF (https://www.sec.gov/litigation/admin/2015/34-76574.pdf) using PyPDF2, and the only result I'm getting is the following string:

b''

Here is my code:

import PyPDF2
import urllib.request
import iourl = 'https://www.sec.gov/litigation/admin/2015/34-76574.pdf'
remote_file = urllib.request.urlopen(url).read()
memory_file = io.BytesIO(remote_file)read_pdf = PyPDF2.PdfFileReader(memory_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(1)
page_content = page.extractText()
print(page_content.encode('utf-8'))

This code worked correctly on a few of the PDFs I'm working with (e.g. https://www.sec.gov/litigation/admin/2016/34-76837-proposed-amended-distribution-plan.pdf), but the others like the file above didn't work. Any idea what's wrong?

Answer

I don't know why pypdf2 can't extract the information from that PDF, but the package pdftotext can:

import pdftotext
from six.moves.urllib.request import urlopen
import iourl = 'https://www.sec.gov/litigation/admin/2015/34-76574.pdf'
remote_file = urlopen(url).read()
memory_file = io.BytesIO(remote_file)pdf = pdftotext.PDF(memory_file)# Iterate over all the pages
for page in pdf:print(page)

Extracted

                                  UNITED STATES OF AMERICABefore theSECURITIES AND EXCHANGE COMMISSION
SECURITIES EXCHANGE ACT OF 1934
Release No. 76574 / December 7, 2015
ADMINISTRATIVE PROCEEDING
File No. 3-16987ORDER INSTITUTING CEASE-AND-DESIST
In the Matter of                                        PROCEEDINGS, PURSUANT TO SECTION21C OF THE SECURITIES EXCHANGE ACTKEFEI WANG                                      OF 1934, MAKING FINDINGS, ANDIMPOSING REMEDIAL SANCTIONS AND A
Respondent.                                             CEASE-AND-DESIST ORDERI.The Securities and Exchange Commission (“Commission”) deems it appropriate and in the
public interest that cease-and-desist proceedings be, and hereby are, instituted pursuant to 21C of
the Securities Exchange Act of 1934 (“Exchange Act”) against Kefei Wang (“Respondent”).II.In anticipation of the institution of these proceedings, Respondent has submitted an Offer
of Settlement (the “Offer”) which the Commission has determined to accept. Solely for the
purpose of these proceedings and any other proceedings brought by or on behalf of the
Commission, or to which the Commission is a party, and without admitting or denying the findings
herein, except as to the Commission’s jurisdiction over him and the subject matter of these
proceedings, which are admitted, and except as provided herein in Section V, Respondent consents
to the entry of this Order Instituting Cease-and-Desist Proceedings, Pursuant to Section 21C of the
Securities Exchange Act of 1934, Making Findings, and Imposing Remedial Sanctions and a
Cease-and-Desist Order (“Order”), as set forth below.III.On the basis of this Order and Respondent’s Offer, the Commission finds1 that:Summary1.      Respondent violated Section 15(a)(1) of the Exchange Act by acting as an
unregistered broker-dealer in connection with his representation of clients who were seeking U.S.
residency through the Immigrant Investor Program. Respondent helped effect certain individuals’
securities purchases in an EB-5 Regional Center. Respondent received a commission from that
Regional Center for each investment he facilitated.Respondent2.      Kefei Wang, age 39, is a resident of China. During the relevant time period, he was
a U.S. resident and an owner of Nautilus Global Capital, LLC , a now defunct entity that was based
in Fremont, California.Background3.      The United States Congress created the Immigrant Investor Program, also known as
“EB-5,” in 1990 to stimulate the U.S. economy through job creation and capital investment by
foreign investors. The Program offers EB-5 visas to individuals who invest $1 million in a new
commercial enterprise that creates or preserves at least 10 full-time jobs for qualifying U.S.
workers (or $500,000 in an enterprise located in a rural area or an area of high unemployment). A
certain number of EB-5 visas are set aside for investors in approved Regional Centers. A Regional
Center is defined as “any economic unit, public or private, which is involved with the promotion of
economic growth, including increased export sales, improved regional productivity, job creation,
and increased domestic capital investment.” 8 C.F.R. § 204.6(e) (2015).4.      Typical Regional Center investment vehicles are offered as limited partnership
interests. The partnership interests are securities, usually offered pursuant to one or more
exemptions from the registration requirements of the U.S. securities laws. The Regional Centers
are often managed by a person or entity which acts as a general partner of the limited partnership.
The Regional Centers, the investment vehicles, and the managers are collectively referred to herein
as “EB-5 Investment Offerers.”5.      Various EB-5 Investment Offerers paid commissions to anyone who successfully
sold limited partnership interests to new investors.
1The findings herein are made pursuant to Respondent’s Offer of Settlement and are notbinding on any other person or entity in this or any other proceeding.2Respondent Received Commissions for His Clients’ EB-5 Investments6.      From at least January 2010 through May 2014, Respondent received a portion of
commissions from one EB-5 Investment Offerer totaling $40,000. The commissions constituted
his portion of the commissions that were paid pursuant to a written Agency Agreement between
Nautilus Global Capital and the EB-5 Investment Offerer. On one or more occasions the
commission was paid to a foreign bank account identified by the Respondent despite the fact that
the Respondent was U.S.-based during the relevant time period.7.      Respondent performed activities necessary to effectuate the transaction, including
recommending the specific EB-5 Investment Offerer referenced in paragraph 6 to his clients;
acting as a liaison between the EB-5 Investment Offerer and the investors; and facilitating the
transfer and/or documentation of investment funds to the EB-5 Investment Offerer. Respondent
received his portion of transaction-based commissions due to Nautilus Global Capital for its
services from that EB-5 Investment Offerer.8.      As a result of the conduct described above, Respondent violated Section 15(a)(1) of
the Exchange Act which makes it unlawful for any broker or dealer which is either a person other
than a natural person or a natural person not associated with a broker or dealer to make use of the
mails or any means or instrumentality of interstate commerce “to effect any transactions in, or to
induce or attempt to induce the purchase or sale of, any security” unless such broker or dealer is
registered in accordance with Section 15(b) of the Exchange Act.IV.In view of the foregoing, the Commission deems it appropriate to impose the sanctions
agreed to in Respondent Kefei Wang’s Offer.Accordingly, pursuant to Section 21C of the Exchange Act, it is hereby ORDERED that:A.      Respondent shall cease and desist from committing or causing any violations and
any future violations of Section 15(a)(1) of the Exchange Act.B.      Respondent shall, within ten (10) days of the entry of this Order, pay disgorgement
of $40,000, prejudgment interest of $1,590, and a civil money penalty of $25,000 to the Securities
and Exchange Commission for transfer to the general fund of the United States Treasury in
accordance with Exchange Act Section 21F(g)(3). If timely payment of disgorgement and
prejudgment interest is not made, additional interest shall accrue pursuant to SEC Rule of Practice
600 [17 C.F.R. § 201.600]. If timely payment of the civil money penalty is not made, additional
interest shall accrue pursuant to 31 U.S.C. § 3717. Payment must be made in one of the following
ways:(1)     Respondent may transmit payment electronically to the Commission, which willprovide detailed ACH transfer/Fedwire instructions upon request;3(2)       Respondent may make direct payment from a bank account via Pay.gov through theSEC website at http://www.sec.gov/about/offices/ofm.htm; or(3)       Respondent may pay by certified check, bank cashier’s check, or United Statespostal money order, made payable to the Securities and Exchange Commission andhand-delivered or mailed to:Enterprise Services CenterAccounts Receivable BranchHQ Bldg., Room 181, AMZ-3416500 South MacArthur BoulevardOklahoma City, OK 73169Payments by check or money order must be accompanied by a cover letter identifying
Kefei Wang as a Respondent in these proceedings, and the file number of these proceedings; a
copy of the cover letter and check or money order must be sent to Stephen L. Cohen, Associate
Director, Division of Enforcement, Securities and Exchange Commission, 100 F St., NE,
Washington, DC 20549-5553.V.It is further Ordered that, solely for purposes of exceptions to discharge set forth in Section
523 of the Bankruptcy Code, 11 U.S.C. § 523, the findings in this Order are true and admitted by
Respondent, and further, any debt for disgorgement, prejudgment interest, civil penalty or other
amounts due by Respondent under this Order or any other judgment, order, consent order, decree
or settlement agreement entered in connection with this proceeding, is a debt for the violation by
Respondent of the federal securities laws or any regulation or order issued under such laws, as set
forth in Section 523(a)(19) of the Bankruptcy Code, 11 U.S.C. § 523(a)(19).By the Commission.Brent J. FieldsSecretary4[Finished in 0.5s]
https://en.xdnf.cn/q/69688.html

Related Q&A

Python 3.4 decode bytes

I am trying to write a file in python, and I cant find a way to decode a byte object before writing the file, basically, I am trying to decode this bytes string:Les \xc3\x83\xc2\xa9vad\xc3\x83\xc2\xa9s…

No module named unusual_prefix_*

I tried to run the Python Operator Example in my Airflow installation. The installation has deployed webserver, scheduler and worker on the same machine and runs with no complaints for all non-PytohnOp…

Python: Variables are still accessible if defined in try or if? [duplicate]

This question already has answers here:Short description of the scoping rules(9 answers)Closed last year.Im a Python beginner and I am from C/C++ background. Im using Python 2.7.I read this article: A …

Networkx Traveling Salesman Problem (TSP)

I would like to know if there is a function in NetworkX to solve the TSP? I can not find it. Am I missing something? I know its an NP hard problem but there should be some approximate solutions right

Comparing dateutil.relativedelta

Im trying to do a > comparison between two relativedeltas: if(relativedelta(current_date, last_activity_date) > relativedelta(minutes=15)):Here is the output from the debugger window in Eclipse:O…

Python. Argparser. Removing not-needed arguments

I am parsing some command-line arguments, and most of them need to be passed to a method, but not all.parser = argparse.ArgumentParser() parser.add_argument("-d", "--dir", help = &q…

Where can I find some hello world-simple Beautiful Soup examples?

Id like to do a very simple replacement using Beautiful Soup. Lets say I want to visit all A tags in a page and append "?foo" to their href. Can someone post or link to an example of how to …

Difference between isnumeric and isdecimal in Python

What is the difference between isnumeric and isdecimal functions for strings ( https://www.tutorialspoint.com/python3/python_strings.htm )? They seem to give identical results: >>> "1234…

Estimating zip size/creation time

I need to create ZIP archives on demand, using either Python zipfile module or unix command line utilities. Resources to be zipped are often > 1GB and not necessarily compression-friendly.How do I e…

How to rename the index of a Dask Dataframe

How would I go about renaming the index on a dask dataframe? I tried it like sodf.index.name = foobut rechecking df.index.name shows it still being whatever it was previously.