struct.error: unpack requires a string argument of length 16

2024/10/6 18:26:04

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error:

    $ pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/usr/local/bin/pdf2txt.py", line 115, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/usr/local/bin/pdf2txt.py", line 109, in main
        interpreter.process_page(page)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
        self.init_resources(resources)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
        font = self.get_font(None, subspec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
        font = PDFCIDFont(self, spec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
        StringIO(self.fontfile.get_data()))
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16

A similar file (1.pdf) doesn't cause a problem.

I can't find any information about the error. I added an issue on the pdfminer GitHub repository, but it remained unanswered. Can someone explain to me why this is happening? What can I do to parse 2.pdf?


Update: I get a similar error with BytesIO instead of StringIO after installing pdfminer directly from the GitHub repository.

    $ pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
        interpreter.process_page(page)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
        self.init_resources(resources)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
        font = self.get_font(None, subspec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
        font = PDFCIDFont(self, spec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
        BytesIO(self.fontfile.get_data()))
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16

Answer

TL;DR

Thanks to @mkl and @hynecker for the extra info. With that, I can confirm this is a bug in both pdfminer and your PDF. Whenever pdfminer tries to get embedded file streams (e.g. font definitions), it picks up the last one in the file before an endobj. Sadly, not all PDFs rigorously add the end tag, so pdfminer should be resilient to this.

Quick fix for this issue

I've created a patch, which has been submitted as a pull request on GitHub. See https://github.com/euske/pdfminer/pull/159.

Detailed diagnosis

As mentioned in the other answers, the reason you're seeing this is that you're not getting the expected number of bytes from the stream as pdfminer is unpacking the data. But why?

As you can see in your stack trace, pdfminer (rightly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading out a set of binary tables.

This doesn't work for 2.pdf because it has a text stream. You can see this by running dumppdf -b -i 18 2.pdf. I've put the start here:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
>> def /CMapName /Adobe-Identity-UCS def
...
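
To see why text like this trips the parser, here is a minimal sketch in Python 3. It assumes the font loader reads a 12-byte header and then a run of 16-byte directory entries; only the `struct.unpack('>4sLLL', fp.read(16))` call comes from the traceback, while `read_table_directory` and the sample bytes are hypothetical names and data made up for illustration.

```python
import struct
from io import BytesIO


def read_table_directory(fp):
    """Read a TrueType-style table directory (hypothetical helper)."""
    fonttype = fp.read(4)  # 4-byte font type tag, unused in this sketch
    # Four unsigned 16-bit fields; the first is the number of tables.
    (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
    tables = {}
    for _ in range(ntables):
        # Each entry must be exactly 16 bytes: tag, checksum, offset, length.
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
        tables[name] = (offset, length)
    return tables


# A hand-built header that declares one table, followed by one entry.
good = (b'\x00\x01\x00\x00'
        + struct.pack('>HHHH', 1, 16, 0, 0)
        + struct.pack('>4sLLL', b'cmap', 0, 28, 4))
print(read_table_directory(BytesIO(good)))

# The start of the CMap text above, fed to the same reader.
bad = b'/CIDInit /ProcSet findresource begin\n12 dict begin\nbegincmap\n'
try:
    read_table_directory(BytesIO(bad))
except struct.error as exc:
    # The bytes "In" are misread as a huge table count, so the loop
    # keeps reading until the data runs out and a read comes up short.
    print('struct.error:', exc)
```

The exact wording of the message differs on Python 3, but the cause is the same: `fp.read(16)` returns fewer than 16 bytes because the stream is not a font file.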

So, garbage in, garbage out... Is this a bug in your file or pdfminer? Well, the fact that other readers can handle it made me suspicious.

Digging around a little more, I see that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that these cannot be the same.

Digging into the code further, I see that all streams are getting the same data. Oops! This is the bug. The cause appears to be related to the fact that this PDF is missing some end tags, as noted by @hynecker.

The fix is to return the right data for each stream. Any other fix to just swallow the error will result in bad data being used for all streams and so, for example, incorrect font definitions.

I believe the attached patch will fix your problem and should be safe to use in general.
