Question 1

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error:

    $ pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/usr/local/bin/pdf2txt.py", line 115, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/usr/local/bin/pdf2txt.py", line 109, in main
        interpreter.process_page(page)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
        self.init_resources(resources)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
        font = self.get_font(None, subspec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
        font = PDFCIDFont(self, spec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
        StringIO(self.fontfile.get_data()))
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16

A similar file (1.pdf) doesn't cause any problem.

I can't find any information about the error. I opened an issue on the pdfminer GitHub repository, but it has gone unanswered. Can someone explain why this is happening? What can I do to parse 2.pdf?

Update: I get a similar error with BytesIO instead of StringIO after installing pdfminer directly from the GitHub repository.

    $ pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
        interpreter.process_page(page)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
        self.init_resources(resources)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
        font = self.get_font(None, subspec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
        font = PDFCIDFont(self, spec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
        BytesIO(self.fontfile.get_data()))
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16

Question 2

TL;DR

Thanks to @mkl and @hynecker for the extra info. With that, I can confirm this is a bug in both pdfminer and your PDF. Whenever pdfminer tries to get embedded file streams (e.g. font definitions), it picks up the last one in the file before an endobj. Sadly, not all PDFs rigorously add the end tag, so pdfminer should be resilient to this.

Quick fix for this issue

I've created a patch, which has been submitted as a pull request on GitHub. See https://github.com/euske/pdfminer/pull/159.

Detailed diagnosis

As mentioned in the other answers, the reason you're seeing this is that you're not getting the expected number of bytes from the stream as pdfminer is unpacking the data. But why?
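The short read is easy to reproduce in isolation. Here is a minimal sketch, assuming Python 3 (where the message wording differs slightly from the Python 2.7 traceback) and using a hypothetical helper name, read_entry, that is not part of pdfminer. The format '>4sLLL' needs exactly 16 bytes, so a stream that ends early makes struct.unpack fail:

```python
import io
import struct

ENTRY_FORMAT = '>4sLLL'  # 4-byte tag plus three big-endian 32-bit unsigned ints

def read_entry(fp):
    """Read one 16-byte record, as the failing line in the traceback does."""
    return struct.unpack(ENTRY_FORMAT, fp.read(16))

# A full 16-byte record unpacks cleanly.
good = io.BytesIO(b'cmap' + struct.pack('>LLL', 1, 2, 3))
print(read_entry(good))

# A stream that ends early returns fewer than 16 bytes from read(),
# and struct.unpack refuses to guess the missing ones.
short = io.BytesIO(b'cmap\x00\x00')
try:
    read_entry(short)
except struct.error as exc:
    print('struct.error:', exc)
```

Note that fp.read(16) does not raise on a short stream; it silently returns what is left, which is why the error only surfaces inside struct.unpack.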

As you can see in your stack trace, pdfminer (rightly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading out a set of binary tables.

This doesn't work for 2.pdf because the stream it gets back contains text rather than binary font data. You can see this by running dumppdf -b -i 18 2.pdf. Here is the start of the output:

    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
    >> def /CMapName /Adobe-Identity-UCS def
    ...

So, garbage in, garbage out... Is this a bug in your file or pdfminer? Well, the fact that other readers can handle it made me suspicious.
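To make "garbage in, garbage out" concrete, here is a rough sketch of a TrueType table-directory reader fed with the start of that CMap text. It follows the standard layout (4-byte version, a 16-bit table count plus three more 16-bit fields, then one 16-byte entry per table) and is only modeled on the line shown in the traceback, not copied from pdffont.py. The function name read_table_directory is my own, and Python 3 is assumed:

```python
import io
import struct

def read_table_directory(fp):
    """Parse a TrueType table directory; returns {tag: (offset, length)}."""
    fp.read(4)  # sfnt version, ignored in this sketch
    # numTables, searchRange, entrySelector, rangeShift
    (ntables, _sr, _es, _rs) = struct.unpack('>HHHH', fp.read(8))
    tables = {}
    for _i in range(ntables):
        (tag, _checksum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
        tables[tag] = (offset, length)
    return tables

cmap_text = b"/CIDInit /ProcSet findresource begin\n12 dict begin\nbegincmap\n"

# Bytes 4-5 of the text are b"In", which decode as a table count of 0x496E.
print(struct.unpack('>H', cmap_text[4:6])[0])  # 18798

# The reader then tries to pull 18798 entries out of a few dozen bytes.
try:
    read_table_directory(io.BytesIO(cmap_text))
except struct.error as exc:
    print('struct.error:', exc)
```

The text is misread as a font with thousands of tables, and the loop runs off the end of the stream after a handful of entries, which matches the error in the question.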

Digging around a little more, I see that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that these cannot be the same.

Digging into the code further, I see that all streams are getting the same data. Oops! This is the bug. The cause appears to be related to the fact that this PDF is missing some end tags, as noted by @hynecker.

The fix is to return the right data for each stream. Any other fix to just swallow the error will result in bad data being used for all streams and so, for example, incorrect font definitions.
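As a toy illustration of what "resilient to a missing endobj" can mean, the sketch below splits a buffer into objects and treats the next "N G obj" header as an implicit end when the closing tag is absent. This is not the patch from the pull request and not how pdfminer is structured; split_objects and the sample data are made up for the example, and Python 3 is assumed:

```python
import re

# Matches an object header such as b"17 0 obj" at the start of a line.
HEADER = re.compile(rb'^(\d+)\s+(\d+)\s+obj\b', re.MULTILINE)

def split_objects(data):
    """Return {object_id: body}, ending each body at the next header
    when the closing endobj is absent."""
    headers = list(HEADER.finditer(data))
    objects = {}
    for i, match in enumerate(headers):
        start = match.end()
        end = headers[i + 1].start() if i + 1 < len(headers) else len(data)
        body = data[start:end].strip()
        if body.endswith(b'endobj'):
            body = body[:-len(b'endobj')].strip()
        objects[int(match.group(1))] = body
    return objects

sample = (
    b"17 0 obj\n(cmap for ToUnicode)\n"      # endobj missing here
    b"18 0 obj\n(font file bytes)\nendobj\n"
)
objects = split_objects(sample)
print(objects[17])
print(objects[18])
```

Each object keeps its own body even though object 17 never closes. A real parser cannot rely on pattern matching like this, since binary stream data may happen to contain similar byte sequences, but the principle of not letting one unterminated object bleed into the next is the same.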

I believe the attached patch will fix your problem and should be safe to use in general.

