Decompressing a .bz2 file in Python

2024/10/11 16:31:39

So, this is a seemingly simple question, but I'm apparently very very dull. I have a little script that downloads all the .bz2 files from a webpage, but for some reason the decompressing of that file is giving me a MAJOR headache.

I'm quite a Python newbie, so the answer is probably quite obvious, please help me.

In this bit of the script, I already have the file, and I just want to read it out to a variable, then decompress that? Is that right? I've tried all sorts of ways to do this, and I usually get a "ValueError: couldn't find end of stream" error on the last line of this snippet. I've tried to open up the zipfile and write it out to a string in a zillion different ways. This is the latest.

openZip = open(zipFile, "r")
s = ''
while True:
    newLine = openZip.readline()
    if(len(newLine)==0):
        break
    s+=newLine
print s
uncompressedData = bz2.decompress(s)

Hi Alex, I should've listed all the other methods I've tried, as I've tried the read() way.

METHOD A:

print 'decompressing ' + filename
fileHandle = open(zipFile)
uncompressedData = ''
while True:
    s = fileHandle.read(1024)
    if not s:
        break
    print('RAW "%s"', s)
    uncompressedData += bz2.decompress(s)
uncompressedData += bz2.flush()
newFile = open(steamTF2mapdir + filename.split(".bz2")[0],"w")
newFile.write(uncompressedData)
newFile.close()

I get the error:

uncompressedData += bz2.decompress(s)
ValueError: couldn't find end of stream

METHOD B

zipFile = steamTF2mapdir + filename
print 'decompressing ' + filename
fileHandle = open(zipFile)
s = fileHandle.read()
uncompressedData = bz2.decompress(s)

Same error :

uncompressedData = bz2.decompress(s)
ValueError: couldn't find end of stream

Thanks so much for your prompt reply. I'm really banging my head against the wall, feeling inordinately thick for not being able to decompress a simple .bz2 file.

By the by, used 7zip to decompress it manually, to make sure the file isn't wonky or anything, and it decompresses fine.

Answer

You're opening and reading the compressed file as if it was a textfile made up of lines. DON'T! It's NOT.

uncompressedData = bz2.BZ2File(zipFile).read()

seems to be closer to what you're angling for.
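As a rough illustration of that one-liner, here is a minimal Python 3 sketch. The sample data and the file name example.bsp.bz2 are made up for the demonstration; only bz2.BZ2File and its read() method come from the answer.

```python
import bz2
import os
import tempfile

# Hypothetical sample data and file name, used only for this demonstration.
original = b"some map data\r\n" * 1000
zipFile = os.path.join(tempfile.mkdtemp(), "example.bsp.bz2")

# Create a small .bz2 file so there is something to read back.
with bz2.BZ2File(zipFile, "wb") as f:
    f.write(original)

# The recommended approach: BZ2File reads the raw bytes and decompresses them.
with bz2.BZ2File(zipFile) as f:
    uncompressedData = f.read()

print(uncompressedData == original)
```

Because BZ2File opens the underlying file in binary mode itself, there is no mode flag to get wrong.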

Edit: the OP has shown a few more things he's tried (though I don't see any notes about having tried the best method -- the one-liner I recommend above!) but they seem to all have one error in common, and I repeat the key bits from above:

opening ... the compressed file as if it was a textfile ... It's NOT.

open(filename) and even the more explicit open(filename, 'r') open a text file for reading. A compressed file is a binary file, so in order to read it correctly you must open it with open(filename, 'rb'). (My recommended bz2.BZ2File KNOWS it's dealing with a compressed file, of course, so there's no need to tell it anything more.)
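If you do want to keep the open-then-decompress shape of Method B, a minimal Python 3 sketch with the 'rb' fix might look like this. The helper name decompress_file and the sample data are hypothetical, not from the question.

```python
import bz2
import os
import tempfile

def decompress_file(zipFile):
    """Read a whole .bz2 file as raw bytes and decompress it in one shot."""
    # 'rb' returns the bytes exactly as stored: no newline translation,
    # no text decoding.
    with open(zipFile, "rb") as fileHandle:
        s = fileHandle.read()
    return bz2.decompress(s)

# Hypothetical sample data containing the bytes that text mode tends to mangle.
original = b"line one\r\nline two\x1a\r\n" * 500
zipFile = os.path.join(tempfile.mkdtemp(), "example.bz2")
with open(zipFile, "wb") as out:
    out.write(bz2.compress(original))

result = decompress_file(zipFile)
print(result == original)
```

The output file should be written with 'wb' for the same reason, since the decompressed map data is binary too.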

In Python 2.*, on Unix-y systems (i.e. every system except Windows), you could get away with a sloppy use of open (but in Python 3.* you can't, as text is Unicode, while binary is bytes -- different types).

In Windows (and before that in DOS) it's always been indispensable to distinguish, as Windows' text files, for historical reasons, are peculiar (they use two bytes rather than one to end lines and, at least in some cases, take a byte of value '\x1a' as meaning a logical end of file), and so the low-level reading and writing code must compensate.
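To see why a premature "end of file" produces an end-of-stream error, here is a small Python 3 sketch. It does not reproduce Windows text mode; it only simulates the effect by cutting the compressed bytes short, which is roughly what a reader stopping at a '\x1a' byte would hand to bz2. The describe helper and the sample payload are hypothetical, and the exact error message differs between Python versions.

```python
import bz2

# Hypothetical payload, compressed in memory.
original = b"payload " * 2000
compressed = bz2.compress(original)

def describe(data):
    """Try to decompress data and report what happened as a string."""
    try:
        return "ok: %d bytes" % len(bz2.decompress(data))
    except ValueError as exc:
        return "ValueError: %s" % exc

# Stand-in for a text-mode read that stopped early: keep only the first half.
truncated = compressed[: len(compressed) // 2]

intact_result = describe(compressed)
truncated_result = describe(truncated)
print(intact_result)
print(truncated_result)
```

The intact bytes decompress normally, while the truncated copy fails because the decompressor never sees the end-of-stream marker.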

So I suspect the OP is using Windows and is paying the price for not carefully using the 'rb' option ("read binary") to the open built-in (though bz2.BZ2File is still simpler, whatever platform you're using!-).
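A separate point about Method A, beyond the text/binary issue: bz2.decompress is a one-shot function that expects a complete stream, so calling it on each 1024-byte chunk would fail even with 'rb', and the bz2 module has no module-level flush(). If chunked reading is really wanted, the standard library's bz2.BZ2Decompressor is meant for that. The following is a minimal Python 3 sketch; the function name, paths, and sample data are hypothetical.

```python
import bz2
import os
import tempfile

def decompress_in_chunks(src_path, dst_path, chunk_size=1024):
    """Decompress src_path into dst_path, feeding the decompressor chunk by chunk."""
    decompressor = bz2.BZ2Decompressor()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while not decompressor.eof:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # file ended (possibly before the stream did)
            # The decompressor keeps state between calls, so partial input is fine.
            dst.write(decompressor.decompress(chunk))

# Hypothetical demo: random bytes barely compress, so the .bz2 spans several chunks.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "example.bz2")
dst_path = os.path.join(workdir, "example")
original = os.urandom(5000)
with open(src_path, "wb") as out:
    out.write(bz2.compress(original))

decompress_in_chunks(src_path, dst_path)
with open(dst_path, "rb") as f:
    roundtrip = f.read()
print(roundtrip == original)
```

This sketch handles a single bzip2 stream per file; for ordinary downloads, bz2.BZ2File remains the simpler choice.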
