Iterate over a large .xz file line by line in Python

2024/10/10 8:17:55

I have a large .xz file (a few gigabytes) full of plain text. I want to process the text to create a custom dataset. Because the file is too big to load at once, I want to read it line by line. Does anyone have an idea how to do it?

I already tried the approach from "How to open and read LZMA file in-memory", but it is not working.

EDIT: I get this error:

'ascii' codec can't decode byte 0xfd in position 0: ordinal not in range(128)

It is raised on the line "for line in uncompressed:" from the linked answer.

EDIT2: My code (using Python 3.5):

with open(filename) as compressed:
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)

Answer

I faced the same question a few weeks ago. This snippet worked for me:

import lzma

with lzma.open('filename.xz', mode='rt') as file:
    for line in file:
        print(line)

This works when the default text encoding of your platform matches the encoding of the text in the compressed file (UTF-8 in my case). lzma.open() accepts an encoding argument in text mode, so you can set the encoding explicitly if needed.
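As a minimal sketch of passing the encoding explicitly, the example below wraps the loop in a generator and reads a small Latin-1 encoded sample. The helper name iter_xz_lines and the sample file are hypothetical and only serve the demonstration; the real file would simply replace sample_path.

```python
import lzma
import os
import tempfile


def iter_xz_lines(path, encoding="utf-8"):
    """Yield decoded lines from an .xz file one at a time (hypothetical helper)."""
    # mode="rt" decompresses lazily and decodes bytes to str with `encoding`.
    with lzma.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


# Build a tiny sample archive so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.xz")
with lzma.open(sample_path, mode="wt", encoding="latin-1") as out:
    out.write("café\nnaïve\n")

# Read it back with the same encoding it was written with.
lines = list(iter_xz_lines(sample_path, encoding="latin-1"))
for line in lines:
    print(line)
```

Because the file object is iterated lazily, only a small buffer of decompressed data is held in memory at a time, which is what makes this suitable for multi-gigabyte files.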

EDIT (after your edit): try forcing encoding='utf-8' in lzma.open().
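A note on the error itself: 0xfd is the first byte of the .xz file header, so the failure at position 0 suggests that the compressed file was opened in text mode and Python tried to decode the raw compressed bytes (presumably with an ASCII default encoding on that system). If you prefer to keep the open() plus lzma.LZMAFile structure from the question, a sketch of one way to do it is to open the file in binary mode and decode afterwards. The function name and sample file below are hypothetical.

```python
import io
import lzma
import os
import tempfile


def read_lines_via_lzmafile(path, encoding="utf-8"):
    """Yield text lines using open() + LZMAFile, as in the question's code."""
    # "rb" is the important part: LZMAFile needs raw bytes, not decoded text.
    with open(path, "rb") as compressed:
        with lzma.LZMAFile(compressed) as uncompressed:
            # LZMAFile yields bytes; TextIOWrapper decodes them to str.
            text = io.TextIOWrapper(uncompressed, encoding=encoding)
            for line in text:
                yield line


# Build a tiny sample archive so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.xz")
with lzma.open(sample_path, mode="wt", encoding="utf-8") as out:
    out.write("first line\nsecond line\n")

# The .xz header starts with the byte 0xfd, the byte named in the error.
with open(sample_path, "rb") as raw:
    magic = raw.read(6)

lines = list(read_lines_via_lzmafile(sample_path))
print(magic, lines)
```

In practice lzma.open() with mode='rt' does the same wrapping for you, so this form is mainly useful when you already have a binary file object.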
