Stream multiple files into a readable object in Python

2024/10/13 1:16:34

I have a function that processes binary data from a file using the file.read(len) method. However, my file is huge and has been split into many smaller files of 50 MB each. Is there a wrapper class that feeds many files into a buffered stream and provides a read() method?

The fileinput.FileInput class can do something similar, but it only supports line-by-line reading (a readline() method with no arguments) and has no read(len) method that takes the number of bytes to read.

Answer

Instead of converting the list of streams into a generator - as some of the other answers do - you can chain the streams together and then use the file interface:

import io


def chain_streams(streams, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    Chain an iterable of streams together into a single buffered stream.

    Usage:
        def generate_open_file_streams():
            for file in filenames:
                yield open(file, 'rb')
        f = chain_streams(generate_open_file_streams())
        f.read()
    """

    class ChainStream(io.RawIOBase):
        def __init__(self):
            self.leftover = b''
            self.stream_iter = iter(streams)
            try:
                self.stream = next(self.stream_iter)
            except StopIteration:
                self.stream = None

        def readable(self):
            return True

        def _read_next_chunk(self, max_length):
            # Return 0 or more bytes from the current stream, first returning all
            # leftover bytes. If the stream is closed returns b''
            if self.leftover:
                return self.leftover
            elif self.stream is not None:
                return self.stream.read(max_length)
            else:
                return b''

        def readinto(self, b):
            buffer_length = len(b)
            chunk = self._read_next_chunk(buffer_length)
            while len(chunk) == 0:
                # move to next stream
                if self.stream is not None:
                    self.stream.close()
                try:
                    self.stream = next(self.stream_iter)
                    chunk = self._read_next_chunk(buffer_length)
                except StopIteration:
                    # No more streams to chain together
                    self.stream = None
                    return 0  # indicate EOF

            # Hand back at most buffer_length bytes and keep the rest for the next call
            output, self.leftover = chunk[:buffer_length], chunk[buffer_length:]
            b[:len(output)] = output
            return len(output)

    return io.BufferedReader(ChainStream(), buffer_size=buffer_size)

Then use it as any other file/stream:

f = chain_streams(open_files_or_chunks)  # any iterable of open binary streams
data = f.read(n)  # n is the number of bytes to read
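
To check that a single read really crosses file boundaries, here is a small self-contained sketch. It is not part of the original answer: the part file names, the open_parts helper and the 5-byte payloads are made up for illustration. chain_streams is repeated in a condensed form, without the leftover handling, on the assumption that each part's read(n) never returns more than n bytes. This is only so that the snippet runs on its own.

```python
import io
import os
import tempfile


def chain_streams(streams, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """Condensed stand-in for the answer's chain_streams (assumption: each
    stream's read(n) returns at most n bytes, so no leftover buffer is kept)."""

    class ChainStream(io.RawIOBase):
        def __init__(self):
            self.stream_iter = iter(streams)
            self.stream = next(self.stream_iter, None)

        def readable(self):
            return True

        def readinto(self, b):
            while self.stream is not None:
                chunk = self.stream.read(len(b))
                if chunk:
                    b[:len(chunk)] = chunk
                    return len(chunk)
                # Current part is exhausted: close it and move on to the next.
                self.stream.close()
                self.stream = next(self.stream_iter, None)
            return 0  # EOF: no parts left

    return io.BufferedReader(ChainStream(), buffer_size=buffer_size)


def open_parts(paths):
    # Hypothetical helper: opens each part lazily, only when the reader
    # actually reaches it, so at most one part is open at a time.
    for path in paths:
        yield open(path, "rb")


with tempfile.TemporaryDirectory() as tmp:
    # Three tiny files standing in for the 50 MB parts.
    payloads = [b"abcde", b"fghij", b"klmno"]
    paths = []
    for i, payload in enumerate(payloads):
        path = os.path.join(tmp, f"part{i:03d}.bin")
        with open(path, "wb") as out:
            out.write(payload)
        paths.append(path)

    with chain_streams(open_parts(paths)) as f:
        first = f.read(7)  # spans the boundary between part 0 and part 1
        rest = f.read()    # everything up to the end of the last part

print(first)  # b'abcdefg'
print(rest)   # b'hijklmno'
```

Passing a generator rather than a list of already-open files keeps the number of open file handles low, which matters when there are many parts.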