Memory Error Python Processing Large File Line by Line

2024/10/15 7:28:11

I am trying to concatenate model output files. The model run was broken up into 5 parts, and each output file corresponds to one of those partial runs. Because of the way the software writes its output, it starts relabelling the timesteps from 0 in each output file. I wrote some code to:

1) concatenate all the output files together

2) edit the merged file to re-label all timesteps, starting at 0 and increasing by a fixed increment at each one.

The aim is that I can load this single file into my visualization software in one chunk, rather than open 5 different windows.

So far my code throws a memory error due to the large files I am dealing with.

I have a few ideas for getting rid of it, but I'm not sure which will work and which might slow things down to a crawl.

Code so far:

import os
import time

start_time = time.time()

# create new txt file in same folder as python script
open("domain.txt", "w").close()

"""create concatenated document of all tecplot output files"""
# look into file number 1
for folder in range(1, 6, 1):
    folder = str(folder)
    for name in os.listdir(folder):
        if "domain" in name:
            with open(folder + '/' + name) as file_content_list:
                start = ""
                for line in file_content_list:
                    start = start + line  # + '\n'
                with open('domain.txt', 'a') as f:
                    f.write(start)
                # print start
# identify file with "domain" in name
# extract contents
# append to the end of the new document with "domain" in folder level above
# once completed, add 1 to the file number previously searched and do again
# keep going until no more files with a higher number exist

""" replace the old timesteps with new timesteps """
# open folder named domain.txt
# Look for lines:
## ZONE T="0.000000000000e+00s", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL
## STRANDID=1, SOLUTIONTIME=0.000000000000e+00
# if they are found edits them, otherwise copy the line without alteration
with open("domain.txt", "r") as combined_output:
    start = ""
    start_timestep = 0
    time_increment = 3.154e10
    for line in combined_output:
        if "ZONE" in line:
            start = start + 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
        elif "STRANDID" in line:
            start = start + 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
            start_timestep = start_timestep + time_increment
        else:
            start = start + line
    with open('domain_final.txt', 'w') as f:
        f.write(start)

end_time = time.time()
print 'runtime : ', end_time-start_time
os.remove("domain.txt")

So far, I get the memory error at the concatenation stage.

To improve I could:

1) Do the corrections on the fly as I read each file. But since it already fails to get through a single file, I don't think that would change anything other than computing time.

2) Load the whole file into an array, wrap the checks in a function, and run that function on the array:

Something like:

def do_correction(line):
    if "ZONE" in line:
        return 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
    elif "STRANDID" in line:
        return 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
    else:
        return line

3) Keep it as is and have Python indicate when it is about to run out of memory, then write to the file at that stage. Does anyone know if that is possible?

Thank you for your help

Answer

It is not necessary to read the entire contents of each file into memory before writing to the output file. Large files will just consume a lot of memory, possibly all that is available.

Simply read and write one line at a time. Also, open the output file only once, and choose a name that will not be picked up and treated as an input file itself. Otherwise you run the risk of concatenating the output file onto itself (not a problem yet, but it could become one if you also process files from the current directory), if loading it doesn't already consume all memory.

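import os.path

# open the output file once and stream every input line straight into it
with open('output.txt', 'w') as outfile:
    for folder in range(1, 6, 1):
        # os.listdir() needs the folder name as a string, not an int
        for name in os.listdir(str(folder)):
            if "domain" in name:
                with open(os.path.join(str(folder), name)) as file_content_list:
                    for line in file_content_list:
                        # perform corrections/modifications to line here
                        outfile.write(line)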

Now you can process the data in a line-oriented manner: just modify each line before writing it to the output file.
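As a rough sketch of what that could look like with the relabelling from the question folded in, the version below does the concatenation and the timestep correction in a single streaming pass, so no intermediate domain.txt is needed. The function names relabel and merge_outputs are made up for illustration, the ZONE header values are hard-coded exactly as in the question, and sorting the directory listing is an assumption to make the order predictable (os.listdir returns names in arbitrary order).

```python
import os


def relabel(line, timestep):
    """Return the line with its timestep rewritten if it is a header line.

    Hypothetical helper: the header text is copied from the question and
    assumes every zone has the same N and E values.
    """
    if "ZONE" in line:
        return ('ZONE T="' + str(timestep) + 's", N=87715, E=173528, '
                'F=FEPOINT, ET=QUADRILATERAL\n')
    if "STRANDID" in line:
        return 'STRANDID=1, SOLUTIONTIME=' + str(timestep) + '\n'
    return line


def merge_outputs(folders, out_path, time_increment=3.154e10):
    """Stream every file with "domain" in its name into out_path.

    Only one line is held in memory at a time. out_path should not sit
    inside one of the input folders with "domain" in its name.
    """
    timestep = 0
    with open(out_path, 'w') as outfile:
        for folder in folders:
            for name in sorted(os.listdir(folder)):
                if "domain" not in name:
                    continue
                with open(os.path.join(folder, name)) as infile:
                    for line in infile:
                        outfile.write(relabel(line, timestep))
                        # the STRANDID line ends a zone header, so the
                        # next zone gets the next timestep
                        if "STRANDID" in line:
                            timestep += time_increment
```

For the layout in the question this would be called as merge_outputs([str(n) for n in range(1, 6)], 'output.txt'). Memory use stays flat regardless of file size, because nothing is accumulated in a string before writing.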
