Question 1

I am trying to concatenate model output files. The model run was broken up into 5 parts, and each output file corresponds to one of those partial runs. Because of the way the software writes to file, the timestep labels restart from 0 in each output file. I wrote some code to:

1) concatenate all the output files together

2) edit the merged file to relabel all timesteps, starting at 0 and increasing by a fixed increment at each one.

The aim is to load this single file into my visualization software in one go, rather than opening 5 different windows.

So far my code throws a memory error due to the large files I am dealing with.

I have a few ideas for getting rid of it, but I'm not sure which will work and which might slow things down to a crawl.

Code so far:

    import os
    import time

    start_time = time.time()

    #create new txt file in same folder as python script
    open("domain.txt","w").close()

    """create concatenated document of all tecplot output files"""
    #look into file number 1
    for folder in range(1,6,1):
        folder = str(folder)
        for name in os.listdir(folder):
            if "domain" in name:
                with open(folder+'/'+name) as file_content_list:
                    start = ""
                    for line in file_content_list:
                        start = start + line# + '\n'
                    with open('domain.txt','a') as f:
                        f.write(start)
    #  print start
    #identify file with "domain" in name
    #extract contents
    #append to the end of the new document with "domain" in folder level above
    #once completed, add 1 to the file number previously searched and do again
    #keep going until no more files with a higher number exist

    """ replace the old timesteps with new timesteps """
    #open folder named domain.txt
    #Look for lines:
    ##ZONE T="0.000000000000e+00s", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL
    ##STRANDID=1, SOLUTIONTIME=0.000000000000e+00
    # if they are found edits them, otherwise copy the line without alteration
    with open("domain.txt", "r") as combined_output:
        start = ""
        start_timestep = 0
        time_increment = 3.154e10
        for line in combined_output:
            if "ZONE" in line:
                start = start + 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
            elif "STRANDID" in line:
                start = start + 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
                start_timestep = start_timestep + time_increment
            else:
                start = start + line

    with open('domain_final.txt','w') as f:
        f.write(start)

    end_time = time.time()
    print 'runtime : ', end_time-start_time
    os.remove("domain.txt")

So far, I get the memory error at the concatenation stage.

To improve it I could:

1) Try to do the corrections on the go as I read each file, but since it's already failing to get through an entire one, I don't think that would make much of a difference other than to the computing time.

2) Load the whole file into an array, wrap the checks in a function, and run that function on the array:

Something like:

    def do_correction(line):
        if "ZONE" in line:
            return 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
        elif "STRANDID" in line:
            return 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
        else:
            return line

3) Keep it as is and ask Python to indicate when it is about to run out of memory, then write to the file at that stage. Does anyone know if that is possible?

Thank you for your help

Answer

It is not necessary to read the entire contents of each file into memory before writing to the output file. With large files, doing so can consume most or all of the available memory.

Simply read and write one line at a time. Also, open the output file only once, and choose a name that will not be picked up and treated as an input file itself. Otherwise you run the risk of concatenating the output file onto itself (not yet a problem, but it could be if you also process files from the current directory), assuming that loading it doesn't already consume all memory.

    import os.path

    with open('output.txt', 'w') as outfile:
        for folder in range(1,6,1):
            folder = str(folder)  # os.listdir() needs a path string, not an int
            for name in os.listdir(folder):
                if "domain" in name:
                    with open(os.path.join(folder, name)) as file_content_list:
                        for line in file_content_list:
                            # perform corrections/modifications to line here
                            outfile.write(line)
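One way to guard against the output file being picked up as an input, as a rough sketch: resolve each candidate to an absolute path and skip the one that matches the output. The helper name `iter_domain_files` and its parameters are made up for illustration; only the "domain" filter comes from the code above.

```python
import os


def iter_domain_files(folders, output_path, keyword="domain"):
    """Yield paths of input files, skipping the output file itself.

    `folders`, `output_path` and `keyword` are illustrative parameters,
    not part of the original script.
    """
    output_abs = os.path.abspath(output_path)
    for folder in folders:
        # sorted() gives a deterministic order; os.listdir() order is arbitrary
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if keyword not in name or not os.path.isfile(path):
                continue
            if os.path.abspath(path) == output_abs:
                continue  # never read the file we are writing to
            yield path
```

It could be used as `for path in iter_domain_files(["1", "2", "3", "4", "5"], "output.txt"): ...`, with the line-by-line copy done inside that loop.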

Now you can process the data in a line oriented manner - just modify it before writing to the output file.
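As a rough sketch of what that could look like for the relabelling in the question: each line is corrected as it streams through, so only one line is held in memory at a time. It assumes, as in the sample lines from the question, that every timestep has one ZONE line followed by one STRANDID line and that steps are evenly spaced. The function names, the regular expressions, and the choice to keep the rest of each header line instead of hard-coding N and E are illustrative assumptions, not part of the original code.

```python
import os
import re

# Patterns for the two header fields that carry the time; assumed from the
# sample ZONE / STRANDID lines shown in the question.
ZONE_TIME = re.compile(r'T="[^"]*"')
SOLUTION_TIME = re.compile(r'SOLUTIONTIME=\S+')


def relabel_line(line, timestep):
    """Return (new_line, advance); advance is True when the line closes a step."""
    stripped = line.lstrip()
    if stripped.startswith("ZONE"):
        return ZONE_TIME.sub('T="%.12es"' % timestep, line, count=1), False
    if stripped.startswith("STRANDID"):
        return SOLUTION_TIME.sub("SOLUTIONTIME=%.12e" % timestep, line, count=1), True
    return line, False


def merge_and_relabel(folders, out_path, time_increment=3.154e10):
    """Concatenate every "domain" file under `folders` into `out_path`,
    rewriting the time labels so they increase across files."""
    timestep = 0.0
    with open(out_path, "w") as outfile:
        for folder in folders:
            for name in sorted(os.listdir(folder)):
                if "domain" not in name:
                    continue
                with open(os.path.join(folder, name)) as infile:
                    for line in infile:  # one line in memory at a time
                        new_line, advance = relabel_line(line, timestep)
                        outfile.write(new_line)
                        if advance:
                            timestep += time_increment
```

For the layout in the question this might be called as `merge_and_relabel([str(n) for n in range(1, 6)], "output.txt")`. Doing the correction during the copy also removes the need for the intermediate `domain.txt` file and the second pass over it.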

Memory Error Python Processing Large File Line by Line
