Fastest way to concatenate multiple files column wise - Python

2024/10/18 14:48:39

What is the fastest method to concatenate multiple files column wise (within Python)?

Assume that I have two files with 1,000,000,000 lines and ~200 UTF8 characters per line.

Method 1: Cheating with paste

I could concatenate the two files on a Linux system by using paste in the shell, and I could cheat by calling it through os.system, i.e.:

import os


def concat_files_cheat(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        # Hand the work to the shell's paste command.
        os.system('paste ' + file1 + ' ' + file2 + ' > ' + output)

Method 2: Using nested context manager with zip:

def concat_files_zip(file_path, file1, file2, output_path, output):
    with open(output, 'wb') as fout:
        with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
            for line1, line2 in zip(fin1, fin2):
                fout.write(line1 + '\t' + line2)

Method 3: Using fileinput

Does fileinput iterate through the files in parallel? Or will it iterate through each file sequentially, one after the other?

If it is the former, I would assume it would look like this:

def concat_files_fileinput(file_path, file1, file2, output_path, output):
    with fileinput.input(files=(file1, file2)) as f:
        for line in f:
            line1, line2 = process(line)
            fout.write(line1 + '\t' + line2)

Method 4: Treat them like csv

import csv

with open(output, 'wb') as fout:
    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
        writer = csv.writer(fout)
        reader1, reader2 = csv.reader(fin1), csv.reader(fin2)
        for line1, line2 in zip(reader1, reader2):
            # Each row is a list of fields, so join the two lists;
            # the writer inserts the delimiter between fields.
            writer.writerow(line1 + line2)

Given the data size, which would be the fastest?

Why would one choose one over the other? Would I lose or add information?

For each method how would I choose a different delimiter other than , or \t?

Are there other ways of achieving the same concatenation column wise? Are they as fast?


Of the four methods I'd take the second, but you have to take care of small details in the implementation. (With a few improvements it takes 0.002 seconds, while the original implementation takes about 6 seconds. The file I was working with had 1M rows, but there should not be much difference if the file is 1K times bigger, since we use almost no memory.)

Changes from the original implementation:

  • Use iterators where possible; otherwise memory consumption suffers because the whole file has to be handled at once. (In particular, on Python 2 use itertools.izip instead of zip.)
  • When concatenating strings, use "{}{}".format() or similar; otherwise each + creates a new intermediate string.
  • There's no need to write line by line inside the for loop. You can build the output from an iterator inside a single write call.
  • Small buffers are appealing, but with iterators the difference is very small. Fetching a large chunk of data at once (for example, f1.readlines(1024*1000)) is much slower.
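Whether format really beats + depends on the interpreter version, so it is worth measuring rather than assuming. A small sketch with timeit follows; the sample strings and function names are made up for illustration, and no particular result is implied.

```python
import timeit

# Made-up sample lines standing in for one row from each file.
l1, l2 = "left column\n", "right column\n"


def with_plus():
    return l1 + "\t" + l2


def with_format():
    return "{}\t{}".format(l1, l2)


# Both build the same string; only the construction differs.
assert with_plus() == with_format()

plus_time = timeit.timeit(with_plus, number=200000)
format_time = timeit.timeit(with_format, number=200000)
print("plus:   {:.3f}s".format(plus_time))
print("format: {:.3f}s".format(format_time))
```

Run it on the interpreter you actually deploy on, since the ranking can differ between versions.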


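from itertools import izip  # Python 2; in Python 3 the built-in zip is already lazy


def concat_iter(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
            open(file1, 'r') as f1, \
            open(file2, 'r') as f2:
        # Build the output with one join and a single write call.
        # Note: readlines(1024) is a size hint, so only roughly the first
        # 1024 bytes' worth of lines is read from each file.
        fo.write("".join("{}\t{}".format(l1, l2)
                         for l1, l2 in izip(f1.readlines(1024), f2.readlines(1024))))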
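Since readlines(1024) is only a size hint, a variant that streams both files to the end may be closer to what the question needs. The following is a minimal Python 3 sketch, not the code that was profiled here: the name concat_stream and the delimiter parameter are my own assumptions. It also strips the newline from the left-hand line so both halves end up on the same row.

```python
def concat_stream(file1, file2, output, delimiter="\t"):
    """Join file1 and file2 line by line into output, streaming both inputs."""
    with open(file1, "r", encoding="utf-8") as f1, \
            open(file2, "r", encoding="utf-8") as f2, \
            open(output, "w", encoding="utf-8") as fo:
        # zip() is lazy in Python 3, so only one pair of lines is held at a time.
        # Like zip itself, this stops at the end of the shorter file.
        fo.writelines(
            "{}{}{}".format(l1.rstrip("\n"), delimiter, l2)
            for l1, l2 in zip(f1, f2)
        )
```

Calling concat_stream("left.txt", "right.txt", "joined.txt", delimiter=",") would give a comma-separated result. If the files can differ in length and nothing should be dropped, itertools.zip_longest with a fillvalue is one option.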

Profile of the original solution.

We see that the biggest costs are in write and zip (mainly because it does not use iterators and has to hold and process the whole file in memory).

~/personal/python-algorithms/files$ python -m cProfile
10000006 function calls in 5.208 seconds

Ordered by: standard name

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000    5.208    5.208 <module>)
      1    2.422    2.422    5.208    5.208
           0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
9999999    1.713    0.000    1.713    0.000 {method 'write' of 'file' objects}
      3    0.000    0.000    0.000    0.000 {open}
      1    1.072    1.072    1.072    1.072 {zip}


~/personal/python-algorithms/files$ python -m cProfile
3731 function calls in 0.002 seconds

Ordered by: standard name

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000    0.002    0.002 <module>)
      1    0.000    0.000    0.002    0.002
           0.001    0.000    0.001    0.000 <genexpr>)
      1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
   1860    0.001    0.000    0.001    0.000 {method 'format' of 'str' objects}
      1    0.000    0.000    0.002    0.002 {method 'join' of 'str' objects}
      2    0.000    0.000    0.000    0.000 {method 'readlines' of 'file' objects}
      1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
      3    0.000    0.000    0.000    0.000 {open}
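To collect this kind of profile from inside a script rather than with python -m cProfile, something like the following should work. It is only a sketch: profile_call is a hypothetical helper name and join_pairs is a stand-in workload, not the script that produced the numbers above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    # Sort by time spent inside each function itself (tottime).
    pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(10)
    return result, stream.getvalue()


def join_pairs(left, right):
    # Stand-in workload: the same format/join pattern as concat_iter.
    return "".join("{}\t{}".format(a, b) for a, b in zip(left, right))


result, report = profile_call(join_pairs, ["a\n"] * 1000, ["b\n"] * 1000)
print(report)
```

Sorting by tottime puts the functions that spend the most time in their own body at the top, which is usually the quickest way to spot a hot write or zip call.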

In Python 3 the improved version is even faster, because lazy iterators are built in (zip behaves like izip) and we don't need to import any library.

~/personal/python-algorithms/files$ python3.5 -m cProfile 
843 function calls (842 primitive calls) in 0.001 seconds
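If the same script has to run on both interpreters, a common pattern is to fall back to the built-in when itertools.izip is missing. This is a sketch of that pattern; the alias lazy_zip and the sample lines are only for illustration.

```python
try:
    # Python 2: the built-in zip builds a full list, izip is the lazy variant.
    from itertools import izip as lazy_zip
except ImportError:
    # Python 3: itertools.izip no longer exists and the built-in zip is lazy.
    lazy_zip = zip

pairs = lazy_zip(["a1\n", "a2\n"], ["b1\n", "b2\n"])
first = next(pairs)  # only the first pair has been produced so far
```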

It's also nice to look at memory consumption and file system accesses, which confirm what we said above:

$ /usr/bin/time -v python
Command being timed: "python"
User time (seconds): 0.01
Maximum resident set size (kbytes): 7120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 914
File system outputs: 40
Socket messages sent: 0
Socket messages received: 0

$ /usr/bin/time -v python
Command being timed: "python"
User time (seconds): 5.64
Maximum resident set size (kbytes): 1752852
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 427697
File system inputs: 0
File system outputs: 327696
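A similar contrast can be observed from inside Python with tracemalloc. Note that it reports peak allocations made by the interpreter, not the resident set size that /usr/bin/time shows, so the numbers are not directly comparable. This is a rough sketch with hypothetical helper names and a small temporary file standing in for the real data:

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, *args):
    """Return the peak number of bytes allocated while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def count_eager(path):
    # readlines() materialises every line in a list before the loop starts.
    with open(path) as f:
        return sum(1 for _ in f.readlines())


def count_lazy(path):
    # Iterating the file object yields one line at a time.
    with open(path) as f:
        return sum(1 for _ in f)


# Stand-in data: 50,000 lines of about 200 characters each.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines("x" * 200 + "\n" for _ in range(50000))
    sample_path = tmp.name

eager_peak = peak_bytes(count_eager, sample_path)
lazy_peak = peak_bytes(count_lazy, sample_path)
os.remove(sample_path)
print("eager peak: {} bytes, lazy peak: {} bytes".format(eager_peak, lazy_peak))
```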

