Fastest way to concatenate multiple files column wise - Python

2024/10/18 14:48:39

What is the fastest method to concatenate multiple files column wise (within Python)?

Assume that I have two files with 1,000,000,000 lines and ~200 UTF8 characters per line.

Method 1: Cheating with paste

I could concatenate the two files on a Linux system by using paste in the shell, and I could cheat with os.system, i.e.:

def concat_files_cheat(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        os.system('paste ' + file1 + ' ' + file2 + ' > ' + output)
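A slightly safer variant of the same shell trick, sketched with subprocess instead of os.system (the function name is illustrative; passing the arguments as a list avoids the quoting problems that string concatenation into os.system can cause):

```python
import os
import subprocess

def concat_files_paste(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        # redirect paste's stdout straight into the output file;
        # check=True raises if paste exits with a non-zero status
        with open(output, 'w') as fout:
            subprocess.run(['paste', file1, file2], stdout=fout, check=True)
```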

Method 2: Using nested context manager with zip:

def concat_files_zip(file_path, file1, file2, output_path, output):
    with open(output, 'wb') as fout:
        with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
            for line1, line2 in zip(fin1, fin2):
                fout.write(line1 + '\t' + line2)

Method 3: Using fileinput

Does fileinput iterate through the files in parallel, or does it iterate through each file sequentially, one after the other?

If it is the former, I would assume it would look like this:

def concat_files_fileinput(file_path, file1, file2, output_path, output):
    with fileinput.input(files=(file1, file2)) as f:
        for line in f:
            line1, line2 = process(line)
            fout.write(line1 + '\t' + line2)

Method 4: Treat them like csv

with open(output, 'wb') as fout:
    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
        writer = csv.writer(fout, delimiter='\t')
        reader1, reader2 = csv.reader(fin1), csv.reader(fin2)
        for line1, line2 in zip(reader1, reader2):
            # line1 and line2 are lists of fields, so concatenate the lists
            writer.writerow(line1 + line2)

Given the data size, which would be the fastest?

Why would one choose one over the other? Would I lose or add information?

For each method how would I choose a different delimiter other than , or \t?

Are there other ways of achieving the same concatenation column wise? Are they as fast?

Answer

Of the four methods I'd take the second, but you have to take care of small details in the implementation. With a few improvements it takes 0.002 seconds, while the original implementation takes about 6 seconds; the file I was working with had 1M rows. There should not be much difference if the file is 1,000 times bigger, since we use almost no memory.

Changes from the original implementation:

  • Use iterators where possible; otherwise memory consumption is penalized and you have to handle the whole file at once. (On Python 2, use itertools.izip instead of zip, which builds a full list in memory.)
  • When concatenating strings, use "{}{}".format(l1, l2) or "%s%s" % (l1, l2); repeated + creates a new string instance at every step.
  • There is no need to write line by line inside the for loop; you can pass an iterator to the write call.
  • Small buffers are interesting, but when iterators are used the difference is very small; fetching all the data at once instead (for example with f1.readlines(1024 * 1000)) is much slower.

Example:

from itertools import izip  # Python 2; on Python 3 the built-in zip is already lazy

def concat_iter(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
         open(file1, 'r') as f1, \
         open(file2, 'r') as f2:
        # iterate both files lazily and stream the joined lines out;
        # rstrip drops the first file's trailing newline so the tab
        # sits between the two columns
        fo.writelines("{}\t{}".format(l1.rstrip('\n'), l2)
                      for l1, l2 in izip(f1, f2))
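Regarding the delimiter question: with this approach the separator can simply be made a parameter. A Python 3 sketch (the function name and default are illustrative):

```python
def concat_iter_delim(file1, file2, output, delim='\t'):
    # same streaming approach, with a configurable column separator
    with open(output, 'w', 1024) as fo, \
         open(file1) as f1, \
         open(file2) as f2:
        # zip is lazy in Python 3; rstrip removes the first file's
        # newline so the delimiter joins the two columns
        fo.writelines("{}{}{}".format(l1.rstrip('\n'), delim, l2)
                      for l1, l2 in zip(f1, f2))
```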

Profiler output for the original solution:

We see that the biggest problem is in write and zip (mainly because of not using iterators and having to handle/process the whole file in memory).

~/personal/python-algorithms/files$ python -m cProfile sol_original.py 
         10000006 function calls in 5.208 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    5.208    5.208 sol_original.py:1(<module>)
        1    2.422    2.422    5.208    5.208 sol_original.py:1(concat_files_zip)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  9999999    1.713    0.000    1.713    0.000 {method 'write' of 'file' objects}
        3    0.000    0.000    0.000    0.000 {open}
        1    1.072    1.072    1.072    1.072 {zip}

Profiler output for the improved solution:

~/personal/python-algorithms/files$ python -m cProfile sol1.py

         3731 function calls in 0.002 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.002    0.002 sol1.py:1(<module>)
        1    0.000    0.000    0.002    0.002 sol1.py:3(concat_iter6)
     1861    0.001    0.000    0.001    0.000 sol1.py:5(<genexpr>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1860    0.001    0.000    0.001    0.000 {method 'format' of 'str' objects}
        1    0.000    0.000    0.002    0.002 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'readlines' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
        3    0.000    0.000    0.000    0.000 {open}

And in Python 3 it is even faster, because lazy iterators are built in and we don't need to import any library.
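The Python 3 equivalent of the example above can then drop the itertools import entirely, since zip already yields pairs lazily (a minimal sketch; the function name is illustrative):

```python
def concat_iter_py3(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
         open(file1) as f1, \
         open(file2) as f2:
        # built-in zip is lazy in Python 3; strip the first line's
        # newline so the tab separates the columns
        fo.writelines("{}\t{}".format(l1.rstrip('\n'), l2)
                      for l1, l2 in zip(f1, f2))
```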

~/personal/python-algorithms/files$ python3.5 -m cProfile sol2.py 
843 function calls (842 primitive calls) in 0.001 seconds
[...]

It is also very nice to see that memory consumption and file-system accesses confirm what we said before:

$ /usr/bin/time -v python sol1.py
Command being timed: "python sol1.py"
User time (seconds): 0.01
[...]
Maximum resident set size (kbytes): 7120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 914
[...]
File system outputs: 40
Socket messages sent: 0
Socket messages received: 0

$ /usr/bin/time -v python sol_original.py
Command being timed: "python sol_original.py"
User time (seconds): 5.64
[...]
Maximum resident set size (kbytes): 1752852
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 427697
[...]
File system inputs: 0
File system outputs: 327696