Fastest way to concatenate multiple files column wise - Python

2024/10/18 14:48:39

What is the fastest method to concatenate multiple files column wise (within Python)?

Assume that I have two files with 1,000,000,000 lines and ~200 UTF8 characters per line.

Method 1: Cheating with paste

I could concatenate the two files on a Linux system by using paste in the shell, and I could cheat with os.system, i.e.:

def concat_files_cheat(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        os.system('paste ' + file1 + ' ' + file2 + ' > ' + output)
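A slightly safer variant of the same shell trick, sketched with subprocess instead of os.system (the function name is illustrative; passing the arguments as a list avoids the quoting problems that string concatenation into os.system can cause):

```python
import os
import subprocess

def concat_files_paste(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        # redirect paste's stdout straight into the output file;
        # check=True raises if paste exits with a non-zero status
        with open(output, 'w') as fout:
            subprocess.run(['paste', file1, file2], stdout=fout, check=True)
```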

Method 2: Using nested context manager with zip:

def concat_files_zip(file_path, file1, file2, output_path, output):
    with open(output, 'wb') as fout:
        with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
            for line1, line2 in zip(fin1, fin2):
                fout.write(line1 + '\t' + line2)

Method 3: Using fileinput

Does fileinput iterate through the files in parallel, or does it iterate through each file sequentially, one after the other?

If it is the former, I would assume it would look like this:

def concat_files_fileinput(file_path, file1, file2, output_path, output):
    with fileinput.input(files=(file1, file2)) as f:
        for line in f:
            line1, line2 = process(line)
            fout.write(line1 + '\t' + line2)

Method 4: Treat them like csv

with open(output, 'wb') as fout:
    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
        writer = csv.writer(fout, delimiter='\t')
        reader1, reader2 = csv.reader(fin1), csv.reader(fin2)
        for line1, line2 in zip(reader1, reader2):
            # line1 and line2 are lists of fields, so concatenate the lists
            writer.writerow(line1 + line2)

Given the data size, which would be the fastest?

Why would one choose one over the other? Would I lose or add information?

For each method how would I choose a different delimiter other than , or \t?

Are there other ways of achieving the same concatenation column wise? Are they as fast?

Answer

Of the four methods I'd take the second, but you have to take care of small details in the implementation. With a few improvements it takes 0.002 seconds, while the original implementation takes about 6 seconds; the file I was working with had 1M rows. There should not be much difference if the file is 1,000 times bigger, since we use almost no memory.

Changes from the original implementation:

  • Use iterators where possible; otherwise memory consumption is penalized and you have to handle the whole file at once. (On Python 2, use itertools.izip instead of zip, which builds a full list in memory.)
  • When concatenating strings, use "{}{}".format(l1, l2) or "%s%s" % (l1, l2); repeated + creates a new string instance at every step.
  • There is no need to write line by line inside the for loop; you can pass an iterator to the write call.
  • Small buffers are interesting, but when iterators are used the difference is very small; fetching all the data at once instead (for example with f1.readlines(1024 * 1000)) is much slower.

Example:

from itertools import izip  # Python 2; on Python 3 the built-in zip is already lazy

def concat_iter(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
         open(file1, 'r') as f1, \
         open(file2, 'r') as f2:
        # iterate both files lazily and stream the joined lines out;
        # rstrip drops the first file's trailing newline so the tab
        # sits between the two columns
        fo.writelines("{}\t{}".format(l1.rstrip('\n'), l2)
                      for l1, l2 in izip(f1, f2))
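Regarding the delimiter question: with this approach the separator can simply be made a parameter. A Python 3 sketch (the function name and default are illustrative):

```python
def concat_iter_delim(file1, file2, output, delim='\t'):
    # same streaming approach, with a configurable column separator
    with open(output, 'w', 1024) as fo, \
         open(file1) as f1, \
         open(file2) as f2:
        # zip is lazy in Python 3; rstrip removes the first file's
        # newline so the delimiter joins the two columns
        fo.writelines("{}{}{}".format(l1.rstrip('\n'), delim, l2)
                      for l1, l2 in zip(f1, f2))
```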

Profiler output for the original solution:

We see that the biggest problem is in write and zip (mainly because of not using iterators and having to handle/process the whole file in memory).

~/personal/python-algorithms/files$ python -m cProfile sol_original.py 
         10000006 function calls in 5.208 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    5.208    5.208 sol_original.py:1(<module>)
        1    2.422    2.422    5.208    5.208 sol_original.py:1(concat_files_zip)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  9999999    1.713    0.000    1.713    0.000 {method 'write' of 'file' objects}
        3    0.000    0.000    0.000    0.000 {open}
        1    1.072    1.072    1.072    1.072 {zip}

Profiler output for the improved solution:

~/personal/python-algorithms/files$ python -m cProfile sol1.py

         3731 function calls in 0.002 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.002    0.002 sol1.py:1(<module>)
        1    0.000    0.000    0.002    0.002 sol1.py:3(concat_iter6)
     1861    0.001    0.000    0.001    0.000 sol1.py:5(<genexpr>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1860    0.001    0.000    0.001    0.000 {method 'format' of 'str' objects}
        1    0.000    0.000    0.002    0.002 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'readlines' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
        3    0.000    0.000    0.000    0.000 {open}

And in Python 3 it is even faster, because lazy iterators are built in and we don't need to import any library.
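The Python 3 equivalent of the example above can then drop the itertools import entirely, since zip already yields pairs lazily (a minimal sketch; the function name is illustrative):

```python
def concat_iter_py3(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
         open(file1) as f1, \
         open(file2) as f2:
        # built-in zip is lazy in Python 3; strip the first line's
        # newline so the tab separates the columns
        fo.writelines("{}\t{}".format(l1.rstrip('\n'), l2)
                      for l1, l2 in zip(f1, f2))
```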

~/personal/python-algorithms/files$ python3.5 -m cProfile sol2.py 
843 function calls (842 primitive calls) in 0.001 seconds
[...]

It is also very nice to see that memory consumption and file-system accesses confirm what we said before:

$ /usr/bin/time -v python sol1.py
Command being timed: "python sol1.py"
User time (seconds): 0.01
[...]
Maximum resident set size (kbytes): 7120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 914
[...]
File system outputs: 40
Socket messages sent: 0
Socket messages received: 0

$ /usr/bin/time -v python sol_original.py
Command being timed: "python sol_original.py"
User time (seconds): 5.64
[...]
Maximum resident set size (kbytes): 1752852
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 427697
[...]
File system inputs: 0
File system outputs: 327696