Efficient way to split a large text file in Python [duplicate]

2024/10/14 21:22:22

This follows up a previous question about improving the time performance of a function in Python: I need to find an efficient way to split my text file.

I have the following unsorted text file (more than 32 GB):

....................
0 274 593869.99 6734999.96 121.83 1,
0 273 593869.51 6734999.92 121.57 1,
0 273 593869.15 6734999.89 121.57 1,
0 273 593868.79 6734999.86 121.65 1,
0 272 593868.44 6734999.84 121.65 1,
0 273 593869.00 6734999.94 124.21 1,
0 273 593868.68 6734999.92 124.32 1,
0 274 593868.39 6734999.90 124.44 1,
0 275 593866.94 6734999.71 121.37 1,
0 273 593868.73 6734999.99 127.28 1,
.............................

The first and second columns are the id (e.g. 0 273) locating the x, y, z point in a grid.

def point_grid_id(x, y, minx, maxy, distx, disty):
    """give id (row, col)"""
    col = int((x - minx) / distx)
    row = int((maxy - y) / disty)
    return (row, col)

(minx, maxy) is the origin of my grid, with tile size distx, disty. The tile ids are:

tiles_id = [j for j in np.ndindex(ny, nx)]  # ny = number of rows, nx = number of columns
# i.e. [(0,0), (0,1), (0,2), ..., (ny-1, nx-1)]
n = len(tiles_id)
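For illustration, here is how point_grid_id maps the first sample row to a tile id; the grid origin and tile size below are hypothetical stand-ins, not the asker's real values:

# Hypothetical grid parameters -- substitute your actual origin and tile size.
minx, maxy = 593000.0, 6735000.0
distx, disty = 100.0, 100.0

x, y = 593869.99, 6734999.96  # x, y from the first sample row
print(point_grid_id(x, y, minx, maxy, distx, disty))  # -> (0, 8)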

I need to slice the ~32 GB file into n (= len(tiles_id)) files.

I can do this without sorting, but it means reading the file n times. For this reason I wish to find an efficient method to split the file, starting from (0,0) (= tiles_id[0]). After that I can read each of the split files just once.

Answer

Sorting is hardly feasible for a 32 GB file, whether you use Python or a command-line tool (sort). A database may seem like overkill, but could be used. However, if you would rather not use a database, I would suggest simply splitting the source file into per-tile files keyed by the tile id.

You read a line, make a file name out of its tile id, and append the line to that file, continuing until the source file is exhausted. It is not going to be very fast, but at least it has O(N) complexity, unlike sorting.

And, of course, sorting the individual files and concatenating them afterwards is possible. The main bottleneck in sorting a 32 GB file would be memory, not CPU.
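If you take that route, here is a minimal sketch of the follow-up step; it assumes the tile_*.tmp files produced by the splitting code shown below, and it sorts each tile's lines as plain text (swap in whatever sort key you actually need):

import glob

def sort_and_concatenate(out_name):
    # Each tile file is small enough to sort in memory,
    # then gets appended to a single output file.
    with open(out_name, 'w') as out:
        # Note: glob order is lexicographic; pad or parse the ids
        # in the file names if strict numeric tile order matters.
        for fn in sorted(glob.glob('tile_*.tmp')):
            with open(fn) as f:
                for line in sorted(f):
                    out.write(line)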

Here is the splitting code itself, I think:

def temp_file_name(l):
    # The first two columns are the tile id.
    id0, id1 = l.split()[:2]
    return "tile_%s_%s.tmp" % (id0, id1)

def split_file(name):
    ofiles = {}
    try:
        with open(name) as f:
            for l in f:
                if l:
                    fn = temp_file_name(l)
                    if fn not in ofiles:
                        ofiles[fn] = open(fn, 'w')
                    ofiles[fn].write(l)
    finally:
        for of in ofiles.values():  # itervalues() in Python 2
            of.close()

split_file('srcdata1.txt')

But if there are a lot of tiles, more than the number of files you can keep open at once, you may do this instead:

def split_file(name):
    with open(name) as f:
        for l in f:
            if l:
                fn = temp_file_name(l)
                # Reopen in append mode for every line: slow, but never
                # more than two files are open at once.
                with open(fn, 'a') as of:
                    of.write(l)

And the most perfectionist way is to close some files and remove them from the dictionary after reaching a limit on the number of open files.
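A minimal sketch of that idea, reusing temp_file_name from above with an OrderedDict as a small LRU cache of open handles; MAX_OPEN and the eviction policy are my assumptions, not part of the original answer:

from collections import OrderedDict

MAX_OPEN = 100  # keep well below the OS limit on open file descriptors

def split_file_limited(name):
    ofiles = OrderedDict()
    try:
        with open(name) as f:
            for l in f:
                if not l.strip():
                    continue
                fn = temp_file_name(l)
                if fn in ofiles:
                    ofiles.move_to_end(fn)  # mark as most recently used
                else:
                    if len(ofiles) >= MAX_OPEN:
                        _, oldest = ofiles.popitem(last=False)  # evict LRU handle
                        oldest.close()
                    # Append mode: a tile's file may be closed and reopened later.
                    ofiles[fn] = open(fn, 'a')
                ofiles[fn].write(l)
    finally:
        for of in ofiles.values():
            of.close()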

