Efficient way to split a large text file in Python [duplicate]

2024/10/14 21:22:22

This follows up a previous question about improving the time performance of a function in Python: I need to find an efficient way to split my text file.

I have the following unsorted text file (more than 32 GB):

....................
0 274 593869.99 6734999.96 121.83 1,
0 273 593869.51 6734999.92 121.57 1,
0 273 593869.15 6734999.89 121.57 1,
0 273 593868.79 6734999.86 121.65 1,
0 272 593868.44 6734999.84 121.65 1,
0 273 593869.00 6734999.94 124.21 1,
0 273 593868.68 6734999.92 124.32 1,
0 274 593868.39 6734999.90 124.44 1,
0 275 593866.94 6734999.71 121.37 1,
0 273 593868.73 6734999.99 127.28 1,
.............................

The first and second columns are the id (e.g. 0 273) locating the x, y, z point in a grid.

def point_grid_id(x, y, minx, maxy, distx, disty):
    """give id (row, col)"""
    col = int((x - minx) / distx)
    row = int((maxy - y) / disty)
    return (row, col)

(minx, maxy) is the origin of my grid, with tile size distx, disty. The tile ids are:

tiles_id = [j for j in np.ndindex(ny, nx)]  # ny = number of rows, nx = number of columns
# i.e. [(0,0), (0,1), (0,2), ..., (ny-1, nx-1)]
n = len(tiles_id)
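For illustration, here is how point_grid_id maps the first sample row to a tile id; the grid origin and tile size below are hypothetical stand-ins, not the asker's real values:

# Hypothetical grid parameters -- substitute your actual origin and tile size.
minx, maxy = 593000.0, 6735000.0
distx, disty = 100.0, 100.0

x, y = 593869.99, 6734999.96  # x, y from the first sample row
print(point_grid_id(x, y, minx, maxy, distx, disty))  # -> (0, 8)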

I need to slice the ~32 GB file into n (= len(tiles_id)) files.

I can do this without sorting, but it means reading the file n times. For this reason I wish to find an efficient method to split the file, starting from (0,0) (= tiles_id[0]). After that I can read each of the split files just once.

Answer

Sorting is hardly feasible for a 32 GB file, whether you use Python or a command-line tool (sort). A database may seem like overkill, but could be used. However, if you would rather not use a database, I would suggest simply splitting the source file into per-tile files keyed by the tile id.

You read a line, make a file name out of its tile id, and append the line to that file, continuing until the source file is exhausted. It is not going to be very fast, but at least it has O(N) complexity, unlike sorting.

And, of course, sorting the individual files and concatenating them afterwards is possible. The main bottleneck in sorting a 32 GB file would be memory, not CPU.
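If you take that route, here is a minimal sketch of the follow-up step; it assumes the tile_*.tmp files produced by the splitting code shown below, and it sorts each tile's lines as plain text (swap in whatever sort key you actually need):

import glob

def sort_and_concatenate(out_name):
    # Each tile file is small enough to sort in memory,
    # then gets appended to a single output file.
    with open(out_name, 'w') as out:
        # Note: glob order is lexicographic; pad or parse the ids
        # in the file names if strict numeric tile order matters.
        for fn in sorted(glob.glob('tile_*.tmp')):
            with open(fn) as f:
                for line in sorted(f):
                    out.write(line)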

Here is the splitting code itself, I think:

def temp_file_name(l):
    # The first two columns are the tile id.
    id0, id1 = l.split()[:2]
    return "tile_%s_%s.tmp" % (id0, id1)

def split_file(name):
    ofiles = {}
    try:
        with open(name) as f:
            for l in f:
                if l:
                    fn = temp_file_name(l)
                    if fn not in ofiles:
                        ofiles[fn] = open(fn, 'w')
                    ofiles[fn].write(l)
    finally:
        for of in ofiles.values():  # itervalues() in Python 2
            of.close()

split_file('srcdata1.txt')

But if there are a lot of tiles, more than the number of files you can keep open at once, you may do this instead:

def split_file(name):
    with open(name) as f:
        for l in f:
            if l:
                fn = temp_file_name(l)
                # Reopen in append mode for every line: slow, but never
                # more than two files are open at once.
                with open(fn, 'a') as of:
                    of.write(l)

And the most perfectionist way is to close some files and remove them from the dictionary after reaching a limit on the number of open files.
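A minimal sketch of that idea, reusing temp_file_name from above with an OrderedDict as a small LRU cache of open handles; MAX_OPEN and the eviction policy are my assumptions, not part of the original answer:

from collections import OrderedDict

MAX_OPEN = 100  # keep well below the OS limit on open file descriptors

def split_file_limited(name):
    ofiles = OrderedDict()
    try:
        with open(name) as f:
            for l in f:
                if not l.strip():
                    continue
                fn = temp_file_name(l)
                if fn in ofiles:
                    ofiles.move_to_end(fn)  # mark as most recently used
                else:
                    if len(ofiles) >= MAX_OPEN:
                        _, oldest = ofiles.popitem(last=False)  # evict LRU handle
                        oldest.close()
                    # Append mode: a tile's file may be closed and reopened later.
                    ofiles[fn] = open(fn, 'a')
                ofiles[fn].write(l)
    finally:
        for of in ofiles.values():
            of.close()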

