Divide .csv file into chunks with Python

2024/9/28 17:31:36

I have a large .csv file that is well over 300 GB. I would like to chunk it into smaller files of 100,000,000 rows each (each row is approximately 55-60 bytes).

I wrote the following code:

import pandas as pd

# Read the file lazily, 100,000,000 rows at a time
df = pd.read_csv('/path/to/really/big.csv', header=None, chunksize=100000000)

count = 1
for chunk in df:
    name = '/output/to/this/directory/file_%s.csv' % count
    chunk.to_csv(name, header=None, index=None)
    print(count)
    count += 1

This code works fine, and I have plenty of disk space to hold the roughly 5.5-6 GB written at a time, but it's slow.

Is there a better way?

EDIT

I have written the following iterative solution:

import csv

with open('/path/to/really/big.csv', 'r') as csvfile:
    read_rows = csv.reader(csvfile)
    file_count = 1
    row_count = 1
    f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    for row in read_rows:
        # Re-join the parsed fields and write the row to the current chunk
        f.write(','.join(row) + '\n')
        row_count += 1
        if row_count % 100000000 == 0:
            f.close()
            file_count += 1
            f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    f.close()
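Since the file only needs to be split on line boundaries, parsing each row through the csv module and re-joining it adds avoidable overhead. A minimal sketch of a plain line-copying variant, assuming no fields contain quoted newlines and using the same placeholder paths:

chunk_size = 100000000  # rows per output file
file_count = 1
out = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
with open('/path/to/really/big.csv', 'r') as src:
    for row_count, line in enumerate(src, start=1):
        out.write(line)  # each line already ends with its newline
        if row_count % chunk_size == 0:
            out.close()
            file_count += 1
            out = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
out.close()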

EDIT 2

I would like to call attention to Vor's comment about using the Unix/Linux split command; this is the fastest solution I have found.

Answer

There is an existing tool for this on Unix/Linux:

split -l 100000 -d source destination

This splits source into chunks of the given number of lines; with -d, split adds a two-digit numeric suffix (destination00, destination01, ...) to the destination prefix for each chunk.
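If the split needs to be driven from a Python script anyway, one option is to shell out to split with the chunk size from the question. A sketch, assuming GNU split is on the PATH and using the same placeholder paths:

import subprocess

# Invoke GNU split: 100,000,000 lines per chunk, numeric suffixes
subprocess.run(
    ['split', '-l', '100000000', '-d',
     '/path/to/really/big.csv',
     '/output/to/this/directory/file_'],
    check=True,
)

Because split copies raw bytes without parsing fields, it avoids the per-row overhead of the pandas and csv approaches above.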

