I have a large .csv file that is well over 300 GB. I would like to chunk it into smaller files of 100,000,000 rows each (each row has approximately 55-60 bytes).
I wrote the following code:
import pandas as pd
df = pd.read_csv('/path/to/really/big.csv',header=None,chunksize=100000000)
count = 1
for chunk in df:
    name = '/output/to/this/directory/file_%s.csv' % count
    chunk.to_csv(name, header=None, index=None)
    print(count)
    count += 1
This code works fine, and I have plenty of disk space to store the approximately 5.5-6 GB at a time, but it's slow.
Is there a better way?
EDIT
I have written the following iterative solution:
import csv

with open('/path/to/really/big.csv', 'r') as csvfile:
    read_rows = csv.reader(csvfile)
    file_count = 1
    row_count = 1
    f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    for row in read_rows:
        # re-join the parsed fields and add a newline so the output stays valid CSV
        f.write(','.join(row) + '\n')
        row_count += 1
        if row_count % 100000000 == 0:
            f.close()
            file_count += 1
            f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    f.close()
EDIT 2
I would like to call attention to Vor's comment about using the Unix/Linux split command; it is the fastest solution I have found, as shown below.
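For reference, a minimal sketch of that approach, assuming a POSIX split is available; the line count matches the chunk size above, and the output prefix chunk_ is just an illustrative name:

split -l 100000000 /path/to/really/big.csv /output/to/this/directory/chunk_

By default split appends suffixes like aa, ab, and so on to the prefix; if you are on GNU coreutils, it also accepts options such as -d for numeric suffixes, which you can use to get filenames closer to the file_N.csv naming used above.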