When using dask.to_parquet(df, filename), a subfolder filename is created and several files are written to that folder, whereas pandas.to_parquet(df, filename) writes exactly one file. Can I use dask's to_parquet (without using compute() to create a pandas df) to just write a single file?
There are good reasons to have multiple files (in particular when a single big file doesn't fit in memory), but if you really need only one, you could try this:
import dask.dataframe as dd
import pandas as pd
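import numpy as np
df = pd.DataFrame(np.random.randn(1_000, 5), columns=list("abcde"))  # string column names for Parquet
df = dd.from_pandas(df, npartitions=4)
# One partition means one part file, still written inside the folder "data"
df.repartition(npartitions=1).to_parquet("data")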
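Note that even with one partition, dask still creates the folder and puts a single part file (named like part.0.parquet) inside it, possibly next to metadata files depending on your version and options. If you need a real standalone file, one option is to move that part file out afterwards. Here is a minimal sketch using only the standard library; promote_single_part is a hypothetical helper name of mine, not part of dask, and it assumes the folder holds exactly one *.parquet file:

```python
import shutil
from pathlib import Path


def promote_single_part(directory, target):
    """Move the only *.parquet part file in `directory` to `target`,
    then delete `directory`. Hypothetical helper, not a dask API."""
    directory = Path(directory)
    target = Path(target)
    # Metadata files such as _metadata do not match this pattern
    parts = sorted(directory.glob("*.parquet"))
    if len(parts) != 1:
        raise ValueError(
            f"expected exactly one part file in {directory}, found {len(parts)}"
        )
    shutil.move(str(parts[0]), str(target))
    # Remove the folder along with any leftover metadata files
    shutil.rmtree(directory)
    return target
```

After the to_parquet("data") call above you would run promote_single_part("data", "data.parquet"), which should leave you with one file that pandas.read_parquet can open directly. Keep in mind that repartition(npartitions=1) pulls everything into a single partition, so this only makes sense when the data fits in the memory of one worker.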