Force dask to_parquet to write single file

2024/11/16 19:47:50

When using dask.to_parquet(df, filename) a subfolder filename is created and several files are written to that folder, whereas pandas.to_parquet(df, filename) writes exactly one file. Can I use dask's to_parquet (without using compute() to create a pandas df) to just write a single file?

Answer

There is a reasons to have multiple files (in particular when a single big file doesn't fit in memory) but if you really need 1 only you could try this

import dask.dataframe as dd
import pandas as pd
import numpy as npdf = pd.DataFrame(np.random.randn(1_000,5))df = dd.from_pandas(df, npartitions=4)
df.repartition(npartitions=1).to_parquet("data")
https://en.xdnf.cn/q/71304.html

Related Q&A

Unable to get python embedded to work with zipd library

Im trying to embed python, and provide the dll and a zip of the python libraries and not use any installed python. That is, if a user doesnt have python, I want my code to work using the provided dll/…

Convert integer to a random but deterministically repeatable choice

How do I convert an unsigned integer (representing a user ID) to a random looking but actually a deterministically repeatable choice? The choice must be selected with equal probability (irrespective o…

Using python opencv to load image from zip

I am able to successfully load an image from a zip:with zipfile.ZipFile(test.zip, r) as zfile:data = zfile.read(test.jpg)# how to open this using imread or imdecode?The question is: how can I open thi…

DeprecationWarning: Function when moving app (removed titlebar) - PySide6

I get when I move the App this Warning: C:\Qt\Login_Test\main.py:48: DeprecationWarning: Function: globalPos() const is marked as deprecated, please check the documentation for more information.self.dr…

Architecture solution for Python Web application

Were setting up a Python REST web application. Right now, were using WSGI, but we might do some changes to that in the future (using Twisted, for example, to improve on scalability or some other featu…

Capture image for processing

Im using Python with PIL and SciPy. i want to capture an image from a webcam then process it further using numpy and Scipy. Can somebody please help me out with the code.Here is the code there is a pre…

Loading Magnet LINK using Rasterbar libtorrent in Python

How would one load a Magnet link via rasterbar libtorrent python binding?

Python currying with any number of variables

I am trying to use currying to make a simple functional add in Python. I found this curry decorator here.def curry(func): def curried(*args, **kwargs):if len(args) + len(kwargs) >= func.__code__…

Python - Display rows with repeated values in csv files

I have a .csv file with several columns, one of them filled with random numbers and I want to find duplicated values there. In case there are - strange case, but its what I want to check after all -, I…

Defining __getattr__ and __getitem__ on a function has no effect

Disclaimer This is just an exercise in meta-programming, it has no practical purpose.Ive assigned __getitem__ and __getattr__ methods on a function object, but there is no effect...def foo():print &quo…