Row by row processing of a Dask DataFrame

2024/10/9 18:19:01

I need to process a large file and change some values.

I would like to do something like this:

    for index, row in dataFrame.iterrows():
        foo = doSomeStuffWith(row)
        lol = doOtherStuffWith(row)
        dataFrame['colx'][index] = foo
        dataFrame['coly'][index] = lol

Unfortunately, I cannot do dataFrame['colx'][index] = foo!

I have a large number of rows and need to process a large number of columns, so I'm afraid that Dask may read the file several times if I call dataFrame.apply(...) once for each column.

Other solutions would be to manually break my data into chunks and use pandas, or to just put everything in a database. But it would be nice if I could keep using my .csv and let Dask do the chunk processing for me!

Thanks for your help.

Answer

In general, iterating over a dataframe, whether pandas or Dask, is likely to be quite slow. Additionally, Dask does not support row-wise element insertion. This kind of workload is difficult to scale.

Instead, I recommend using dd.Series.where (see this answer) or doing your iteration in a function (after making a copy so as not to operate in place) and then using map_partitions to call that function on each of the pandas DataFrames that make up your Dask DataFrame.
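As a rough illustration of the map_partitions approach, here is a minimal sketch. The column names a and b, the bodies of do_some_stuff_with and do_other_stuff_with, and the process_csv helper are all hypothetical stand-ins for whatever your real data and row logic look like. The point is only that both new columns are filled in one pass over each partition.

```python
import pandas as pd


def do_some_stuff_with(row):
    # Hypothetical stand-in for doSomeStuffWith(row) from the question.
    return row["a"] + row["b"]


def do_other_stuff_with(row):
    # Hypothetical stand-in for doOtherStuffWith(row) from the question.
    return row["a"] * 2


def process_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    """Row-by-row work on one partition, which is a plain pandas DataFrame."""
    pdf = pdf.copy()  # copy so the input partition is not modified in place
    colx, coly = [], []
    for _, row in pdf.iterrows():
        colx.append(do_some_stuff_with(row))
        coly.append(do_other_stuff_with(row))
    # Assign whole columns at once instead of inserting element by element.
    pdf["colx"] = colx
    pdf["coly"] = coly
    return pdf


def process_csv(csv_path: str, out_pattern: str) -> None:
    """Hypothetical driver: read a CSV lazily, process it, write the result."""
    import dask.dataframe as dd  # imported here so the helpers above need only pandas

    ddf = dd.read_csv(csv_path)
    # process_partition is called once per partition, so both columns are
    # computed together. Dask infers the output schema from a small dummy
    # sample; pass meta=... explicitly if that inference does not fit your data.
    result = ddf.map_partitions(process_partition)
    result.to_csv(out_pattern, index=False)
```

Calling something like process_csv("data.csv", "out-*.csv") would then write one output file per partition. The loop inside process_partition is still slow per row, so if the row logic can be expressed with vectorized pandas operations, that is likely to be faster.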

