Dask: create strictly increasing index

2024/10/6 4:07:43

As is well documented, Dask creates a strictly increasing index on a per-partition basis when reset_index is called, which results in duplicate index values over the whole set. What is the best (i.e. computationally quickest) way to create an index in Dask that is strictly increasing over the whole set, though not necessarily consecutive? I was hoping map_partitions would pass in the partition number, but I don't think it does. Thanks.

EDIT

Thanks @MRocklin, I've got this far, but I need a little assistance on how to recombine my series with the original dataframe.

def create_increasing_index(ddf:dd.DataFrame):
    mps = int(len(ddf) / ddf.npartitions + 1000)
    values = ddf.index.values

    def do(x, max_partition_size, block_id=None):
        length = len(x)
        if length == 0:
            raise ValueError("Does not work with empty partitions. Consider using dask.repartition.")
        start = block_id[0] * max_partition_size
        return da.arange(start, start+length, chunks=1)

    series = values.map_blocks(do, max_partition_size=mps, dtype=np.int64)
    ddf2 = dd.concat([ddf, dd.from_array(series)], axis=1)
    return ddf2

This raises the error "ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1". Is there a better way than using dd.concat? Thanks.

EDIT AGAIN

Actually, for my purposes (and for the amount of data I was testing on, only a few GB) cumsum is fast enough. I'll revisit this when it becomes too slow!

Answer

A rather slow way of accomplishing this would be to create a new column and then use cumsum:

ddf['x'] = 1
ddf['x'] = ddf.x.cumsum()
ddf = ddf.set_index('x', sorted=True)

This is not very slow, but it is not free either.

Given how your question is phrased, I suspect that you just want to create a range for each partition, with the ranges offset from one another by a very large value that you know to be larger than the largest number of rows in any partition. You're right that map_partitions doesn't provide the partition number. You could use one of the two approaches below instead.

  1. Convert to a dask.array (with .values), use the map_blocks method, which does provide a block index, and then convert back to a series with dd.from_array.
  2. Convert to a list of dask.delayed objects, create the delayed series yourself, and then convert back to a dask series with dd.from_delayed.

http://dask.pydata.org/en/latest/delayed-collections.html

