Spark: equivalent of zipWithIndex in DataFrame

2024/10/6 22:12:12

Assume I have the following DataFrame:

dummy_data = [('a',1),('b',25),('c',3),('d',8),('e',1)]
df = sc.parallelize(dummy_data).toDF(['letter','number'])

And I want to create the following DataFrame:

[('a',0),('b',2),('c',1),('d',3),('e',0)]

What I do now is convert it to an RDD, use the zipWithIndex function, and then join the results:

convertDF = (df.select('number')
               .distinct()
               .rdd.zipWithIndex()
               .map(lambda x: (x[0].number, x[1]))
               .toDF(['old', 'new']))

finalDF = (df.join(convertDF, df.number == convertDF.old)
             .select(df.letter, convertDF.new))
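For readers who want to see what this pipeline computes without a Spark cluster, here is a minimal plain-Python sketch of the same three steps (distinct, index, join). It is only an illustration: the names `distinct_numbers`, `old_to_new`, and `final_rows` are made up for this example, and it assumes first-seen order for the distinct values. Spark's distinct() does not promise any particular order, which is likely why the indices in the desired output above look arbitrary.

```python
dummy_data = [('a', 1), ('b', 25), ('c', 3), ('d', 8), ('e', 1)]

# Stand-in for select('number').distinct(): dict.fromkeys keeps first-seen order.
# (Assumption for this sketch; Spark gives no ordering guarantee here.)
distinct_numbers = list(dict.fromkeys(number for _, number in dummy_data))

# Stand-in for zipWithIndex(): pair every distinct value with a 0-based index.
old_to_new = {old: new for new, old in enumerate(distinct_numbers)}

# Stand-in for the join: look up the new index for each row's number.
final_rows = [(letter, old_to_new[number]) for letter, number in dummy_data]

print(final_rows)
# [('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 0)]
```

Rows that share a number ('a' and 'e') get the same index, which is the property the join in the Spark version provides.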

Is there a function similar to zipWithIndex for DataFrames? Is there a more efficient way to do this task?

Answer

See https://issues.apache.org/jira/browse/SPARK-23074, which tracks adding this functionality directly to DataFrames. Upvote that JIRA if you would like to see it in Spark at some point.

In the meantime, here's a workaround in PySpark:

from pyspark.sql.types import LongType, StructField, StructType

def dfZipWithIndex(df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(),
    but on a dataframe, and preserves the schema.

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''
    new_schema = StructType(
        [StructField(colName, LongType(), True)]  # new index field in front
        + df.schema.fields                        # previous schema
    )

    zipped_rdd = df.rdd.zipWithIndex()
    # each element is (row, index); put the shifted index in front of the row values
    new_rdd = zipped_rdd.map(lambda args: ([args[1] + offset] + list(args[0])))
    # assumes an active SparkSession is bound to the name `spark`
    return spark.createDataFrame(new_rdd, new_schema)
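To make the map step in this workaround easier to follow, here is a small plain-Python model of it. It is a sketch only, not part of the answer's code: `zip_with_index_rows` is a hypothetical helper, and an ordinary list of tuples stands in for the RDD of rows.

```python
def zip_with_index_rows(rows, offset=1):
    """Plain-Python model of the zipWithIndex + map step (hypothetical helper)."""
    # rdd.zipWithIndex() yields (element, index) pairs with a 0-based index.
    zipped = [(row, index) for index, row in enumerate(rows)]
    # Same shape as the lambda above: shifted index first, then the row's values.
    return [[index + offset] + list(row) for row, index in zipped]


rows = [('a', 1), ('b', 25), ('c', 3)]
print(zip_with_index_rows(rows))
# [[1, 'a', 1], [2, 'b', 25], [3, 'c', 3]]
print(zip_with_index_rows(rows, offset=0))
# [[0, 'a', 1], [1, 'b', 25], [2, 'c', 3]]
```

With the default offset=1 the index column starts at 1; passing offset=0 keeps the 0-based numbering that zipWithIndex() itself produces.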

This function is also available in the abalon package.

