How to select columns using dynamic select query using window function

2024/9/22 16:50:11

I have sample input dataframe as below, but the value (clm starting with m) columns can be n number.

customer_id|month_id|m1  |m2 |m3 .......m_n
1001       |  01    |10  |20    
1002       |  01    |20  |30    
1003       |  01    |30  |40
1001       |  02    |40  |50    
1002       |  02    |50  |60    
1003       |  02    |60  |70
1001       |  03    |70  |80    
1002       |  03    |80  |90    
1003       |  03    |90  |100

Now, I have to create new columns based on the cummulative sum by grouping on each month. Hence, I have used window function. As, I will have n number of columns instead of withColumn with for loop, I need to create a query or list dynamically and pass it to the selectExpr to calculate the new columns.

For Example:

rownum_window = (Window.partitionBy("partner_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
df = df.select("*", F.sum(col("m1")).over(rownum_window).alias("n1"))

But, I want to prepare a dynamic expression and then I need to pass to the dataframe select. How can I do that?

LIKE: expr = ["F.sum(col("m1")).over(rownum_window).alias("n1")", "F.sum(col("m2")).over(rownum_window).alias("n2")", "F.sum(col("m3")).over(rownum_window).alias("n3")", .......]
df = df.select("*', expr)

Or any other way of dataframe select I can create the select expression?

Output:

customer_id|month_id|m1     |m2    |n1   |n2  
1001       |  01    |10     |20    |10   |20  
1002       |  01    |20     |30    |20   |30  
1003       |  01    |30     |40    |30   |40  
1001       |  02    |40     |50    |50   |70  
1002       |  02    |50     |60    |70   |90
1003       |  02    |60     |70    |90   |110  
1001       |  03    |70     |80    |120  |150
1002       |  03    |80     |90    |150  |180
1003       |  03    |90     |100   |180  |210
Answer

with slight modification to @Lamanus suggestion the below code might be helpful to solve your problem,

# pyspark --driver-memory 1G --executor-memory 2G --executor-cores 1 --num-executors 1
from pyspark.sql import Row
from pyspark.sql.functions import *
from pyspark.sql.window import Windowdrow = Row("customer_id","month_id","m1","m2","m3","m4")
data=[drow("1001","01","10","20","10","20"),drow("1002","01","20","30","20","30"),drow("1003","01","30","40","30","40"),drow("1001","02","40","50","40","50"),drow("1002","02","50","60","50","60"),drow("1003","02","60","70","60","70"),drow("1001","03","70","80","70","80"),drow("1002","03","80","90","80","90"),drow("1003","03","90","100","90","100")]df = spark.createDataFrame(data)
df.show()
'''
+-----------+--------+---+---+---+---+
|customer_id|month_id| m1| m2| m3| m4|
+-----------+--------+---+---+---+---+
|       1001|      01| 10| 20| 10| 20|
|       1002|      01| 20| 30| 20| 30|
|       1003|      01| 30| 40| 30| 40|
|       1001|      02| 40| 50| 40| 50|
|       1002|      02| 50| 60| 50| 60|
|       1003|      02| 60| 70| 60| 70|
|       1001|      03| 70| 80| 70| 80|
|       1002|      03| 80| 90| 80| 90|
|       1003|      03| 90|100| 90|100|
+-----------+--------+---+---+---+---+
'''a = ["m1","m2"]
b = ["m3","m4"]
rownum_window = (Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
expr = ["*",sum(col("m1")).over(rownum_window).alias("sum1"), sum(col("m2")).over(rownum_window).alias("sum2"),avg(col("m3")).over(rownum_window).alias("avg1"), avg(col("m4")).over(rownum_window).alias("avg2") ]
df.select(expr).show()'''
+-----------+--------+---+---+---+---+-----+-----+----+----+
|customer_id|month_id| m1| m2| m3| m4| sum1| sum2|avg1|avg2|
+-----------+--------+---+---+---+---+-----+-----+----+----+
|       1003|      01| 30| 40| 30| 40| 30.0| 40.0|30.0|40.0|
|       1003|      02| 60| 70| 60| 70| 90.0|110.0|45.0|55.0|
|       1003|      03| 90|100| 90|100|180.0|210.0|60.0|70.0|
|       1002|      01| 20| 30| 20| 30| 20.0| 30.0|20.0|30.0|
|       1002|      02| 50| 60| 50| 60| 70.0| 90.0|35.0|45.0|
|       1002|      03| 80| 90| 80| 90|150.0|180.0|50.0|60.0|
|       1001|      01| 10| 20| 10| 20| 10.0| 20.0|10.0|20.0|
|       1001|      02| 40| 50| 40| 50| 50.0| 70.0|25.0|35.0|
|       1001|      03| 70| 80| 70| 80|120.0|150.0|40.0|50.0|
+-----------+--------+---+---+---+---+-----+-----+----+----+
'''
https://en.xdnf.cn/q/119115.html

Related Q&A

Downloading Books from website with python

Im downloading books from the website, and almost my code runs smoothly, but when I try to open the pdf Book on my PC. An error generated by Adobe Acrobat Reader that this is not supported file type.He…

Discord.py How can I make a bot delete messages after a specific amount of time

I have a discord bot that sends images to users when they use the !img command, I dont want people to request an image and then have it sit there until someone deletes it. Is there any way I can make i…

How to encode and decode a column in python pandas?

load = pd.DataFrame({A:list(abcdef),B:[4,5,4,5,5,4],C:[7,8,9,4,2,0],D:[1,3,5,4,2,0],E:[5,3,6,9,2,4],F:list(aaabbb)})How to encode and decode column F.Expected Output:Should have two more columns with e…

Pygame module not found [duplicate]

This question already has answers here:Why do I get a "ModuleNotFoundError" in VS Code despite the fact that I already installed the module?(23 answers)Closed 3 months ago.I have installed p…

Working with Lists and tuples

My data looks like:X=[1,2,3,4]But I need it to look like: Y=[(1,2,3,4)]How does one do this in python?

Football pygame, need help on timer [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.Want to improve this question? Add details and clarify the problem by editing this post.Closed 4 years ago.Improve…

Unable to Install GDAL Using PIP on Python

Im trying to install gdal in python3.8.8 (Windows 10) and im getting below error I have install Visual Studio Build Tools 2019 and reboot my PC Downgrade my Python from 3.9.5 to 3.8.8 C:\Program Files…

Append text to the last line of file with python

First, I use echo hello, >> a.txt to create a new file with one line looks like that. And I know \n is at the last of the line.Then I get some data from python, for example "world", I w…

Python Caesar Cipher [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.Want to improve this question? Update the question so it focuses on one problem only by editing this post.Closed 2…

Indent Expected? [duplicate]

This question already has answers here:Im getting an IndentationError (or a TabError). How do I fix it?(6 answers)Closed 7 months ago.Im sort of new to python and working on a small text adventure its…