I have a sample input dataframe as below, but the number of value columns (the columns starting with m) can vary; there can be n of them.
customer_id|month_id|m1 |m2 |m3 | ... |m_n
1001 | 01 |10 |20
1002 | 01 |20 |30
1003 | 01 |30 |40
1001 | 02 |40 |50
1002 | 02 |50 |60
1003 | 02 |60 |70
1001 | 03 |70 |80
1002 | 03 |80 |90
1003 | 03 |90 |100
Now, I have to create new columns holding the cumulative sum for each customer, ordered by month. Hence, I have used a window function. Since there will be n such columns, instead of calling withColumn in a for loop I need to build the query or expression list dynamically and pass it to select/selectExpr to calculate the new columns.
For Example:
rownum_window = (Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
df = df.select("*", F.sum(F.col("m1")).over(rownum_window).alias("n1"))
But I want to prepare the expressions dynamically and then pass them to the dataframe select. How can I do that?
LIKE: expr = ['F.sum(F.col("m1")).over(rownum_window).alias("n1")', 'F.sum(F.col("m2")).over(rownum_window).alias("n2")', 'F.sum(F.col("m3")).over(rownum_window).alias("n3")', .......]
df = df.select("*", expr)
Or is there any other way to build the select expression for the dataframe select?
Output:
customer_id|month_id|m1 |m2 |n1 |n2
1001 | 01 |10 |20 |10 |20
1002 | 01 |20 |30 |20 |30
1003 | 01 |30 |40 |30 |40
1001 | 02 |40 |50 |50 |70
1002 | 02 |50 |60 |70 |90
1003 | 02 |60 |70 |90 |110
1001 | 03 |70 |80 |120 |150
1002 | 03 |80 |90 |150 |180
1003 | 03 |90 |100 |180 |210
With a slight modification to @Lamanus's suggestion, the code below might be helpful to solve your problem:
# pyspark --driver-memory 1G --executor-memory 2G --executor-cores 1 --num-executors 1
from pyspark.sql import Row
from pyspark.sql.functions import *
from pyspark.sql.window import Window

drow = Row("customer_id","month_id","m1","m2","m3","m4")
data = [drow("1001","01","10","20","10","20"),drow("1002","01","20","30","20","30"),drow("1003","01","30","40","30","40"),drow("1001","02","40","50","40","50"),drow("1002","02","50","60","50","60"),drow("1003","02","60","70","60","70"),drow("1001","03","70","80","70","80"),drow("1002","03","80","90","80","90"),drow("1003","03","90","100","90","100")]
df = spark.createDataFrame(data)
df.show()
'''
+-----------+--------+---+---+---+---+
|customer_id|month_id| m1| m2| m3| m4|
+-----------+--------+---+---+---+---+
| 1001| 01| 10| 20| 10| 20|
| 1002| 01| 20| 30| 20| 30|
| 1003| 01| 30| 40| 30| 40|
| 1001| 02| 40| 50| 40| 50|
| 1002| 02| 50| 60| 50| 60|
| 1003| 02| 60| 70| 60| 70|
| 1001| 03| 70| 80| 70| 80|
| 1002| 03| 80| 90| 80| 90|
| 1003| 03| 90|100| 90|100|
+-----------+--------+---+---+---+---+
'''

a = ["m1","m2"]
b = ["m3","m4"]
rownum_window = (Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
expr = ["*",
        sum(col("m1")).over(rownum_window).alias("sum1"), sum(col("m2")).over(rownum_window).alias("sum2"),
        avg(col("m3")).over(rownum_window).alias("avg1"), avg(col("m4")).over(rownum_window).alias("avg2")]
df.select(expr).show()

'''
+-----------+--------+---+---+---+---+-----+-----+----+----+
|customer_id|month_id| m1| m2| m3| m4| sum1| sum2|avg1|avg2|
+-----------+--------+---+---+---+---+-----+-----+----+----+
| 1003| 01| 30| 40| 30| 40| 30.0| 40.0|30.0|40.0|
| 1003| 02| 60| 70| 60| 70| 90.0|110.0|45.0|55.0|
| 1003| 03| 90|100| 90|100|180.0|210.0|60.0|70.0|
| 1002| 01| 20| 30| 20| 30| 20.0| 30.0|20.0|30.0|
| 1002| 02| 50| 60| 50| 60| 70.0| 90.0|35.0|45.0|
| 1002| 03| 80| 90| 80| 90|150.0|180.0|50.0|60.0|
| 1001| 01| 10| 20| 10| 20| 10.0| 20.0|10.0|20.0|
| 1001| 02| 40| 50| 40| 50| 50.0| 70.0|25.0|35.0|
| 1001| 03| 70| 80| 70| 80|120.0|150.0|40.0|50.0|
+-----------+--------+---+---+---+---+-----+-----+----+----+
'''
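If the m-columns are not known up front, the same expr list can also be built with a list comprehension instead of writing each expression out by hand. A minimal sketch (the mcols name is introduced here just for illustration, the n1/n2/... aliases follow the expected output in the question, and it reuses rownum_window and the wildcard functions import from above):

# Discover the value columns from the schema; note that month_id also starts
# with "m", so it has to be excluded explicitly.
mcols = [c for c in df.columns if c.startswith("m") and c != "month_id"]

# One cumulative-sum expression per value column, aliased n1, n2, ...,
# collected into a list and passed to select() in one go.
expr = ["*"] + [sum(col(c)).over(rownum_window).alias("n" + c[1:]) for c in mcols]
df.select(expr).show()

Passing the list directly to select() works because select accepts a list of column expressions, so there is no need to build selectExpr strings.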