Pyspark filling missing dates by group and filling previous values

2024/11/24 9:04:16

Spark version 3.0.

I have two dataframes.

I create one dataframe with date columns using pandas date range.

I have a 2nd spark dataframe contains the company name, dates and value.

I want to merge the DF2 to DF1 grouping it by company, so I can fill the missing dates, and also fill the missing value from the previous row.

How can i do this? I thought about left join, but it doesn't seem to work well.

enter image description here

Answer

Try this. A bit complex.

import pyspark.sql.functions as f
from pyspark.sql import Windowdf1 = spark.read.option("header","true").option("inferSchema","true").csv("test1.csv") \.withColumn('Date', f.to_date('Date', 'dd/MM/yyyy'))
df2 = spark.read.option("header","true").option("inferSchema","true").csv("test2.csv") \.withColumn('Date', f.to_date('Date', 'dd/MM/yyyy'))w1 = Window.orderBy('Company', 'Date')
w2 = Window.orderBy('Company', 'Date').rowsBetween(Window.unboundedPreceding, Window.currentRow)
w3 = Window.partitionBy('partition').orderBy('Company', 'Date')df1.crossJoin(df2.select('Company').distinct()) \.join(df2, ['Company', 'Date'], 'left') \.withColumn('range', (f.col('Value').isNull() | f.lead(f.col('Value'), 1, 0).over(w1).isNull()) != f.col('Value').isNull()) \.withColumn('partition', f.sum(f.col('range').cast('int')).over(w2)) \.withColumn('fill', f.first('Value').over(w3)) \.orderBy('Company', 'Date') \.selectExpr('Company', 'Date', 'coalesce(Value, fill) as Value') \.show(20, False)+-------+----------+-----+
|Company|Date      |Value|
+-------+----------+-----+
|A      |2000-01-01|13   |
|A      |2000-01-02|14   |
|A      |2000-01-03|15   |
|A      |2000-01-04|19   |
|A      |2000-01-05|19   |
|A      |2000-01-06|19   |
|A      |2000-01-07|19   |
|A      |2000-01-08|19   |
|A      |2000-01-09|19   |
|B      |2000-01-01|19   |
|B      |2000-01-02|19   |
|B      |2000-01-03|20   |
|B      |2000-01-04|25   |
|B      |2000-01-05|23   |
|B      |2000-01-06|24   |
|B      |2000-01-07|24   |
|B      |2000-01-08|24   |
|B      |2000-01-09|24   |
+-------+----------+-----+

You can see what happen for each line by adding .show multiple times that might be helpful.

https://en.xdnf.cn/q/120552.html

Related Q&A

How to loop in the opposite order?

I am a beginner programmer. Here is my code:n = int(input()) from math import* for i in range(n):print(n, "\t", log10(n))i = i + 1n = n - 1Its output is:10 1.0 9 0.9542425094393249 8 …

Override methods with same name in Python programming [duplicate]

This question already has answers here:Closed 12 years ago.Possible Duplicate: How do I use method overloading in Python?I am new to Python programming, and I like to write multiple methods with the …

TypeError: function object is not subscriptable in Python 3.4.3?

I have a food menu and the stock and prices are in separate dictionaries.Food Stock:Food_Stock = {Chips : 15,Bagels : 27,Cookies : 25}#Food Stock.Food Prices:Food_Prices = {#Food Prices.Chips : 1,Bagel…

Insert a value in date format dd-mm-yyyy in dictionary in python

I am creating a dictionary having some values including a date of birth of a person. But when I run my code, it is giving an error "datetime.datetime has no attribute datetime" . Here is my …

Collisions arent registering with Python Turtles

Im creating a simple 2D shooter following an online tutorial, and the enemy sprites (except 1, there are 5 total) are not abiding by my collision code. Everything works except the bullet object that th…

Function should clean data to half the size, instead it enlarges it by an order of magnitude

This has been driving me nuts all week weekend. I am trying merge data for different assets around a common timestamp. Each assets data is a value in dictionary. The data of interest is stored in lists…

is it possible to add colors to python output? [duplicate]

This question already has answers here:How do I print colored text to the terminal?(66 answers)Closed 10 years ago.so i made a small password strength tester for me, my friends and my family, as seen …

Tkinter Function attached to Button executed immediately [duplicate]

This question already has answers here:How can I pass arguments to Tkinter buttons callback command?(2 answers)Closed 8 years ago.What I need is to attach a function to a button that is called with a …

Error!!! cant concatenate the tuple to non float

stack = []closed = []currNode = problem.getStartState()stack.append(currNode)while (len(stack) != 0):node = stack.pop()if problem.isGoalState(node):print "true"closed.append(node)else:child =…

Using R to fit data from a csv with a gamma function?

Using the following data: time_stamp,secs,count 2013-04-30 23:58:55,1367366335,32 2013-04-30 23:58:56,1367366336,281 2013-04-30 23:58:57,1367366337,664 2013-04-30 23:58:58,1367366338,1255 2013-04-30…