I've got data called 'Planned Leave' which includes 'Start Date', 'End Date', 'User ID' and 'Leave Type'.
I want to be able to create a new data-frame which shows all days between Start and End Date, per 'User ID'.
So far, I've only been able to create a date_list which supplies a range of dates between start and end date, but I cannot find a way to include this for each 'User ID' and 'Leave Type'.
Here is my current function:
def datesplit(data):x = pd.DataFrame(columns=['Date'])for i in plannedleave.iterrows():start = data['Start Date'][i]end = data['End Date'][i]date_list = [start + dt.timedelta(days=x) for x in range((end-start).days)]x.append(date_list)return x>>> datesplit(plannedleave)
>>> Value Error: Can only Tuple-index with a MultiIndex
Here's what the data looks like:
>>> plannedleave.dtypes
>>>Employee ID int64First Name objectLast Name objectLeave Type objectStart Date datetime64[ns]End Date datetime64[ns]
dtype: object
I'd be forever grateful if you could find a solution here! :-)
Here are necessary loops, so I prefer DataFrame.itertuples
more like DataFrame.iterrows
for performance in list comprehension:
def datesplit(df):df1 = df.rename(columns={'Start Date':'sdate','End Date':'edate', 'Employee ID':'ID'})return (pd.concat([pd.Series(r.ID,pd.date_range(r.sdate, r.edate)) for r in df1.itertuples()]).rename_axis('Date').reset_index(name='Employee ID'))df = datesplit(plannedleave)
print (df)Date Employee ID
0 2020-05-10 1001
1 2020-05-11 1001
2 2020-05-12 1001
3 2020-05-13 1001
4 2020-05-14 1001
5 2020-05-15 1001
6 2020-05-18 1002
7 2020-05-19 1002
8 2020-05-20 1002
9 2020-05-21 1002
10 2020-05-22 1002
Performance with 200 rows:
plannedleave = pd.concat([plannedleave] * 100, ignore_index=True)def datesplit(df):df1 = df.rename(columns={'Start Date':'sdate','End Date':'edate', 'Employee ID':'ID'})return (pd.concat([pd.Series(r.ID,pd.date_range(r.sdate, r.edate)) for r in df1.itertuples()]).rename_axis('Date').reset_index(name='Employee ID'))def datesplitvb(data):parts = []for idx, row in data.iterrows():parts.append(pd.DataFrame(row['Employee ID'], columns=['Employee ID'],index=pd.date_range(start=row['Start Date'], end=row['End Date'],name='Date')))return pd.concat(parts).reset_index()In [152]: %timeit datesplit(plannedleave.copy())
98.2 ms ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)In [153]: %timeit datesplitvb(plannedleave.copy())
193 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)