Here's the phony data:
df = pd.DataFrame({'email': ['u1','u1','u1','u2','u2','u2'],'timestamp': [3, 1, 5, 11, 15, 9]})
What I intend to retrieve is the time diff in each group of email. Thus, after sorting by timestamp in each group, the data should be:
u1 5
u1 3
u1 1
u2 15
u2 11
u2 9
the result should be:
u1 2 # 5-3
u1 2 # 3-1
u2 4 # 15-11
u2 2 # 11-9
Could anyone tell me what I should do next? Great thanks.
df = pd.DataFrame({'email': ['u1','u1','u1','u2','u2','u2'],'timestamp': [3, 1, 5, 11, 15, 9]})(df.sort_values(['email', 'timestamp'], ascending=[True, False]).groupby('email')['timestamp'].diff(-1).dropna())
Out:
2 2.0
0 2.0
4 4.0
3 2.0
Name: timestamp, dtype: float64
To keep the email column:
df.sort_values(['email', 'timestamp'], ascending=[True, False], inplace=True)
df.assign(diff=df.groupby('email')['timestamp'].diff(-1)).dropna()
Out: email timestamp diff
2 u1 5 2.0
0 u1 3 2.0
4 u2 15 4.0
3 u2 11 2.0
If you don't want the timestamp column you can add .drop('timestamp', axis=1)
at the end.