I am analysing extreme weather events. My DataFrame is called df and looks like this:
| Date | Qm |
|------------|--------------|
| 1993-01-01 | 4881.977061 |
| 1993-02-01 | 4024.396839 |
| 1993-03-01 | 3833.664650 |
| 1993-04-01 | 4981.192526 |
| 1993-05-01 | 6286.879798 |
| 1993-06-01 | 6939.726070 |
| 1993-07-01 | 6492.936065 |
| ... | ... |
I want to know whether the extreme events happened in the same year as a measured outlier. To check this, I made a boxplot using seaborn:
# Qm boxplot analysis
boxplot = sns.boxplot(x=df.index.month, y=df['Qm'])
plt.show()
Now I would like to show, within the same figure, the years corresponding to the outliers, i.e. label each outlier with its date.
I have looked at several libraries that provide boxplots, but found no hint on how to label the outliers.
PS: I used seaborn in this example, but a solution with any library would be highly appreciated.
Thanks!
You could iterate through the dataframe and compare each value against the outlier limits. By default, these limits lie 1.5 times the IQR (interquartile range) below the first quartile and above the third quartile. For each value outside that range, you can plot the year next to it. Feel free to adapt this definition if you would like to display more or fewer years.
Here is some code to illustrate the idea. In the code, the last two digits of the year are shown next to each outlier.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
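import seaborn as sns

# Synthetic example data: Y years of monthly values with a seasonal cycle
Y = 26
# 'MS' = month start, matching the dates in the question
df = pd.DataFrame({'Date': pd.date_range('1993-01-01', periods=12 * Y, freq='MS'),
                   'Qm': np.random.normal(np.tile(5000 + 1000 * np.sin(np.linspace(0, 2 * np.pi, 12)), Y), 1000)})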
df.set_index('Date', inplace=True)
boxplot = sns.boxplot(x=df.index.month, y=df['Qm'])
month_q1 = df.groupby(df.index.month).quantile(0.25)['Qm'].to_numpy()
month_q3 = df.groupby(df.index.month).quantile(0.75)['Qm'].to_numpy()
outlier_top_lim = month_q3 + 1.5 * (month_q3 - month_q1)
outlier_bottom_lim = month_q1 - 1.5 * (month_q3 - month_q1)

# Label every value outside its month's limits with the last two digits of its year
for row in df.itertuples():
    month = row[0].month - 1  # boxes are drawn at x positions 0..11
    val = row.Qm
    if val > outlier_top_lim[month] or val < outlier_bottom_lim[month]:
        plt.text(month, val, f' {row[0].year % 100:02d}', ha='left', va='center')
plt.xlabel('Month')
plt.tight_layout()
plt.show()
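If you also want the outliers as a table (for example, to list the years), the same 1.5 × IQR rule can be applied without an explicit loop. The following is only a sketch: the helper name `find_monthly_outliers` and its parameters `col` and `k` are made up for illustration, and it assumes the dataframe has a DatetimeIndex like the one above. The demo data is hand-made, not real measurements.

```python
import pandas as pd


def find_monthly_outliers(df, col="Qm", k=1.5):
    """Return the rows of df whose `col` value lies outside the k * IQR
    limits of its calendar month. Assumes df has a DatetimeIndex."""
    by_month = df.groupby(df.index.month)[col]
    # Broadcast each month's quartiles back onto the rows of that month
    q1 = by_month.transform(lambda s: s.quantile(0.25))
    q3 = by_month.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    mask = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
    return df[mask]


# Hand-made demo: ten Januaries, one of them clearly extreme
dates = pd.to_datetime([f"{year}-01-01" for year in range(1993, 2003)])
demo = pd.DataFrame({"Qm": [10.0, 11, 12, 13, 14, 15, 16, 17, 18, 100]}, index=dates)
outliers = find_monthly_outliers(demo)
print(outliers)
```

Each returned row keeps its date in the index, so it can be passed to plt.text(date.month - 1, value, ...) in the same way as in the loop, or simply inspected with outliers.index.year.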