Label outliers in a boxplot - Python
I am analysing extreme weather events. My DataFrame is called df and looks like this:

|    Date    |      Qm      |
|------------|--------------|                                              
| 1993-01-01 |  4881.977061 |
| 1993-02-01 |  4024.396839 |
| 1993-03-01 |  3833.664650 |
| 1993-04-01 |  4981.192526 |
| 1993-05-01 |  6286.879798 |  
| 1993-06-01 |  6939.726070 |
| 1993-07-01 |  6492.936065 |
|    ...     |      ...     |

I want to know whether the extreme events happened in the same year as a measured outlier. So I drew a boxplot with seaborn:

# Qm boxplot analysis

boxplot = sns.boxplot(x=df.index.month,y=df['Qm'])
plt.show()

Boxplot obtained

Now I would like to show, in the same figure, the years that correspond to the outliers, that is, to label each outlier with its date.

I have looked at several libraries that provide boxplots, but found no hint on how to label the outliers.

PS: I used seaborn in this example, but a solution with any library would be highly appreciated.

Thanks!

Airs answered 11/5, 2020 at 16:15 Comment(0)
B
11

You could iterate through the dataframe and compare each value against the outlier limits. By default, these limits lie 1.5 times the IQR below the lower quartile and above the upper quartile. For each value outside that range, you can plot the year next to it. Feel free to adapt this definition if you would like to display more or fewer years.

Here is some code to illustrate the idea. In the code, the last two digits of the year are shown next to the position of each outlier.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic monthly data standing in for the real measurements
Y = 26  # number of years
# freq='MS' gives month-start dates, as in the question (the 'M' alias is deprecated in recent pandas)
df = pd.DataFrame({'Date': pd.date_range('1993-01-01', periods=12 * Y, freq='MS'),
                   'Qm': np.random.normal(np.tile(5000 + 1000 * np.sin(np.linspace(0, 2 * np.pi, 12)), Y), 1000)})
df.set_index('Date', inplace=True)
boxplot = sns.boxplot(x=df.index.month, y=df['Qm'])

# Per-month quartiles and the default 1.5 * IQR outlier limits
month_q1 = df.groupby(df.index.month).quantile(0.25)['Qm'].to_numpy()
month_q3 = df.groupby(df.index.month).quantile(0.75)['Qm'].to_numpy()
outlier_top_lim = month_q3 + 1.5 * (month_q3 - month_q1)
outlier_bottom_lim = month_q1 - 1.5 * (month_q3 - month_q1)

# Write the two-digit year next to every value outside the limits
for row in df.itertuples():
    month = row[0].month - 1  # the box for month m is drawn at x = m - 1
    val = row.Qm
    if val > outlier_top_lim[month] or val < outlier_bottom_lim[month]:
        plt.text(month, val, f' {row[0].year % 100:02d}', ha='left', va='center')
plt.xlabel('Month')
plt.tight_layout()
plt.show()

sample plot
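
The row-by-row comparison above can also be written without an explicit loop. The following is only a sketch of that variant, not part of the original approach: the helper name flag_monthly_outliers, its whis parameter and the synthetic data (including the planted July 1993 value) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def flag_monthly_outliers(series, whis=1.5):
    """Return a boolean Series marking values beyond whis * IQR of their calendar month."""
    by_month = series.groupby(series.index.month)
    # transform broadcasts each month's quartile back onto every row of that month
    q1 = by_month.transform(lambda s: s.quantile(0.25))
    q3 = by_month.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (series < q1 - whis * iqr) | (series > q3 + whis * iqr)


# Synthetic monthly data (an assumption, standing in for the real Qm column)
rng = np.random.default_rng(0)
dates = pd.date_range('1993-01-01', periods=12 * 26, freq='MS')
qm = pd.Series(rng.normal(5000, 1000, len(dates)), index=dates, name='Qm')
qm.iloc[6] = 20000.0  # plant one obvious outlier in July 1993

outliers = qm[flag_monthly_outliers(qm)]
print(outliers)
```

The resulting Series keeps the original dates as its index, so the year of each outlier is available as outliers.index.year when writing the labels.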

Bohon answered 11/5, 2020 at 17:35 Comment(0)
R
0

I don't know of a way to hand labels to seaborn.boxplot or pandas.DataFrame.boxplot together with your data. As a workaround, you could annotate your plot manually with matplotlib's annotate function.

Here is an example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)
df = pd.DataFrame( np.random.randn(50, 2), columns=['Col1', 'Col2'])
boxplot = df.boxplot(
    column=['Col1', 'Col2'],
    flierprops=dict(markerfacecolor='r', marker='s', label='not shown'))
boxplot.annotate(
    '1993',
    (1, -2.65),
    xytext=(0.3, 0.15),
    textcoords='axes fraction',
    arrowprops=dict(facecolor='black', arrowstyle='wedge'),
    fontsize=11)
plt.show()

The resulting plot:

annotate example
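
The coordinates in the example above are typed in by hand. As a rough sketch of how the same annotate call could be driven by the data, the code below finds the per-month outliers with the 1.5 * IQR rule and labels each one with its year. The helper annotate_outlier_years, the synthetic data and the planted March 2001 value are assumptions for illustration. It uses matplotlib's own boxplot, which by default draws the boxes at x = 1..12, so the month number can serve directly as the x coordinate.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def annotate_outlier_years(ax, series, whis=1.5):
    """Annotate each per-month outlier in `series` with its year; return the annotations."""
    annotations = []
    for month, values in series.groupby(series.index.month):
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        low, high = q1 - whis * (q3 - q1), q3 + whis * (q3 - q1)
        for date, val in values[(values < low) | (values > high)].items():
            annotations.append(ax.annotate(
                str(date.year),
                xy=(month, val),            # box for month m sits at x = m
                xytext=(8, 0),              # shift the label 8 points to the right
                textcoords='offset points',
                va='center', fontsize=8))
    return annotations


# Synthetic monthly data (an assumption, standing in for the real Qm column)
rng = np.random.default_rng(1)
dates = pd.date_range('1993-01-01', periods=12 * 26, freq='MS')
qm = pd.Series(rng.normal(5000, 1000, len(dates)), index=dates, name='Qm')
qm.loc[pd.Timestamp('2001-03-01')] = 15000.0  # planted outlier

fig, ax = plt.subplots()
groups = [g.to_numpy() for _, g in qm.groupby(qm.index.month)]
ax.boxplot(groups)  # one box per calendar month at positions 1..12
ax.set_xlabel('Month')
ax.set_ylabel('Qm')
labels = annotate_outlier_years(ax, qm)
fig.tight_layout()
print(len(labels), 'outliers labelled')
```

Call plt.show() or fig.savefig(...) afterwards to view the figure. With seaborn's categorical boxplot the boxes are drawn at 0-based positions, so the x coordinate would be month - 1 instead.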

Roadway answered 11/5, 2020 at 17:26 Comment(0)
