Grouping boxplots in seaborn when input is a DataFrame
P

5

26

I intend to plot multiple columns of a pandas DataFrame, all grouped by another column, using the groupby option of seaborn.boxplot. There is a nice answer to a similar problem in matplotlib (matplotlib: Group boxplots), but since seaborn.boxplot comes with a groupby option, I thought this could be much easier to do in seaborn.

Here we go with a reproducible example that fails:

import seaborn as sns
import pandas as pd
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

# display(df)
   a1  a2  a3  a4  b
0   2   4   5   6  1
1   4   5   6   7  2
2   5   4   5   5  1
3  10   4   7   8  2
4   9   3   4   6  2
5   3   3   4   4  1

#Plotting by seaborn
sns.boxplot(df[['a1','a2', 'a3', 'a4']], groupby=df.b)

What I get is something that completely ignores groupby option:

[Image: failed groupby]

Whereas if I do this with one column it works thanks to another SO question Seaborn groupby pandas Series :

sns.boxplot(df.a1, groupby=df.b)

[Image: seaborn that does not fail]

So I would like to get all my columns in one plot (all columns come in a similar scale).

EDIT:

The above SO question was edited and now includes a 'not clean' answer to this problem, but it would be nice if someone has a better idea for this problem.

Perrie answered 13/8, 2014 at 11:24 Comment(0)
T
11

You can directly use sns.boxplot, an axes-level function, or sns.catplot with kind='box', a figure-level function. See Figure-level vs. axes-level functions for further details.

sns.catplot has the col and row parameters, which are used to create subplots / facets from another variable.

The default palette is determined by the type of variable, continuous (numeric) or categorical, passed to hue.

As explained by @mwaskom, you have to melt the sample dataframe into its "long-form" where each column is a variable and each row is an observation.

Tested in python 3.12.0, pandas 2.1.2, matplotlib 3.8.1, seaborn 0.13.0

df_long = pd.melt(df, "b", var_name="a", value_name="c")

# display(df_long.head())
   b   a   c
0  1  a1   2
1  2  a1   4
2  1  a1   5
3  2  a1  10
4  2  a1   9
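
To make the reshaping step concrete, here is a minimal sketch that uses the question's sample data and counts how many observations end up behind each box. The variable name counts is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

# Wide form has one column per measurement; long form has one row per
# (original row, measurement column) pair.
df_long = pd.melt(df, "b", var_name="a", value_name="c")
print(df.shape, "->", df_long.shape)

# Each box in the grouped plot is drawn from one (a, b) combination.
counts = df_long.groupby(["a", "b"]).size()
print(counts)
```

With this sample every box is built from only three observations, so the plot is illustrative rather than statistically meaningful.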

sns.boxplot

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 5))
sns.boxplot(x="a", hue="b", y="c", data=df_long, ax=ax)
ax.spines[['top', 'right']].set_visible(False)
# place the legend outside the axes, to the right
sns.move_legend(ax, bbox_to_anchor=(1, 0.5), loc='center left', frameon=False)

sns.catplot

Create the same plot as sns.boxplot with fewer lines of code.

g = sns.catplot(kind='box', data=df_long, x='a', y='c', hue='b', height=5, aspect=1)

Resulting Plot

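
The col parameter mentioned earlier can be sketched as follows: instead of colouring by b, each value of b gets its own facet. This is a minimal sketch on the question's sample data; the Agg backend and the height value are assumptions made so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import pandas as pd
import seaborn as sns

df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])
df_long = pd.melt(df, "b", var_name="a", value_name="c")

# One facet (subplot) per value of b, laid out as columns of the grid.
g = sns.catplot(kind="box", data=df_long, x="a", y="c", col="b", height=4, aspect=1)
print(g.axes.shape)
```

Faceting trades the side-by-side comparison within each a-column for a cleaner view of each group on its own.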

Treachery answered 7/6, 2019 at 17:36 Comment(0)
N
27

As the other answers note, the boxplot function is limited to plotting a single "layer" of boxplots, and the groupby parameter only has an effect when the input is a Series and you have a second variable you want to use to bin the observations into each box.

However, you can accomplish what I think you're hoping for with the factorplot function (renamed catplot in seaborn 0.9), using kind="box". First, though, you'll have to "melt" the sample dataframe into what is called long-form or "tidy" format, where each column is a variable and each row is an observation:

df_long = pd.melt(df, "b", var_name="a", value_name="c")

Then it's very simple to plot:

# factorplot in the original answer; catplot is its current name
sns.catplot(x="a", hue="b", y="c", data=df_long, kind="box")


Ninebark answered 13/8, 2014 at 14:39 Comment(1)
This gets occasional upvotes, but FWIW nested boxplots have been possible in sns.boxplot since 0.6. - Ninebark
A
8

Seaborn's groupby parameter takes a Series, not a DataFrame; that's why it's not working.

As a workaround, you can do this:

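import matplotlib.pyplot as plt

# one subplot per value of b; groupby yields (key, sub-frame) tuples
fig, ax = plt.subplots(1, 2, sharey=True)
for i, (key, grp) in enumerate(df.filter(regex="a").groupby(by=df.b)):
    sns.boxplot(data=grp, ax=ax[i])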

This gives one subplot per value of b.

Note that df.filter(regex="a") is equivalent to df[['a1','a2', 'a3', 'a4']]

   a1  a2  a3  a4
0   2   4   5   6
1   4   5   6   7
2   5   4   5   5
3  10   4   7   8
4   9   3   4   6
5   3   3   4   4
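
As a quick check of that equivalence, the sketch below compares the two selections on the question's sample data and shows what the groupby loop iterates over. The stricter pattern ^a\d+$ is an addition of mine, not part of the original answer; it may be useful when other column names also contain the letter "a".

```python
import pandas as pd

df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

selected = df.filter(regex="a")
same = selected.equals(df[['a1', 'a2', 'a3', 'a4']])
print(same)

# regex="a" keeps every column whose name contains "a"; an anchored
# pattern only keeps names of the form a<digits>.
strict = df.filter(regex=r"^a\d+$")

# groupby yields (key, sub-frame) pairs, one per distinct value of b.
group_shapes = {key: sub.shape for key, sub in selected.groupby(by=df["b"])}
print(group_shapes)
```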

Hope this helps

Abrupt answered 13/8, 2014 at 14:16 Comment(0)
P
5

It isn't really any better than the answer you linked, but I think the way to achieve this in seaborn is using the FacetGrid feature, as the groupby parameter is only defined for Series passed to the boxplot function.

Here's some code - the pd.melt is necessary because (as best I can tell) the facet mapping can only take individual columns as parameters, so the data need to be turned into a 'long' format.

g = sns.FacetGrid(pd.melt(df, id_vars='b'), col='b')
g.map(sns.boxplot, 'value', 'variable')
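
The names 'value' and 'variable' passed to g.map are the default column names that pd.melt produces when var_name and value_name are not given. A minimal sketch on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

# Without var_name / value_name, melt falls back to 'variable' and 'value'.
long_default = pd.melt(df, id_vars='b')
print(list(long_default.columns))  # ['b', 'variable', 'value']
```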

[Image: faceted seaborn boxplot]

Pale answered 13/8, 2014 at 11:55 Comment(1)
It's actually not necessary to use FacetGrid directly if you want this kind of plot; you can use factorplot here too with col='b'. (This isn't wrong, it's just more work than necessary.) - Ninebark
D
1

It's not adding a lot to this conversation, but after struggling with this for longer than warranted (the actual clusters are unusable), I thought I would add my implementation as another example. It's got a superimposed scatterplot (because of how annoying my dataset is), shows melt using indices, and some aesthetic tweaks. I hope this is useful for someone.

[Image: output_graph]

Here it is without using column headers (I saw a different thread that wanted to know how to do this using indices):

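import numpy as np
import pandas as pd
from pandas import DataFrame
from typing import Dict, List

combined_array: np.ndarray = np.concatenate([dbscan_output.data, dbscan_output.labels.reshape(-1, 1)], axis=1)
cluster_data_df: DataFrame = DataFrame(combined_array)

# if you want to use labelled columns:
column_names: List[str] = list(outcome_variable_names)
column_names.append('cluster')
# set_axis(..., inplace=True) was removed in pandas 2.0, so reassign the result instead
cluster_data_df = cluster_data_df.set_axis(column_names, axis='columns')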

graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=['cluster'],
    # value_vars is optional; by default melt uses every column except id_vars. Included here as an example:
    # value_vars=['outcome_var_1', 'outcome_var_2', 'outcome_var_3', 'outcome_var_4', 'outcome_var_5', 'outcome_var_6'],
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)

The resulting dataframe (rows = sample_n x variable_n (in my case 1626 x 6 = 9756)):

index  cluster  psychometric_test  standard deviations from the mean
0          0.0      outcome_var_1                          -1.276182
1          0.0      outcome_var_1                          -1.118813
2          0.0      outcome_var_1                          -1.276182
...
9754       0.0      outcome_var_6                           0.892548
9755       0.0      outcome_var_6                           1.420480

If you want to use indices with melt:

graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=cluster_data_df.columns[-1],
    # value_vars=cluster_data_df.columns[:-1],
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)
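
Since dbscan_output is not shown, here is a minimal, self-contained sketch of the same position-based melt on a small made-up array. The array values and the column name score are hypothetical stand-ins, not the author's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the clustering output:
# two score columns followed by one cluster-label column.
combined = np.array([[0.1, 0.2, 0.0],
                     [0.3, 0.4, 1.0],
                     [0.5, 0.6, 0.0]])
frame = pd.DataFrame(combined)  # column labels are the integers 0, 1, 2

last_col = frame.columns[-1]    # pick the id column by position, not by name
long_df = pd.melt(
    frame,
    id_vars=[last_col],
    var_name="psychometric_test",
    value_name="score",
).rename(columns={last_col: "cluster"})

print(long_df)
```

Wrapping the label in a list keeps id_vars unambiguous, and the rename gives the id column a readable name for plotting.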

And here's the graphing code, done with column headings. Note that the y-axis is value_name, the x-axis is var_name, and hue is id_vars:

# plot graph grouped by cluster
# style and font scale are seaborn theme settings, not Figure methods
sns.set_theme(style="white", font_scale=1.2)
fig = plt.figure(figsize=(10, 10))

# create boxplot
fig.ax = sns.boxplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', showfliers=False,
                     data=graph_data)

# set box alpha (matplotlib >= 3.5 stores the boxes in ax.patches; older versions used ax.artists):
for patch in fig.ax.patches:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .2))

# create scatterplot
fig.ax = sns.stripplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', data=graph_data,
                       dodge=True, alpha=.25, zorder=1)

# customise legend:
cluster_n: int = dbscan_output.n_clusters
## create list with legend text
i = 0
cluster_info: Dict[int, int] = dbscan_output.cluster_sizes  # custom method
legend_labels: List[str] = []
while i < cluster_n:
    label: str = f"cluster {i+1}, n = {cluster_info[i]}"
    legend_labels.append(label)
    i += 1
if -1 in cluster_info.keys():
    cluster_n += 1
    label: str = f"Unclustered, n = {cluster_info[-1]}"
    legend_labels.insert(0, label)

## fetch existing handles and legends (each tuple will have 2*cluster number -> 1 for each boxplot cluster, 1 for each scatterplot cluster, so I will remove the first half)
handles, labels = fig.ax.get_legend_handles_labels()
index: int = int(cluster_n*(-1))
labels = legend_labels
plt.legend(handles[index:], labels[0:])
plt.xticks(rotation=45)
plt.show()
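
The legend-label loop above can also be written more compactly. This is only a sketch: cluster_sizes below is a made-up dictionary standing in for dbscan_output.cluster_sizes, with -1 as the key for unclustered points, as in the code above.

```python
from typing import Dict, List

# Hypothetical cluster sizes keyed by cluster label; -1 marks unclustered points.
cluster_sizes: Dict[int, int] = {-1: 4, 0: 10, 1: 7}

# One label per real cluster, numbered from 1 for display.
legend_labels: List[str] = [
    f"cluster {label + 1}, n = {size}"
    for label, size in sorted(cluster_sizes.items())
    if label != -1
]

# The unclustered group, if present, goes first.
if -1 in cluster_sizes:
    legend_labels.insert(0, f"Unclustered, n = {cluster_sizes[-1]}")

print(legend_labels)
```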


Just a note: Most of my time was spent debugging the melt function. I predominantly got the error "*only integer scalar arrays can be converted to a scalar index with 1D numpy indices array*". My output required me to concatenate my outcome variable value table and the clusters (DBSCAN), and I'd put extra square brackets around the cluster array in the concat method. So I had a column where each value was an invisible List[int], rather than a plain int. It's pretty niche, but maybe it'll help someone.
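
One way to catch this kind of shape problem early is to check the array shapes before building the DataFrame. The sketch below uses made-up data and does not reproduce the exact error above; it only shows how an extra pair of brackets changes the shape of the label array.

```python
import numpy as np

# Hypothetical outcome values (3 samples x 2 variables) and cluster labels.
data = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
labels = np.array([0, 1, 0])

# Intended: labels as a single column, shape (3, 1).
column = labels.reshape(-1, 1)
combined = np.concatenate([data, column], axis=1)

# Extra brackets turn the labels into a single row instead, shape (1, 3).
wrapped = np.array([labels])

print(column.shape, wrapped.shape, combined.shape)

# Guard: one label per sample, exactly one extra column.
assert combined.shape == (data.shape[0], data.shape[1] + 1)
```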

Discourage answered 22/9, 2021 at 14:57 Comment(0)
