With `pandas.cut()`, how do I get integer bins and avoid getting a negative lowest bound?

Asked 13/9, 2015 at 16:42 Answered 17/4, 2024 at 2:19

My dataframe has zero as the lowest value. I am trying to use the precision and include_lowest parameters of pandas.cut(), but I can't get the intervals to consist of integers rather than floats with one decimal. I also can't get the leftmost interval to stop at zero.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style='white', font_scale=1.3)

df = pd.DataFrame(range(0,389,8)[:-1], columns=['value'])
df['binned_df_pd'] = pd.cut(df.value, bins=7, precision=0, include_lowest=True)
sns.pointplot(x='binned_df_pd', y='value', data=df)
plt.xticks(rotation=30, ha='right')

I have tried setting precision to -1, 0 and 1, but they all output one decimal floats. The pandas.cut() help does mention that the x-min and x-max values are extended with 0.1 % of the x-range, but I thought maybe include_lowest could suppress this behaviour somehow. My current workaround involves importing numpy:

import numpy as np

bin_counts, edges = np.histogram(df.value, bins=7)
edges = [int(x) for x in edges]
df['binned_df_np'] = pd.cut(df.value, bins=edges, include_lowest=True)

sns.pointplot(x='binned_df_np', y='value', data=df)
plt.xticks(rotation=30, ha='right')

Is there a way to obtain non-negative integers as the interval boundaries directly with pandas.cut() without using numpy?

Edit: I just noticed that specifying right=False makes the lowest interval shift to 0 rather than -0.4. It seems to take precedence over include_lowest, as changing the latter does not have any visible effect in combination with right=False. The following intervals are still specified with one decimal point.
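The edge adjustment described above can be inspected directly with retbins=True. The following is a minimal sketch, assuming the same data as in the question (values 0 to 376 in steps of 8); the variable names are illustrative only. It suggests that pd.cut() widens the lowest edge when right=True and the highest edge when right=False, in both cases by 0.1% of the data range.

```python
import pandas as pd

# Same values as in the question: 0, 8, ..., 376
values = pd.Series(range(0, 377, 8), name='value')

# right=True (default): the lowest edge is moved below the minimum
_, edges_right = pd.cut(values, bins=7, retbins=True)

# right=False: the highest edge is moved above the maximum instead
_, edges_left = pd.cut(values, bins=7, right=False, retbins=True)

margin = 0.001 * (values.max() - values.min())  # 0.1% of the range
print(edges_right[0], edges_right[-1])
print(edges_left[0], edges_left[-1])
```

This is why right=False makes the lowest interval start at 0: the margin is applied at the other end of the range.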

Forme answered 13/9, 2015 at 16:42 Comment(1)

A proposal to fix this behavior: github.com/pandas-dev/pandas/issues/47996 – Breeding 7/8, 2022 at 23:37

None of the other answers (including OP's np.histogram workaround) seem to work anymore. They have upvotes, so I'm not sure if something has changed over the years.

IntervalIndex requires all intervals to be closed identically, so [0, 53] cannot coexist with (322, 376].
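A small sketch of that restriction, using two hand-written interval lists (the names same_closed and mixed_closed are made up for illustration). In the pandas versions I am aware of, building an IntervalIndex from intervals with different closed sides raises a ValueError.

```python
import pandas as pd

# All intervals closed on the right: accepted
same_closed = [pd.Interval(0, 53, closed='right'),
               pd.Interval(53, 107, closed='right')]

# One interval closed on both sides, one only on the right: rejected
mixed_closed = [pd.Interval(0, 53, closed='both'),
                pd.Interval(322, 376, closed='right')]

print(pd.IntervalIndex(same_closed).closed)

try:
    pd.IntervalIndex(mixed_closed)
    mixed_accepted = True
except ValueError as err:
    mixed_accepted = False
    print(err)
```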

Here are two working solutions based on the relabeling approach:

Without numpy, reuse pd.cut edges as pd.cut labels

bins = 7

_, edges = pd.cut(df.value, bins=bins, retbins=True)
labels = [f'({abs(edges[i]):.0f}, {edges[i+1]:.0f}]' for i in range(bins)]

df['bin'] = pd.cut(df.value, bins=bins, labels=labels)

#     value         bin
# 1       8     (0, 54]
# 2      16     (0, 54]
# ..    ...         ...
# 45    360  (322, 376]
# 46    368  (322, 376]

With numpy, convert np.linspace edges into pd.cut labels

bins = 7

edges = np.linspace(df.value.min(), df.value.max(), bins+1).astype(int)
labels = [f'({edges[i]}, {edges[i+1]}]' for i in range(bins)]

df['bin'] = pd.cut(df.value, bins=bins, labels=labels)

#     value         bin
# 1       8     (0, 53]
# 2      16     (0, 53]
# ..    ...         ...
# 45    360  (322, 376]
# 46    368  (322, 376]

Note: Only the labels are changed, so the underlying binning will still occur with 0.1% margins.
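One way to check this note is to compare the category codes with and without custom labels. This is a sketch under the assumption that the data matches the question (0 to 376 in steps of 8); plain and relabeled are illustrative names.

```python
import pandas as pd

values = pd.Series(range(0, 377, 8), name='value')
bins = 7

# Default binning: categories are pandas Interval objects
plain, edges = pd.cut(values, bins=bins, retbins=True)

# Relabeled binning: categories are plain strings
labels = [f'({abs(edges[i]):.0f}, {edges[i+1]:.0f}]' for i in range(bins)]
relabeled = pd.cut(values, bins=bins, labels=labels)

# Every value lands in the same bin position either way
same_assignment = (plain.cat.codes == relabeled.cat.codes).all()
print(same_assignment)
```

The labels are cosmetic, so a label such as (0, ...] does not imply that the underlying bin actually starts at exactly 0.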

pointplot() output (as of pandas 1.2.4):

sns.pointplot(x='bin', y='value', data=df)
plt.xticks(rotation=30, ha='right')

Reset answered 6/6, 2021 at 15:26 Comment(0)

You should explicitly set the labels argument.

preparations:

lower, higher = df['value'].min(), df['value'].max()
n_bins = 7

build up the labels:

# range() rejects a float step in Python 3, so compute the n_bins + 1 = 8 integer edges directly
edges = [int(lower + i * (higher - lower) // n_bins) for i in range(n_bins + 1)]
lbs = ['(%d, %d]' % (edges[i], edges[i+1]) for i in range(len(edges)-1)]

set labels:

df['binned_df_pd'] = pd.cut(df.value, bins=n_bins, labels=lbs, include_lowest=True)

Endocrinology answered 7/3, 2018 at 10:43 Comment(1)

Will this work if (higher-lower)/n_bins is not an integer? – Rogers 11/1, 2021 at 19:55

@joelostblom, you did most of the work already. Instead of using numpy, just use what pandas already provides: pd.cut() can return the bin edges via retbins=True.

_, edges = pd.cut(df.value, bins=7, retbins=True)
edges = [int(x) for x in edges]
df['binned_df_np'] = pd.cut(df.value, bins=edges, include_lowest=True)
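To see what the integer conversion does to the edges, here is a short sketch assuming the question's data (0 to 376 in steps of 8). int() truncates toward zero, so the slightly negative first edge becomes 0. The names float_edges and int_edges are illustrative.

```python
import pandas as pd

values = pd.Series(range(0, 377, 8), name='value')

# Edges as returned by pandas; the first one is slightly below zero
_, float_edges = pd.cut(values, bins=7, retbins=True)

# int() truncates toward zero, which drops the negative margin
int_edges = [int(x) for x in float_edges]

print(float_edges[0])
print(int_edges)
```

How the first interval is displayed after passing these edges back with include_lowest=True may depend on the pandas version, as another answer here points out.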

Cuirbouilli answered 7/3, 2021 at 2:51 Comment(0)

You can have closed integer intervals as well. Let nbins = 7.

Find edges to cut (Pandas or Numpy).

# NumPy
edges = np.linspace(df.value.min(), df.value.max(), nbins + 1)
edges[-1] += 1

# Pandas
float_binned, edges = pd.cut(df.value, bins=nbins, right=False, retbins=True)
edges[-1] = df.value.max() + 1

For your data, this is: [ 0. , 53.71, 107.43, 161.14, 214.86, 268.57, 322.29, 377. ]

Make closed integer intervals from edges.

edges = edges.round()  # optional, for more uniform length of intervals
intervals = [pd.Interval(int(left), int(right) - 1, 'both')
             for left, right in zip(edges[:-1], edges[1:])]

For your data, this is:

[[0, 53], [54, 106], [107, 160], [161, 214], [215, 268], [269, 321], [322, 376]]

Cut data using the intervals.

int_binned = pd.cut(df.value, pd.IntervalIndex(intervals))

For your data, this is:

0        [0, 53]
1        [0, 53]
2        [0, 53]
...
45    [322, 376]
46    [322, 376]
47    [322, 376]
Name: value, dtype: category
Categories (7, interval[int64, both]): [[0, 53] < [54, 106] < [107, 160] < [161, 214] < [215, 268] < [269, 321] < [322, 376]]
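One caveat worth checking: closed integer intervals such as [0, 53] and [54, 106] leave a gap between 53 and 54. That is harmless for integer data like yours, but a non-integer value in the gap would not be assigned to any bin. A minimal sketch, where samples is hypothetical data that is not part of the question:

```python
import pandas as pd

# Two of the closed integer intervals from above
intervals = pd.IntervalIndex([pd.Interval(0, 53, closed='both'),
                              pd.Interval(54, 106, closed='both')])

# Hypothetical non-integer data; 53.5 falls between the two intervals
samples = pd.Series([53, 53.5, 54])
binned = pd.cut(samples, intervals)

# Values outside every interval come back as NaN
print(binned)
```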

Then you can make your plot:

df['binned_value'] = int_binned
sns.pointplot(x='binned_value', y='value', data=df)
plt.xticks(rotation=30, ha='right')

Zoospore answered 17/4, 2024 at 2:19 Comment(0)
