pandas: using qcut on a Series with fewer values than quantiles

Asked 18/5, 2017 at 14:33 Answered 2/5, 2022 at 18:21

I have thousands of Series (rows of a DataFrame) that I need to apply qcut to. Occasionally a Series (row) has fewer non-null values than the desired number of quantiles (say, 1 value vs. 2 quantiles):

>>> s = pd.Series([5, np.nan, np.nan])

When I apply .quantile() to it, it has no problem producing 2 quantiles (both with the same boundary value):

>>> s.quantile([0.5, 1])
0.5    5.0
1.0    5.0
dtype: float64

But when I call pd.qcut() with an integer number of quantiles, an error is thrown:

>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5.,  5.,  5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

Even after I set the duplicates argument, it still fails:

>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0

How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)

The desired output is to have the 5.0 assigned to a single bin, with the NaNs preserved:

0     (4.999, 5.000]
1                NaN
2                NaN

Motile answered 18/5, 2017 at 14:33 Comment(5)

What's your desired output for pd.qcut(s, 2)? You only have 1 unique value and why do you want to create more than 1 bins? – Pernell 18/5, 2017 at 18:44

I'm extracting a very specific case to address. In reality I have thousands of Series, all of which I need to cut, but qcut() runs into a problem with an outlier row like this. I modified the question with the desired output. – Motile 19/5, 2017 at 14:43

Surround the qcut call with a try-except block to catch the faulty Series (be specific enough to catch only the ones that are too short) and deal with those semi-manually. – Dunkle 19/5, 2017 at 14:52
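A minimal sketch of that try-except idea, assuming the Series are the rows of a DataFrame. The helper name `qcut_rows` is hypothetical, and the filter relies on the "Bin edges must be unique" message shown in the question; any other ValueError is re-raised.

```python
import numpy as np
import pandas as pd


def qcut_rows(df, q):
    """Apply pd.qcut to each row; collect the labels of rows that fail.

    Hypothetical helper for illustration, not part of pandas.
    """
    binned = {}   # row label -> categorical Series returned by qcut
    failed = []   # row labels whose bin edges were not unique
    for label, row in df.iterrows():
        try:
            binned[label] = pd.qcut(row, q)
        except ValueError as exc:
            # Only swallow the duplicate-edges error from the question.
            if "Bin edges must be unique" not in str(exc):
                raise
            failed.append(label)
    return binned, failed


df = pd.DataFrame([[1, 2, 3, 4], [5, np.nan, np.nan, np.nan]])
binned, failed = qcut_rows(df, 2)
print(failed)  # labels of the rows left for semi-manual handling
```

The rows listed in `failed` can then be inspected or binned separately, which keeps the special cases out of the main loop.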

Did you manage to resolve this? I am getting the same error and can't find a solution. – Strongminded 15/2, 2018 at 21:34

No, no solution is known to the original problem as of 2/21/2018 – Motile 21/2, 2018 at 20:57

OK, here is a workaround that might work for you: use the number of non-null values as the number of quantiles.

pd.qcut(s, len(s.dropna()), duplicates='drop')
Out[655]: 
0    (4.999, 5.0]
1             NaN
2             NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
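For running over many rows, a more defensive variant could be sketched as below. This is only an illustration: the helper name `safe_qcut` and the 0.001 bin width are assumptions (the width is chosen to mirror the (4.999, 5.0] interval in the desired output), and it falls back to pd.cut with explicit edges when only one distinct value is present, instead of relying on how qcut handles duplicate edges.

```python
import numpy as np
import pandas as pd


def safe_qcut(s, q, width=0.001):
    """qcut with a fallback for Series that have at most one distinct value.

    Hypothetical helper; `width` is an assumed size for the single bin.
    """
    values = s.dropna()
    if values.empty:
        # Nothing to bin: return all-NaN with the original index.
        return pd.Series(np.nan, index=s.index)
    if values.nunique() == 1:
        v = float(values.iloc[0])
        # One right-closed bin (v - width, v] that contains the single value.
        return pd.cut(s, bins=[v - width, v])
    return pd.qcut(s, q, duplicates="drop")


s = pd.Series([5, np.nan, np.nan])
result = safe_qcut(s, 2)
print(result)
```

The NaN entries stay NaN because pd.cut leaves missing values unbinned.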

Pernell answered 20/5, 2017 at 8:49 Comment(1)

Adding duplicates='drop' did it for me. – Ellmyer 8/2, 2023 at 14:29

You can try filling the missing values in your object and numeric columns with an appropriate placeholder ('null' for strings and 0 for numbers):

# fill missing values in numeric columns with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)

# fill missing values in object (string) columns with 'null'
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')

Riva answered 2/5, 2022 at 18:21 Comment(0)
