Pandas: bin column by frequency with unique bin intervals
Asked Answered
W

1

0

Let's say we have this DataFrame:

df = pd.DataFrame(columns=["value"], data=[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 5, 7, 10])

I want to split up the values into 5 bins based on their frequency, with the condition that no value is in more than one bin. You might say: "pd.qcut is the way to go!" But the issue with qcut is that it will either not make unique bins, or if you use duplicates="drop" then it will decrease the number of bins you want. I am aware that a similar question has been asked here, but none of the answers seemed to directly answer that question, and also the question I'm asking is a bit more complex.

Here is what happens if we use pd.qcut:

  1. Case 1: df["bins"] = pd.qcut(df["value"], q=5, duplicates="drop"): this gives us 3 bins instead of 5.
  2. Case 2: df["bins"] = pd.qcut(df["value"].rank(method='first'), q=5): this gives us 5 bins but the value of 0 is split across two bins, and so is the value of 1.

Here is what is desired:

If we run df["value"].value_counts(normalize=True), we see:

enter image description here

So from the percentage breakdown and the condition that each value is only in one bin, we would want bin 1 to contain all 0s (and only 0s), bin 2 to contain all 1s (and only 1s), bin 3 to contain all 2s (and only 2s), bin 4 to contain 5&7, and bin 5 to contain 7&10. This is because we wanted 5 bins, so each bin should have 20% of the values. But more than 20% of the values are 0, so 0 should be its own bin (and same for 1). And then once those two bins are assigned, there are 3 remaining bins to split up the rest of the values in. So each of those bins should contain 33% of the rest of the values, and more than 33% of the remaining values are 2, so 2 should be its own bin as well. And then finally that leaves 2 bins for the remaining values and those are split 50-50.

Webbing answered 27/4 at 12:50 Comment(0)
B
1

Your explanation is a little bit unclear but if I understand it, this is what you want to do: You want to calculate the frequency of each unique value in you df and determine the bin assignments based on these frequencies, ensuring that each unique value is assigned to only one bin. The last step being to create the bins while considering the distribution of the data points and the count required in each bin.

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=["value"], data=[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 5, 7, 10])

value_counts = df['value'].value_counts(normalize=True)
sorted_values = value_counts.sort_index()

total_bins = 5
bins = pd.Series(index=sorted_values.index, dtype='int32')

current_bin = 1
accumulated = 0 

for value, fraction in sorted_values.items():
    if accumulated + fraction > (1 / total_bins) and current_bin < total_bins:
        current_bin += 1  
        accumulated = fraction  
    else:
        accumulated += fraction
    bins[value] = current_bin

if current_bin < total_bins:
    bins = bins.fillna(total_bins)
df['bins'] = df['value'].map(bins)

print(df)
print("\nBinning assignment based on value:")
print(bins)

which will give you

    value  bins
0       0   2.0
1       0   2.0
2       0   2.0
3       0   2.0
4       0   2.0
5       1   3.0
6       1   3.0
7       1   3.0
8       1   3.0
9       1   3.0
10      2   4.0
11      2   4.0
12      2   4.0
13      3   5.0
14      5   5.0
15      7   5.0
16     10   5.0

Binning assignment based on value:
value
0     2.0
1     3.0
2     4.0
3     5.0
5     5.0
7     5.0
10    5.0
dtype: float64
Blotch answered 27/4 at 15:58 Comment(2)
Thank you Serge for submitting this. It seems to be going in the right direction. One issue is that the final output only has 4 bins instead of 5. I tried changing the code to current_bin < total_bins and that gives us 5 bins, but it makes the last value (10) be its in its own bin, whereas I think we want both 7 and 10 to be in the 5th bin. I think maybe what we need to do is change fraction and (1 / total_bins) to change based on the number of bins that have already been finalized. I.e. if 2 bins have already been finalized, then we calculate those values based on only 3 bins left.Webbing
sorry I meant current_bin <= total_binsWebbing

© 2022 - 2024 — McMap. All rights reserved.