Let's say we have this DataFrame:
df = pd.DataFrame(columns=["value"], data=[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 5, 7, 10])
I want to split up the values into 5 bins based on their frequency, with the condition that no value is in more than one bin. You might say: "pd.qcut
is the way to go!" But the issue with qcut
is that it will either not make unique bins, or if you use duplicates="drop"
then it will decrease the number of bins you want. I am aware that a similar question has been asked here, but none of the answers seemed to directly answer that question, and also the question I'm asking is a bit more complex.
Here is what happens if we use pd.qcut
:
- Case 1:
df["bins"] = pd.qcut(df["value"], q=5, duplicates="drop")
: this gives us 3 bins instead of 5. - Case 2:
df["bins"] = pd.qcut(df["value"].rank(method='first'), q=5)
: this gives us 5 bins but the value of 0 is split across two bins, and so is the value of 1.
Here is what is desired:
If we run df["value"].value_counts(normalize=True)
, we see:
So from the percentage breakdown and the condition that each value is only in one bin, we would want bin 1 to contain all 0s (and only 0s), bin 2 to contain all 1s (and only 1s), bin 3 to contain all 2s (and only 2s), bin 4 to contain 5&7, and bin 5 to contain 7&10. This is because we wanted 5 bins, so each bin should have 20% of the values. But more than 20% of the values are 0, so 0 should be its own bin (and same for 1). And then once those two bins are assigned, there are 3 remaining bins to split up the rest of the values in. So each of those bins should contain 33% of the rest of the values, and more than 33% of the remaining values are 2, so 2 should be its own bin as well. And then finally that leaves 2 bins for the remaining values and those are split 50-50.
current_bin < total_bins
and that gives us 5 bins, but it makes the last value (10) be its in its own bin, whereas I think we want both 7 and 10 to be in the 5th bin. I think maybe what we need to do is changefraction
and(1 / total_bins)
to change based on the number of bins that have already been finalized. I.e. if 2 bins have already been finalized, then we calculate those values based on only 3 bins left. – Webbing