Remove decimal points from pandas qcut intervals (transform intervals to integers)
K

2

5

I have many scores in the column of an object named example. I want to split these scores into deciles and assign the corresponding decile interval to each row. I tried the following:

import random
import pandas as pd
random.seed(420) #blazeit
example = pd.DataFrame({"Score":[random.randrange(350, 1000) for i in range(1000)]})
example["Decile"] = pd.qcut(example["Score"], 10, labels=False) + 1 # Deciles as integer from 1 to 10
example["Decile_interval"] = pd.qcut(example["Score"], 10) # Decile as interval

This gives me the deciles I'm looking for. However, I would like the deciles in example["Decile_interval"] to be integers, not floats. I tried precision=0 but it just shows .0 at the end of each number.

How can I transform the floats in the intervals to integers?

EDIT: As pointed out by @ALollz, rounding the endpoints will change the decile distribution. However, I am doing this for presentation purposes, so I am not worried about that. Props to @JuanC for realizing this and posting one solution.

Kenaz answered 9/9, 2019 at 15:9 Comment(2)
Well if you round the endpoints to integers you'll no longer have deciles... So what's more important?Quito
@Quito I'd rather have rounded intervals than exact deciles. An alternative would be to create a new column that simply printed the intervals as integers while keeping the true values in the original column.Kenaz
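A rough sketch of that alternative, for what it's worth: keep the real qcut intervals in one column and build a second, display-only column of strings with rounded endpoints. The names `int_label` and `Decile_label` are made up for this example, and the label of the lowest bin is cosmetic only (qcut nudges its left edge just below the minimum so that the minimum is included).

```python
import random

import pandas as pd

random.seed(420)
example = pd.DataFrame({"Score": [random.randrange(350, 1000) for _ in range(1000)]})

# True decile intervals (float endpoints), kept untouched for any real analysis.
example["Decile_interval"] = pd.qcut(example["Score"], 10)


def int_label(interval):
    """Format an interval with endpoints rounded to integers (hypothetical helper)."""
    return f"({int(round(interval.left))}, {int(round(interval.right))}]"


# Display-only string column; iterating the column yields pd.Interval objects.
example["Decile_label"] = [int_label(iv) for iv in example["Decile_interval"]]
```

Because the labels are plain strings, they no longer sort or compare like intervals, so this is probably only suitable for reports and plots.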
V
6

This is my solution using a simple apply function:

example["Decile_interval"] = example["Decile_interval"].apply(lambda x: pd.Interval(left=int(round(x.left)), right=int(round(x.right))))
Vie answered 9/9, 2019 at 15:40 Comment(0)
T
2

There might be a better solution, but this works:

import numpy as np

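int_categories = [pd.Interval(int(np.round(i.left)), int(np.round(i.right))) for i in example.Decile_interval.cat.categories]
# Assigning to .cat.categories directly is not supported in recent pandas versions; rename_categories returns a new Series
example["Decile_interval"] = example["Decile_interval"].cat.rename_categories(int_categories)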

Output:

0      (350, 418]
1      (680, 740]
2      (606, 680]
3      (740, 798]
4      (418, 474]
5      (418, 474]
.           .
Telfore answered 9/9, 2019 at 15:27 Comment(3)
The only issue is that pd.qcut is slightly smarter and knows to change the left most bin to be 349.999, that way 350 gets grouped and not excluded.Quito
It seems this change is mostly for presentation purposes so total accuracy of the intervals isn't very relevant to OP, but that's a good point neverthelessTelfore
@Quito That's right, this is more for presentation purposes. This solution works.Kenaz
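Regarding the point about the lowest bin: one possible way around it, sketched under the assumption that the scores are integers (as they are in the question), is to ask qcut for its bin edges with retbins=True, floor them to integers, move the first edge one below the minimum, and pass the result to pd.cut. The variable names `edges` and `int_edges` are just illustrative, and this would raise an error if flooring ever produced duplicate edges.

```python
import random

import numpy as np
import pandas as pd

random.seed(420)
example = pd.DataFrame({"Score": [random.randrange(350, 1000) for _ in range(1000)]})

# Ask qcut for the decile edges as well as the binned values.
deciles, edges = pd.qcut(example["Score"], 10, retbins=True)

# Floor every edge to an integer. For integer scores, "x <= edge" and
# "x <= floor(edge)" select the same rows, so bin membership should not change.
int_edges = np.floor(edges).astype(int)

# Intervals are open on the left, so move the first edge below the minimum
# to keep the lowest score inside the first bin.
int_edges[0] -= 1

# Cutting on integer edges gives intervals with integer endpoints.
example["Decile_interval"] = pd.cut(example["Score"], bins=int_edges)
```

With non-integer scores the flooring would shift some rows between bins, so in that case this is only an approximation of the deciles.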
