Discretizing the log of a continuous variable

I am trying to discretize a continuous variable, cutting it into three levels. I want to do the same thing for the log of the positive continuous variable (in this case, income).

library(dplyr)
set.seed(3)
mydata <- data.frame(realinc = rexp(10000))

summary(mydata)

new <- mydata %>%
  select(realinc) %>%
  mutate(logrealinc = log(realinc),
         # cut(x, 3) splits the range of x into three intervals
         realincTercile = cut(realinc, 3),
         logrealincTercile = cut(logrealinc, 3),
         # integer codes of the resulting factor levels
         realincTercileNum = as.numeric(realincTercile),
         logrealincTercileNum = as.numeric(logrealincTercile))

new[sample(1:nrow(new), 10),]

I would have thought that using cut() would produce identical levels for the discretized factors of each of these variables (income and log income), because log is a monotone function. So the two columns on the right here should be equal, but that doesn't seem to happen. What's going on?

> new[sample(1:nrow(new), 10),]
       realinc  logrealinc  realincTercile logrealincTercile realincTercileNum logrealincTercileNum
7931 0.2967813 -1.21475972 (-0.00805,2.83]     (-4.43,-1.15]                 1                    2
9036 0.9511824 -0.05004944 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
8204 4.5365676  1.51217069     (2.83,5.66]      (-1.15,2.15]                 2                    3
3136 2.0610693  0.72322490 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
9708 0.9655805 -0.03502581 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
5942 0.9149351 -0.08890215 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
4631 0.6987581 -0.35845064 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
7309 1.9532566  0.66949804 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
7708 0.4220254 -0.86268973 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
2965 1.3690976  0.31415186 (-0.00805,2.83]      (-1.15,2.15]                 1                    3

Edit: @nicola's comment explains the source of the problem. In cut's documentation, "equal-length intervals" refers to the width of each interval in the space of the continuous argument. I had originally interpreted "equal-length intervals" as meaning that each level of the output would receive the same number of elements, rather than that each interval would span the same width of the input.
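To make the distinction concrete, here is a minimal sketch in base R. It regenerates the same simulated data as a plain vector named realinc (an assumption for brevity; the question stores it in the mydata data frame) and compares the break points that cut(x, 3) works from with the number of elements landing in each interval.

```r
set.seed(3)
realinc <- rexp(10000)

# cut(x, 3) divides the range of x into three intervals of equal width
bins <- cut(realinc, 3)

# Equal steps across the range; cut() additionally moves the two outer
# limits outward by a small fraction of the range so no value is dropped
approx_breaks <- seq(min(realinc), max(realinc), length.out = 4)
diff(approx_breaks)   # three identical widths

# The counts per interval are very unequal for skewed data
counts <- table(bins)
counts
```

Because the exponential draws are heavily right-skewed, almost all of them fall into the first interval, which is why equal width does not imply equal counts.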

Is there a function that does what I'm describing, one where the number of elements in each output level is equal? Equivalently, one where the levels of newfunc(realinc) and newfunc(logrealinc) are equal?

Hargreaves answered 13/4, 2016 at 4:55 Comment(4)
log is not a linear transformation. Say that x is uniformly distributed between 1 and 5. Do you expect log(x) to be uniformly distributed between log(1) and log(5)? In your example, try hist(new$realinc) and hist(new$logrealinc) to see how they differ. cut just divides the entire range into intervals of equal width; an element can fall into one interval while its log falls into another. - Temple
@Temple Thanks, that's helpful. I've updated the question with that in mind. - Hargreaves
You can search for "split vector into equal chunks". - Limelight
#3318833 - Limelight

If you want your levels to be equally populated, take a look at the quantile function. Try for instance:

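# Use the tercile quantiles as break points so each level holds about a third of the data
x <- cut(new$realinc, quantile(new$realinc, 0:3/3))
y <- cut(new$logrealinc, quantile(new$logrealinc, 0:3/3))
# na.rm = TRUE because the minimum lies outside the first (left-open) interval and becomes NA
all(as.integer(x) == as.integer(y), na.rm = TRUE)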
#[1] TRUE
table(x)
#x
#(0.000444,0.396]     (0.396,1.12]      (1.12,8.49] 
#            3333             3333             3333
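
The counts above add up to 9999 rather than 10000: with cut's default left-open intervals, the smallest value equals the lowest break and is coded NA. A minimal variation, assuming you want every observation assigned to a level, passes include.lowest = TRUE. The helper name tercile is hypothetical and used only for illustration, and the data are regenerated as plain vectors so the sketch runs on its own.

```r
set.seed(3)
realinc <- rexp(10000)
logrealinc <- log(realinc)

# Hypothetical helper (not part of any package): integer tercile codes
# built from quantile break points. include.lowest = TRUE closes the first
# interval on the left so the minimum is kept instead of becoming NA.
tercile <- function(v) {
  breaks <- quantile(v, probs = 0:3 / 3)
  as.integer(cut(v, breaks = breaks, include.lowest = TRUE))
}

x <- tercile(realinc)
y <- tercile(logrealinc)

anyNA(x)          # check for unassigned observations
identical(x, y)   # compare the codes for income and log income
table(x)          # counts per level
```

Since log is monotone, the ordering of the observations is preserved, so quantile-based breaks should place each observation in the same level before and after the transformation.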
Temple answered 13/4, 2016 at 5:35 Comment(0)
