I am trying to discretize a continuous variable, cutting it into three levels. I want to do the same thing for the log of the positive continuous variable (in this case, income).
library(dplyr)

set.seed(3)
mydata <- data.frame(realinc = rexp(10000))  # simulated positive "income"
summary(mydata)

new <- mydata %>%
  select(realinc) %>%
  mutate(logrealinc = log(realinc),
         # cut(x, 3) splits the range of x into 3 intervals
         realincTercile = cut(realinc, 3),
         logrealincTercile = cut(logrealinc, 3),
         # integer codes of the factor levels, for easy comparison
         realincTercileNum = as.numeric(realincTercile),
         logrealincTercileNum = as.numeric(logrealincTercile))

new[sample(1:nrow(new), 10),]
I would have thought that using cut() would produce identical levels for the discretized factors of each of these variables (income and log income), because log is a monotone function. So the two columns on the right here should be equal, but that doesn't seem to happen. What's going on?
> new[sample(1:nrow(new), 10),]
       realinc  logrealinc  realincTercile logrealincTercile realincTercileNum logrealincTercileNum
7931 0.2967813 -1.21475972 (-0.00805,2.83]     (-4.43,-1.15]                 1                    2
9036 0.9511824 -0.05004944 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
8204 4.5365676  1.51217069     (2.83,5.66]      (-1.15,2.15]                 2                    3
3136 2.0610693  0.72322490 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
9708 0.9655805 -0.03502581 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
5942 0.9149351 -0.08890215 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
4631 0.6987581 -0.35845064 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
7309 1.9532566  0.66949804 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
7708 0.4220254 -0.86268973 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
2965 1.3690976  0.31415186 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
Edit: @nicola's comment explains the source of the problem. In cut's documentation, "equal-length intervals" refers to the length of each interval in the space of the continuous argument (the input). I had originally read "equal-length intervals" as meaning that the number of elements assigned to each level (the output) would be equal.
Is there a function that does what I'm describing, where the number of elements in each output level is equal? Equivalently, where the levels of newfunc(realinc) and newfunc(logrealinc) are equal?
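One way to get what is described here is to bin on ranks instead of on the range, since ranks do not change under a strictly increasing transformation such as log. The following is only a minimal sketch in base R: rank_bins is a hypothetical helper name (it is not part of base R or dplyr), and it assumes there are no missing values and that ties can be broken by position.

```r
set.seed(3)
realinc <- rexp(10000)        # same simulated income as in the question
logrealinc <- log(realinc)

# Hypothetical helper (not from base R or dplyr): equal-count bins from ranks.
# Each value's rank is scaled to (0, n_bins] and rounded up to a bin number.
rank_bins <- function(x, n_bins = 3) {
  r <- rank(x, ties.method = "first")
  as.integer(ceiling(n_bins * r / length(x)))
}

inc_bin <- rank_bins(realinc)
log_bin <- rank_bins(logrealinc)

table(inc_bin)                # number of elements per level
identical(inc_bin, log_bin)   # do income and log income get the same levels?
```

With this approach the level sizes should differ by at most one element, and the two assignments should match as long as log() preserves the ordering of the values. dplyr's ntile() is built on the same rank-based idea and may be a more convenient choice inside mutate().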
log is not a linear transformation. Say that x is uniformly distributed between 1 and 5. Do you expect that log(x) is uniformly distributed between log(1) and log(5)? In your example, try hist(new$realinc) and hist(new$logrealinc) to see how they differ. cut just cuts the entire range into intervals of basically constant width; an element can well fall into one interval while its log falls into another. – Temple
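The uniform example from this comment can be checked directly. This is a small sketch with an arbitrary seed and sample size (both are assumptions, not taken from the question); it cuts x and log(x) into three equal-width intervals and cross-tabulates the resulting levels.

```r
set.seed(1)
x <- runif(10000, min = 1, max = 5)   # uniform between 1 and 5, as in the comment

# labels = FALSE makes cut() return integer level codes instead of a factor
x_bin   <- cut(x, 3, labels = FALSE)
log_bin <- cut(log(x), 3, labels = FALSE)

table(x_bin)            # counts per level for x
table(log_bin)          # counts per level for log(x)
table(x_bin, log_bin)   # off-diagonal cells: elements whose level changes
```

The counts for x should come out roughly equal because x is uniform, while the counts for log(x) should be clearly uneven, and any nonzero off-diagonal cell marks elements that land in one interval on the original scale and a different one on the log scale.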