Setting levels when creating a factor vs. `levels()<-`

Let's create some factors first:

F1 <- factor(c(1,2,20,10,25,3))
F2 <- factor(paste0(F1, " years"))
F3 <- F2
levels(F3) <- paste0(sort(F1), " years")
F4 <- factor(paste0(F1, " years"), levels=paste0(sort(F1), " years"))

then take a look at them:

> F1
[1] 1  2  20 10 25 3 
Levels: 1 2 3 10 20 25

> F2
[1] 1 years  2 years  20 years 10 years 25 years 3 years 
Levels: 1 years 10 years 2 years 20 years 25 years 3 years

> F3
[1] 1 years  3 years  10 years 2 years  20 years 25 years
Levels: 1 years 2 years 3 years 10 years 20 years 25 years

> F4
[1] 1 years  2 years  20 years 10 years 25 years 3 years 
Levels: 1 years 2 years 3 years 10 years 20 years 25 years

First, I note that the order of the levels in F2 is not the same as in F1, which is what I expected. The factor documentation reveals why: the levels are created by first sorting the input. In the case of F2, the inputs are strings, where sorting takes length into account (?).

What is harder for me to understand is the difference in setting the levels between F3 and F4. In F3 I set the levels after the factor is created while in F4 I set them explicitly when creating the factor. In F3, the use of levels()<- isn't purely a relabel of the levels, but neither does it reorder them the way I expected.

Can someone explain the difference?

Diller answered 20/7, 2012 at 21:23

F1 uses numeric sorting, as you figured out yourself.

F2 uses lexicographic sorting, first comparing the first character, breaking ties using the second, and so on, which is why "10 years" is between "1 years" and "2 years".
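To see the two orderings side by side, here is a minimal sketch. It passes method = "radix" to sort(), which collates character vectors in the C locale, so the result does not depend on the session's locale (factor() itself uses the session's collation, so the order could differ elsewhere). The variable names are only for illustration.

```r
# Values from the question, once as numbers and once as strings
nums <- c(1, 2, 20, 10, 25, 3)
strs <- paste0(nums, " years")

# Numeric sort compares magnitudes
sorted_nums <- sort(nums)

# Character sort compares character by character; method = "radix"
# forces C-locale collation so the result is reproducible
sorted_strs <- sort(strs, method = "radix")

print(sorted_nums)  # 1 2 3 10 20 25
print(sorted_strs)  # "1 years" "10 years" "2 years" "20 years" "25 years" "3 years"
```

The space after "1" compares lower than the "0" in "10", which is why "1 years" still comes before "10 years".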

F4 is created from a character vector, but with an explicit list of possible levels. That list is taken as given (without sorting), and its entries are identified with the numbers 1 through 6. Every item of your input is then matched against the set of possible levels, and the associated number is stored. After all, a factor is simply a vector of integers (as.numeric will show them to you) associated with a list of levels used for printing. So the values of F4 print just like those of F2, but its levels are in a different order.
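The lookup described above can be sketched with match(). This is only an illustration of the idea, not a claim about how factor() is implemented internally; x, lev and codes are names chosen for this example.

```r
x   <- paste0(c(1, 2, 20, 10, 25, 3), " years")   # the data
lev <- paste0(c(1, 2, 3, 10, 20, 25), " years")   # the explicit levels

F4 <- factor(x, levels = lev)

# The stored codes are the positions of each value in `lev`
codes <- as.integer(F4)
print(codes)                             # 1 2 5 4 6 3
print(identical(codes, match(x, lev)))   # TRUE

# Printing looks each code up in the level vector again
print(identical(levels(F4)[codes], x))   # TRUE
```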

F3 was created from F2, so it initially had F2's lexicographically sorted levels. The assignment only replaces the set of level names, not the numbers in the vector, so you can think of it as renaming the existing levels position by position. If you look at the numbers, they match those of F2, whereas the level names, and their order in particular, match those of F4.

Your question claims that this was not purely a relabel, but it is. You obtain F3 from F2 by applying the following renames (in both rows of the printout):

  • 10 → 2
  • 2 → 3
  • 20 → 10
  • 25 → 20
  • 3 → 25
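The renames listed above can be reproduced as a named vector. This is a sketch under one assumption: the collation is pinned to the C locale with Sys.setlocale() so that factor() orders the levels of F2 as in the question's printout. rename_map is a name made up for this example.

```r
# Pin collation so factor() orders the levels as in the question
invisible(Sys.setlocale("LC_COLLATE", "C"))

x  <- paste0(c(1, 2, 20, 10, 25, 3), " years")
F2 <- factor(x)
F3 <- F2
levels(F3) <- paste0(c(1, 2, 3, 10, 20, 25), " years")

# Old level name -> new level name, position by position
rename_map <- setNames(levels(F3), levels(F2))
print(rename_map)

# The integer codes are untouched by the assignment
print(identical(as.integer(F2), as.integer(F3)))  # TRUE

# Applying the map to F2's values reproduces F3's values
print(identical(unname(rename_map[as.character(F2)]), as.character(F3)))  # TRUE
```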

The str function is also a good tool to look at the internal representation of a factor.
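For instance, a short sketch on F1 from the question (codes is a name used only here):

```r
F1 <- factor(c(1, 2, 20, 10, 25, 3))

# str() shows the level names and the integer codes on one line
str(F1)

# unclass() strips the factor class, leaving the integer codes
# with the levels attached as an attribute
codes <- unclass(F1)
print(codes)
```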

Flatulent answered 20/7, 2012 at 22:14

You created F2 from the following data:

> paste0(F1, " years")
[1] "1 years"  "2 years"  "20 years" "10 years" "25 years"
[6] "3 years"

Sorting the unique values to generate the levels results in the alphanumeric ordering that you mention:

> levels(F2)
[1] "1 years"  "10 years" "2 years"  "20 years" "25 years"
[6] "3 years"

Hence, "2 years" is actually stored as a 3: it is in the third category, or level. Note that this gives rise to a subtle difference in the way the data are stored in the two factors:

> as.numeric(F1)
[1] 1 2 5 4 6 3
> as.numeric(F2)
[1] 1 3 4 2 5 6

When you now set the levels of F3 explicitly, you are passing in these values:

> paste0(sort(F1), " years")
[1] "1 years"  "2 years"  "3 years"  "10 years" "20 years"
[6] "25 years"

From above, the data were stored in F3 as:

> as.numeric(F3)
[1] 1 3 4 2 5 6

Hence the 2nd element of F3, which is stored as a 3, gets the third level that you specified: "3 years".

What levels<- does, therefore, is change the mapping between the numeric representation and the labels that are displayed. It most certainly is not a way to rearrange or relevel a factor, which is what you appear to have expected. levels<- doesn't reorder the data either; it just changes the level names of the factor. The underlying numeric representation stays the same and is simply mapped to the new levels.
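If the goal is to reorder the levels while keeping every observation's label, one common approach is to pass the existing factor back through factor() with the desired levels. A minimal sketch, with wanted and F2_reordered as illustrative names:

```r
x  <- paste0(c(1, 2, 20, 10, 25, 3), " years")
F2 <- factor(x)
wanted <- paste0(c(1, 2, 3, 10, 20, 25), " years")

# Re-create the factor with the desired level order; values are
# matched by name, so each observation keeps its label
F2_reordered <- factor(F2, levels = wanted)

print(identical(as.character(F2_reordered), x))  # TRUE: data unchanged
print(identical(levels(F2_reordered), wanted))   # TRUE: levels reordered
print(as.integer(F2_reordered))                  # 1 2 5 4 6 3
```

Here the integer codes change and the labels stay put, which is the opposite of what levels<- does.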

In F4 you set the levels explicitly at create time, hence the data are stored numerically in the same way as with F1:

> F4 <- factor(paste0(F1, " years"), levels=paste0(sort(F1), " years"))
> as.numeric(F4)
[1] 1 2 5 4 6 3

F3 and F4 end up with identical levels, so the difference you see between them comes entirely from the different underlying numeric representations of the individual data points.

I was bitten by this before and now know to watch for it but it does catch me out from time to time.

Squarerigged answered 20/7, 2012 at 22:16