How to assign a counter to a specific subset of a data.frame which is defined by a factor combination?

I have a data frame with some factor variables. I want to add a new column to this data frame that holds a running index within each combination of those factor variables.

   data <- data.frame(fac1 = factor(rep(1:2, 5)), fac2 = sample(letters[1:3], 10, replace = TRUE))

Gives me something like:

        fac1 fac2
     1     1    a
     2     2    c
     3     1    b
     4     2    a
     5     1    c
     6     2    b
     7     1    a
     8     2    a
     9     1    b
     10    2    c

What I want is a counter that numbers the occurrences of each factor combination in order of appearance, like this:

        fac1 fac2  counter
     1     1    a        1
     2     2    c        1
     3     1    b        1
     4     2    a        1
     5     1    c        1
     6     2    b        1
     7     1    a        2
     8     2    a        2
     9     1    b        2
     10    2    c        2

So far I have thought about using tapply to get the counter for every factor combination, which works fine:

counter <-tapply(data$fac1, list(data$fac1,data$fac2), function(x) 1:length(x))

But I do not know how to assign the resulting counter list (e.g. unlisted) back to the matching rows of the data frame without inefficient looping :)

Datestamp answered 25/10, 2012 at 15:10 Comment(4)
Does it need to be in order or do you just want net counts? If you just want counts, table(paste(data$fac1,data$fac2,sep="-")) might help. - Pesach
Hi! Within each fac1 x fac2 combination the order matters. (One can think of it as the number of times a person "fac1" saw the letter "fac2".) - Datestamp
You could use the same basic strategy, but switch from tapply to either ddply from plyr, or if your data is huge and performance is an issue, data.table. - Skipjack
Possible duplicate of "numbering rows within groups in a data frame". - Hayton

This is a job for the ave() function:

# Use set.seed for reproducible examples 
#   when random number generation is involved
set.seed(1) 
myDF <- data.frame(fac1 = factor(rep(1:2, 7)), 
                   fac2 = sample(letters[1:3], 14, replace = TRUE), 
                   stringsAsFactors=FALSE)
myDF$counter <- ave(myDF$fac2, myDF$fac1, myDF$fac2, FUN = seq_along)
myDF
#    fac1 fac2 counter
# 1     1    a       1
# 2     2    b       1
# 3     1    b       1
# 4     2    c       1
# 5     1    a       2
# 6     2    c       2
# 7     1    c       1
# 8     2    b       2
# 9     1    b       2
# 10    2    a       1
# 11    1    a       3
# 12    2    a       2
# 13    1    c       2
# 14    2    b       3

Note the use of stringsAsFactors = FALSE in the data.frame() step. If fac2 is a factor instead, convert it to character inside the call and you still get the same output: myDF$counter <- ave(as.character(myDF$fac2), myDF$fac1, myDF$fac2, FUN = seq_along).
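
One detail that may matter downstream: ave() returns a vector of the same type as its first argument, so with a character fac2 the counter above is stored as character. A minimal sketch, assuming a small hand-made data frame (the values below are illustrative, not taken from the question), that passes an integer vector as the first argument so the counter comes back as integer:

```r
# Small fixed example so the result does not depend on the RNG
myDF <- data.frame(fac1 = factor(c(1, 2, 1, 2, 1, 2)),
                   fac2 = c("a", "b", "a", "b", "b", "b"),
                   stringsAsFactors = FALSE)

# ave() keeps the type of its first argument, so hand it integers;
# the grouping is still defined by fac1 and fac2
myDF$counter <- ave(seq_len(nrow(myDF)), myDF$fac1, myDF$fac2,
                    FUN = seq_along)

str(myDF$counter)
```

This also sidesteps the factor-versus-character question for fac2, since fac2 is only used for grouping here.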

Misconception answered 25/10, 2012 at 15:53 Comment(1)
I compared mrdwab's solution and mine in terms of efficiency (I could not get @mplourde's to work), and mrdwab's is twice as fast: for 1,000,000 rows it is 1.693 vs. 3.382 sec. - Cormack

A data.table solution

library(data.table)
DT <- data.table(data)
DT[, counter := seq_len(.N), by = list(fac1, fac2)]
Hayton answered 25/10, 2012 at 22:35 Comment(0)

This is a base R way that avoids (explicit) looping.

data$counter <- with(data, {
    inter <- as.character(interaction(fac1, fac2))
    names(inter) <- seq_along(inter)
    inter.ordered <- inter[order(inter)]
    counter <- with(rle(inter.ordered), sequence(lengths))
    counter[match(names(inter), names(inter.ordered))]
})
Roentgenotherapy answered 25/10, 2012 at 15:42 Comment(0)

Here is a variant with a little looping (I have renamed your variable to "x" since "data" is already the name of a base R function):

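x <- data.frame(fac1 = rep(1:2, 5), fac2 = sample(letters[1:3], 10, replace = TRUE))
# Temporary key identifying each fac1/fac2 combination
x$fac3 <- paste(x$fac1, x$fac2, sep = "")
x$ctr <- 1
y <- table(x$fac3)
# Number the rows of each combination in order of appearance
for (i in seq_along(rownames(y))) {
  rows <- x$fac3 == rownames(y)[i]
  x$ctr[rows] <- seq_len(sum(rows))
}
x <- x[-3]  # drop the temporary key column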

No idea whether this is efficient over a large data.frame but it works!
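
To get a rough feel for the efficiency question, something like the following could be tried. It is only a sketch: the function names loop_counter and ave_counter, the seed, and the data size are made up for illustration, the loop is a lightly adapted version of the one above, and timings will vary by machine.

```r
set.seed(42)
n <- 1e5  # hypothetical size, chosen only for illustration
x <- data.frame(fac1 = rep(1:2, n / 2),
                fac2 = sample(letters[1:3], n, replace = TRUE))

# Loop variant adapted from this answer, wrapped in a function
loop_counter <- function(d) {
  key <- paste(d$fac1, d$fac2, sep = "")
  ctr <- integer(nrow(d))
  for (k in unique(key)) {
    rows <- key == k
    ctr[rows] <- seq_len(sum(rows))
  }
  ctr
}

# ave() variant, fed an integer vector so the result is integer too
ave_counter <- function(d) {
  ave(seq_len(nrow(d)), d$fac1, d$fac2, FUN = seq_along)
}

t_loop <- system.time(res_loop <- loop_counter(x))
t_ave  <- system.time(res_ave  <- ave_counter(x))

identical(res_loop, res_ave)  # both approaches should agree
c(loop = t_loop[["elapsed"]], ave = t_ave[["elapsed"]])
```

With only a handful of factor combinations the loop runs few iterations, so any difference is likely to grow mainly with the number of distinct combinations rather than with the number of rows alone.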

Cormack answered 25/10, 2012 at 15:57 Comment(0)
