remove cases, all ids, for cases where NO changes have occurred between time 1 and time 2
I have two data frames with a range of cases, one from time 1 and one from time 2. I am looking for a way to quickly identify cases where changes have occurred between time 1 and time 2, and I am a bit stuck.

Here is an example. So, I have a data frame from time 1,

df.t1 <- data.frame(id = c(1,1,1,2,2,3,3,5,5,6), ABC = LETTERS[1:10], Num = 101:110)

and it looks like this,

df.t1
   id ABC Num
1   1   A 101
2   1   B 102
3   1   C 103
4   2   D 104
5   2   E 105
6   3   F 106
7   3   G 107
8   5   H 108
9   5   I 109
10  6   J 110

time 2 rolls around,

df.t2 <- df.t1

and some changes occur,

df.t2[3,3] <- 104
df.t2[2,2] <- "H"
df.t2[8,3] <- 999
df.t2[10,3] <- NA
df.t2[11,] <- c(3, "J", 107)

this is time 2,

df.t2
   id ABC  Num
1   1   A  101
2   1   H  102
3   1   C  104
4   2   D  104
5   2   E  105
6   3   F  106
7   3   G  107
8   5   H  999
9   5   I  109
10  6   J <NA>
11  3   J  107

I'm now looking for a quick way to remove cases (all rows for an id) where NO changes have occurred in any row for that id between time 1 and time 2. In this specific example, id 2 is the only case where no changes occurred between time 1 and time 2.

I am looking for an end result that looks like this,

(df <- subset(df.t2, id != 2))
   id ABC  Num
1   1   A  101
2   1   H  102
3   1   C  104
6   3   F  106
7   3   G  107
8   5   H  999
9   5   I  109
10  6   J <NA>
11  3   J  107

Any help would be appreciated.

Zoa answered 14/2, 2013 at 0:25 Comment(3)
At id=3 there is a change (a new entry). You don't consider them?Higher
@Arun, my bad. I reversed the point, I'll update the question. Thanks!Zoa
To add some extra clarification, I want to keep cases in df where changes have occurred between time 1 and time 2. Please let me know if it's still unclear!Zoa
I really liked @alexwhan's answer here https://mcmap.net/q/1936582/-remove-cases-all-id-39-s-for-cases-where-no-changes-have-occured-between-time-1-and-time-2, but it sounded like the combining was slow because of many columns and long character strings. I wonder if numerically getting a unique number for each combination would be faster.

# get a matrix of unique integers for each column (stacking the two data frames)
ms <- do.call(cbind, lapply(seq_len(ncol(df.t1)), function(ni) {
  xi <- c(as.character(df.t1[[ni]]), as.character(df.t2[[ni]]))
  match(xi, unique(xi))
}))
# convert to base max(ms) to get a single unique identifying number
us <- as.vector(ms %*% max(ms)^c(0:(ncol(ms)-1)))
u2 <- us[(nrow(df.t1)+1):length(us)]
u1 <- us[1:nrow(df.t1)]
# now get changed values and proceed as in alexwhan's answer
ch <- unique(df.t2$id[! u2 %in% u1])
df.t2[df.t2$id %in% ch,]

Here's a slightly different way to get the us variable. I expect it's a little slower, but it takes more care to keep everything as integers instead of floating-point numerics, so uniqueness should be guaranteed, and integer overflow will produce NA with a warning rather than silently losing precision. (I also subtract one from ms; the result is still unique and everything is just a little smaller.)

base <- as.integer(max(ms)^(0:(ncol(ms) - 1)))      # one place value per column
us <- apply(ms - 1L, 1, function(r) sum(r * base))  # integer row-wise sums
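To see why this encoding gives a single distinct number per distinct row, here is a tiny worked example (a hypothetical two-column code matrix, not data from the question): each column holds integer codes starting at 1, and the place values act like digits in base max(ms).

```r
# Hypothetical 2-column code matrix; max code is 2, so place values are 2^0, 2^1
m <- cbind(c(1L, 2L, 1L), c(1L, 1L, 2L))
b <- max(m) ^ (0:(ncol(m) - 1))   # place values: 1, 2
as.vector((m - 1L) %*% b)         # one distinct number per distinct row: 0 1 2
```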
Shedd answered 14/2, 2013 at 3:30 Comment(8)
This solution is the first that actually works with my data. Very grateful that you found this solution. I don't fully understand it yet, but I appreciate that it's there. Thanks!Zoa
Especially with R, a good solution depends on the size of your data set, so while coming up with a small reproducible example is always recommended, it can also really help to state the size of the data set; in this case, not only how many rows and columns, but also how many ids and how many unique things are in each column. (That last would perhaps affect my solution; I worry a little bit about integer overflow.)Shedd
Thank you for following up with some really sound advice. I will add that information to my post later today (and in future questions). Do you have any good sources where I can learn more about integer overflow?Zoa
#8805279Shedd
PS. Did @alexwhan's answer work better with paste instead of interact? If that works this complication is unnecessary...Shedd
The paste doesn't really work in my case. Your tweak of us did not work for me, so I am using your solution as is, and so far it works great, but I think I should keep looking for a better solution, cf. your warning about integer overflow.Zoa
Well, I don't think anyone here can help with a better solution unless we know more about your specific requirements. I'd suggest creating a data set with random data (using set.seed) that is similar to yours, and starting a new question, referencing this one and asking for ideas on what to do given the larger data set.Shedd
That is excellent advice. I'll do that ASAP. Thank you for all your help and advice.Zoa
I think this would work. A solution using data.table:

require(data.table)
dt1 <- data.table(df.t1)
dt2 <- data.table(df.t2)
# your conversion made them characters
dt2[, `:=`(id = as.numeric(id), Num = as.numeric(Num))]
setkey(dt1, "id", "ABC")
setkey(dt2, "id", "ABC")
dt <- dt1[dt2]
dt2[id %in% dt[, !(all(Num == Num.1)) | any(is.na(c(Num, Num.1))), by=id][V1 == TRUE]$id]

#    id ABC Num
# 1:  1   A 101
# 2:  1   C 104
# 3:  1   H 102
# 4:  3   F 106
# 5:  3   G 107
# 6:  3   J 107
# 7:  5   H 999
# 8:  5   I 109
# 9:  6   J  NA

Alternatively, after obtaining dt = dt1[dt2]:

dt2[id %in% dt[, ctr := Num %in% Num.1, by=1:nrow(dt)][ctr == FALSE, unique(id)]]
Higher answered 14/2, 2013 at 0:52 Comment(3)
Thanks! I could never have come up with that on my own. Again, the issue is that the real data has a lot of columns and it seems I would have to name them all in this function. Regardless, you answered my question.Zoa
So, you would like "any" of the columns that are changed to be selected from df.t2?? If that's the case, I'll edit this post later during the day.Higher
What I am looking for is any row that is changed between time 1 and time 2, and if a row is changed I want all rows relating to that id. I am trying to illustrate this in the last part of my question.Zoa
What about using data.table and keying by all of id, ABC and Num?

require(data.table)
dt1 <- data.table(df.t1)
dt2 <- data.table(df.t2)
# your conversion made them characters
dt2[, `:=`(id = as.numeric(id), Num = as.numeric(Num))]
setkey(dt1, "id", "ABC", "Num")
setkey(dt2, "id", "ABC", "Num")
# then it is just

dt2[dt2[!dt1][,list(unique(id))]]


   id ABC Num
1:  1   A 101
2:  1   C 104
3:  1   H 102
4:  3   F 106
5:  3   G 107
6:  3   J 107
7:  5   H 999
8:  5   I 109
9:  6   J  NA

This uses a not-join between dt2 and dt1, selects the unique id values from the result, and then subsets dt2 appropriately.

with the caveat that NA values in keys can be problematic...
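For comparison, the same not-join idea can be sketched in base R by pasting each row's columns into one key string (a rough sketch assuming the df.t1/df.t2 from the question; note that paste keys can collide if column values happen to contain the separator):

```r
# one key string per row, built from all columns
key1 <- do.call(paste, df.t1)
key2 <- do.call(paste, df.t2)
# ids owning at least one row whose key does not appear at time 1
changed.ids <- unique(df.t2$id[!key2 %in% key1])
df.t2[df.t2$id %in% changed.ids, ]   # drops only id 2 for the example data
```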

Mabe answered 14/2, 2013 at 1:12 Comment(5)
Yes, but that is a merge. He needs those that are not equal between the two. For ex: you get here id=2 which shouldn't be there.Higher
Good point @Arun, I've had a rethink and edited. I can't make the NA cause an issue, but it might in certain situations.Mabe
(+1) Brilliant! I came "close" with dt2[!dt1] but didn't think of this!! I was sure there must be an equivalent of setdiff (for vectors) in data.table.Higher
@mnel, does this entail that I would need to set all 34 column names as setkey in my real data?Zoa
Yes, you could do setkeyv(DT, names(DT)) to avoid having to spell them all out (and ensuring that id is column number 1)Mabe
Try this:

df.t1$interact <- with(df.t1, interaction(id, ABC, Num))
df.t2$interact <- with(df.t2, interaction(id, ABC, Num))

change.ids <- unique(df.t2$id[!df.t2$interact %in% df.t1$interact])
new.df <- df.t2[df.t2$id %in% change.ids,]
Athodyd answered 14/2, 2013 at 0:40 Comment(9)
Thank you for responding to my question. It seems to work after your update, only problem is that I have to define id, ABC, and Num in interaction(), but I guess I can use names().Zoa
What do you want to happen? How is your data structured?Athodyd
You do solve the problem; the issue is that in the real data I have 34 columns and some of them have text in them, so pasting it all together into interact might not be the most optimal solution. Can you think of a way where I can compare it row by row?Zoa
If it's manually naming the columns you're worried about, this might work: df.t1$interact <- with(df.t1, interaction(df.t1[, 1:ncol(df.t1)])) but it's gonna be uglyAthodyd
I like the interaction(df.t1[, 1:ncol(df.t1)]) approach. It might be ugly, but it makes it more flexible.Zoa
I'm trying this solution, but it's killing my computer. I believe it's the combining everything into one variable. I think I need a solution where i do it row by row.Zoa
OK, have a go at the looping idea I posted belowAthodyd
Is it faster if you use paste instead of interaction? interaction is making many more factor levels than you need.Shedd
Also, perhaps generating a unique number would be faster? See my attempt here: https://mcmap.net/q/1936582/-remove-cases-all-id-39-s-for-cases-where-no-changes-have-occured-between-time-1-and-time-2Shedd
OK, here's an alternative with some looping. I haven't tried it on a more complex example, but have a look:

no.change <- vector()
# identify rows which don't change
for (i in 1:nrow(df.t2)) {
  for (k in 1:nrow(df.t1)) {
    # isTRUE() avoids the "missing value where TRUE/FALSE needed" error
    # and makes a row containing NA count as changed
    if (isTRUE(all(df.t2[i, ] == df.t1[k, ])))
      no.change <- c(no.change, i)
  }
}
# get the ids from all the other rows
id.change <- df.t2$id[-no.change]
df <- df.t2[df.t2$id %in% id.change, ]
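Regarding the comments below about growing no.change inside the loop, a preallocated variant of the same double loop might look like this (a sketch using the question's df.t1/df.t2; isTRUE() makes a row containing NA count as changed):

```r
no.change <- logical(nrow(df.t2))            # preallocate one flag per row
for (i in seq_len(nrow(df.t2))) {
  for (k in seq_len(nrow(df.t1))) {
    if (isTRUE(all(df.t2[i, ] == df.t1[k, ])))
      no.change[i] <- TRUE                   # row i also exists at time 1
  }
}
id.change <- unique(df.t2$id[!no.change])
df <- df.t2[df.t2$id %in% id.change, ]       # keeps every id with some change
```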
Athodyd answered 14/2, 2013 at 1:45 Comment(5)
Thanks for giving it another go!Zoa
I get this error Error in if (all(df.t2[i, ] == df.t1[k, ])) no.change <- c(no.change, : missing value where TRUE/FALSE neededZoa
This falls into the 2nd circle of hell. Never grow objects in that way!Mabe
I know I know (you should see some of the truuuuly ugly stuff) - all too often "it just works for right now" makes the gradeAthodyd
@alexwhan, but … well. I don't know how I feel about solving something with code that returns an error. Regardless, you are right.Zoa
