remove cases, all ids, for cases where NO changes have occurred between time 1 and time 2
I have two data frames with a range of cases, one from time 1 and one from time 2. I am looking for a way to quickly identify cases where changes have occurred between time 1 and time 2, and I am a bit stuck.

Here is an example. So, I have a data frame from time 1,

df.t1 <- data.frame(id = c(1,1,1,2,2,3,3,5,5,6), ABC = LETTERS[1:10], Num = 101:110)

and it looks like this,

df.t1
   id ABC Num
1   1   A 101
2   1   B 102
3   1   C 103
4   2   D 104
5   2   E 105
6   3   F 106
7   3   G 107
8   5   H 108
9   5   I 109
10  6   J 110

time 2 rolls around,

df.t2 <- df.t1

and some changes occur,

df.t2[3,3] <- 104
df.t2[2,2] <- "H"
df.t2[8,3] <- 999
df.t2[10,3] <- NA
df.t2[11,] <- c(3, "J", 107)

this is time 2,

df.t2
   id ABC  Num
1   1   A  101
2   1   H  102
3   1   C  104
4   2   D  104
5   2   E  105
6   3   F  106
7   3   G  107
8   5   H  999
9   5   I  109
10  6   J <NA>
11  3   J  107

I'm now looking for a quick way to remove cases (all rows for an id) where NO changes have occurred in any row for that id between time 1 and time 2. In this specific example, id 2 is the only case where no changes occurred between time 1 and time 2.

I am looking for an end result that looks like this,

(df <- subset(df.t2, id != 2))
   id ABC  Num
1   1   A  101
2   1   H  102
3   1   C  104
6   3   F  106
7   3   G  107
8   5   H  999
9   5   I  109
10  6   J <NA>
11  3   J  107

Any help would be appreciated.

Zoa answered 14/2, 2013 at 0:25 Comment(3)
At id=3 there is a change (a new entry). You don't consider them?Higher
@Arun, my bad. I reversed the point, I'll update the question. Thanks!Zoa
To add some extra clarification, I want to keep cases in df where changes have occurred between time 1 and time 2. Please let me know if it's still unclear!Zoa
I really liked @alexwhan's answer here https://mcmap.net/q/1936582/-remove-cases-all-id-39-s-for-cases-where-no-changes-have-occured-between-time-1-and-time-2, but it sounded like the combining was slow because of many columns and long character strings. I wonder if numerically getting a unique number for each combination would be faster.

# get a matrix of unique integers for each column (stacking the two data frames)
ms <- do.call(cbind, lapply(seq_len(ncol(df.t1)), function(ni) {
  xi <- c(as.character(df.t1[[ni]]), as.character(df.t2[[ni]]))
  match(xi, unique(xi))
}))
# convert to base max(ms) to get a single unique identifying number
us <- as.vector(ms %*% max(ms)^c(0:(ncol(ms)-1)))
u2 <- us[(nrow(df.t1)+1):length(us)]
u1 <- us[1:nrow(df.t1)]
# now get changed values and proceed as in alexwhan's answer
ch <- unique(df.t2$id[! u2 %in% u1])
df.t2[df.t2$id %in% ch,]

Here's a slightly different way to get the us variable. I expect it's a little slower, but it takes more care to keep everything as integers instead of floating-point numerics, so uniqueness should be guaranteed, and integer overflow will produce NA with a warning rather than silently losing precision. (I also subtract one from ms; the result is still unique and everything is just a little smaller.)

base <- as.integer(max(ms)^(0:(ncol(ms) - 1)))      # one place value per column
us <- apply(ms - 1L, 1, function(r) sum(r * base))  # integer row-wise sums
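To see why this encoding gives a single distinct number per distinct row, here is a tiny worked example (a hypothetical two-column code matrix, not data from the question): each column holds integer codes starting at 1, and the place values act like digits in base max(ms).

```r
# Hypothetical 2-column code matrix; max code is 2, so place values are 2^0, 2^1
m <- cbind(c(1L, 2L, 1L), c(1L, 1L, 2L))
b <- max(m) ^ (0:(ncol(m) - 1))   # place values: 1, 2
as.vector((m - 1L) %*% b)         # one distinct number per distinct row: 0 1 2
```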
Shedd answered 14/2, 2013 at 3:30 Comment(8)
This solution is the first that actually works with my data. Very grateful that you found this solution. I don't fully understand it yet, but I appreciate that it's there. Thanks!Zoa
Especially with R, a good solution depends on the size of your data set, so while coming up with a small reproducible example is always recommended, it can also really help to state the size of the data set; in this case, not only how many rows and columns, but also how many ids and how many unique things are in each column. (That last would perhaps affect my solution; I worry a little bit about integer overflow.)Shedd
Thank you for following up with some really sound advice. I will add that information to my post later today (and in future questions). Do you have any good sources where I can learn more about integer overflow?Zoa
#8805279Shedd
PS. Did @alexwhan's answer work better with paste instead of interact? If that works this complication is unnecessary...Shedd
The paste doesn't really work in my case. Your tweak of us did not work for me, so I am using your solution as is, and so far it works great, but I think I should keep looking for a better solution, cf. your warning about integer overflow.Zoa
Well, I don't think anyone here can help with a better solution unless we know more about your specific requirements. I'd suggest creating a data set with random data (using set.seed) that is similar to yours, and starting a new question, referencing this one and asking for ideas on what to do given the larger data set.Shedd
That is excellent advice. I'll do that ASAP. Thank you for all your help and advice.Zoa
I think this would work. A solution using data.table:

require(data.table)
dt1 <- data.table(df.t1)
dt2 <- data.table(df.t2)
# your conversion made them characters
dt2[, `:=`(id = as.numeric(id), Num = as.numeric(Num))]
setkey(dt1, "id", "ABC")
setkey(dt2, "id", "ABC")
dt <- dt1[dt2]
dt2[id %in% dt[, !(all(Num == Num.1)) | any(is.na(c(Num, Num.1))), by=id][V1 == TRUE]$id]

#    id ABC Num
# 1:  1   A 101
# 2:  1   C 104
# 3:  1   H 102
# 4:  3   F 106
# 5:  3   G 107
# 6:  3   J 107
# 7:  5   H 999
# 8:  5   I 109
# 9:  6   J  NA

Alternatively, after obtaining dt = dt1[dt2]:

dt2[id %in% dt[, ctr := Num %in% Num.1, by=1:nrow(dt)][ctr == FALSE, unique(id)]]
Higher answered 14/2, 2013 at 0:52 Comment(3)
Thanks! I could never have come up with that on my own. Again, the issue is that the real data has a lot of columns and it seems I would have to name them all in this function. Regardless, you answered my question.Zoa
So, you would like "any" of the columns that are changed to be selected from df.t2?? If that's the case, I'll edit this post later during the day.Higher
What I am looking for is any row that is changed between time 1 and time 2, and if a row is changed I want all rows relating to that id. I am trying to illustrate this in the last part of my question.Zoa
What about using data.table and keying by all of id, ABC and Num?

require(data.table)
dt1 <- data.table(df.t1)
dt2 <- data.table(df.t2)
# your conversion made them characters
dt2[, `:=`(id = as.numeric(id), Num = as.numeric(Num))]
setkey(dt1, "id", "ABC", "Num")
setkey(dt2, "id", "ABC", "Num")
# then it is just

dt2[dt2[!dt1][,list(unique(id))]]


   id ABC Num
1:  1   A 101
2:  1   C 104
3:  1   H 102
4:  3   F 106
5:  3   G 107
6:  3   J 107
7:  5   H 999
8:  5   I 109
9:  6   J  NA

This uses a not-join between dt2 and dt1, selects the unique id values from the result, and then subsets dt2 appropriately.

with the caveat that NA values in keys can be problematic...
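For comparison, the same not-join idea can be sketched in base R by pasting each row's columns into one key string (a rough sketch assuming the df.t1/df.t2 from the question; note that paste keys can collide if column values happen to contain the separator):

```r
# one key string per row, built from all columns
key1 <- do.call(paste, df.t1)
key2 <- do.call(paste, df.t2)
# ids owning at least one row whose key does not appear at time 1
changed.ids <- unique(df.t2$id[!key2 %in% key1])
df.t2[df.t2$id %in% changed.ids, ]   # drops only id 2 for the example data
```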

Mabe answered 14/2, 2013 at 1:12 Comment(5)
Yes, but that is a merge. He needs those that are not equal between the two. For ex: you get here id=2 which shouldn't be there.Higher
Good point @Arun, I've had a rethink and edited. I can't make the NA cause an issue, but it might in certain situations.Mabe
(+1) Brilliant! I came "close" with dt2[!dt1] but didn't think of this!! I was sure there must be an equivalent of setdiff (for vectors) in data.table.Higher
@mnel, does this entail that I would need to set all 34 column names as setkey in my real data?Zoa
Yes, you could do setkeyv(DT, names(DT)) to avoid having to spell them all out (and ensuring that id is column number 1)Mabe
Try this:

df.t1$interact <- with(df.t1, interaction(id, ABC, Num))
df.t2$interact <- with(df.t2, interaction(id, ABC, Num))

change.ids <- unique(df.t2$id[!df.t2$interact %in% df.t1$interact])
new.df <- df.t2[df.t2$id %in% change.ids,]
Athodyd answered 14/2, 2013 at 0:40 Comment(9)
Thank you for responding to my question. It seems to work after your update, only problem is that I have to define id, ABC, and Num in interaction(), but I guess I can use names().Zoa
What do you want to happen? How is your data structured?Athodyd
You do solve the problem; the issue is that in the real data I have 34 columns and some of them have text in them, so pasting it all together into interact might not be the most optimal solution. Can you think of a way where I can compare it row by row?Zoa
If it's manually naming the columns you're worried about, this might work: df.t1$interact <- with(df.t1, interaction(df.t1[, 1:ncol(df.t1)])) but it's gonna be uglyAthodyd
I like the interaction(df.t1[, 1:ncol(df.t1)]) approach. It might be ugly, but it makes it more flexible.Zoa
I'm trying this solution, but it's killing my computer. I believe it's the combining everything into one variable. I think I need a solution where i do it row by row.Zoa
OK, have a go at the looping idea I posted belowAthodyd
Is it faster if you use paste instead of interaction? interaction is making many more factor levels than you need.Shedd
Also, perhaps generating a unique number would be faster? See my attempt here: https://mcmap.net/q/1936582/-remove-cases-all-id-39-s-for-cases-where-no-changes-have-occured-between-time-1-and-time-2Shedd
OK, here's an alternative with some looping. I haven't tried it on a more complex example, but have a look:

no.change <- vector()
# identify rows which don't change
for (i in 1:nrow(df.t2)) {
  for (k in 1:nrow(df.t1)) {
    # isTRUE() avoids the "missing value where TRUE/FALSE needed" error
    # and makes a row containing NA count as changed
    if (isTRUE(all(df.t2[i, ] == df.t1[k, ])))
      no.change <- c(no.change, i)
  }
}
# get the ids from all the other rows
id.change <- df.t2$id[-no.change]
df <- df.t2[df.t2$id %in% id.change, ]
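Regarding the comments below about growing no.change inside the loop, a preallocated variant of the same double loop might look like this (a sketch using the question's df.t1/df.t2; isTRUE() makes a row containing NA count as changed):

```r
no.change <- logical(nrow(df.t2))            # preallocate one flag per row
for (i in seq_len(nrow(df.t2))) {
  for (k in seq_len(nrow(df.t1))) {
    if (isTRUE(all(df.t2[i, ] == df.t1[k, ])))
      no.change[i] <- TRUE                   # row i also exists at time 1
  }
}
id.change <- unique(df.t2$id[!no.change])
df <- df.t2[df.t2$id %in% id.change, ]       # keeps every id with some change
```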
Athodyd answered 14/2, 2013 at 1:45 Comment(5)
Thanks for giving it another go!Zoa
I get this error Error in if (all(df.t2[i, ] == df.t1[k, ])) no.change <- c(no.change, : missing value where TRUE/FALSE neededZoa
This falls into the 2nd circle of hell. Never grow objects in that way!Mabe
I know I know (you should see some of the truuuuly ugly stuff) - all too often "it just works for right now" makes the gradeAthodyd
@alexwhan, but … well. I don't know how I feel about solving something with code that returns an error. Regardless, you are right.Zoa
