R combining duplicate rows in a time series with different column types in a datatable
Asked Answered
P

2

9

This question is building up on another question R combining duplicate rows by ID with different column types in a dataframe. I have a datatable with a column time and some other columns of different types (factors and numerics). Here is an example:

dt <- data.table(time  = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
             abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
             farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "blau")),
             gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2),
             goff  = as.factor(c("haus", "maus", "toll", NA, "haus", NA, "maus", NA, NA, NA, NA, NA, "maus")),
             huft  = as.factor(c(NA, NA, NA, NA, NA, "wolle", NA, NA, "wolle", NA, NA, "holz", NA)),
             mode  = c(4, 2, NA, NA, 6, 5, 0, NA, NA, NA, NA, NA, 3))

Now I want to combine the duplicate times in column time. The numeric columns are defined as the mean value of all identical IDs (without the NAs!). The factor columns are combined into one. The NAs can be omitted.

dtRes <- data.table(time  = c(1, 1, 1, 2, 3, 4, 4),
                abst  = c(1, 1, 1, 0, 0, 3, 3),
                farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
                gier  = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
                goff  = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
                huft  = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
                mode  = c(4, 4, 4, 2.5, NA, 3, 3))

I need some fast calculation for this, because I have about a million observations.

Some extra thoughts to this problem: farbe may not be unique. In this case I think the best idea for my data is to have a duplicate row but only with a different farbe, so there are 2 identical times and all the rest stays the same but different values for farbe. This should be just very rare case, but would be a great addition.

Also: I have a lot more numeric and factor columns in my real data so I don't want to define every single column separately. In some data tables there are no factor columns. So the solution has to work even if there are no numeric (time is always there and numeric) or factor columns.

Thx in advance!

Perfect answered 18/5, 2020 at 18:15 Comment(5)
I do not understand the rules for dealing with non-unique factor columns (even if rare, the rules must be clear). E.g., what is your expected result, if you would add the row structure(list(time = 4, abst = 5, farbe = structure(3L, .Label = c("blau", "keine", "rot"), class = "factor"), gier = 5, goff = structure(3L, .Label = c("haus", "maus", "toll"), class = "factor"), huft = structure(2L, .Label = c("holz", "wolle"), class = "factor"), mode = 5), row.names = c(NA, -1L ), class = c("data.table", "data.frame")) to dt?Ismaelisman
IIUC, Your expected result assumes that there is at most one non-unique factor column per time (after NAs have been removed). What is your expected result if there are two or more non-unique factor columns?Ismaelisman
@Perfect What is wrong with my solution.Owe
@akrun, your solution is great and not wrong. But if you want to use less packages and don't want to define all the factor columns, the other solution by Uwe is more convenient.Perfect
My point is that the other solution is almost similar to mine. Anyway, it is kind of annoying when somebody makes a small change and gets the markOwe
I
3

The expected result (for the given sample dataset) can also be achieved without a subsequent call to separate_rows() or cSplit():

library(data.table) # version 1.12.9
dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
            else unlist(na.omit(unique(x)))), by = time]

   time abst farbe gier goff  huft mode
1:    1    1 keine  2.5 haus  <NA>  4.0
2:    1    1 keine  2.5 maus  <NA>  4.0
3:    1    1 keine  2.5 toll  <NA>  4.0
4:    2    0 keine  0.0 maus wolle  2.5
5:    3    0 keine  0.0 <NA> wolle  NaN
6:    4    3   rot  3.0 maus  holz  3.0
7:    4    3  blau  3.0 maus  holz  3.0

Please, note that this approach will work for an arbitrary mix of numeric and factor columns; no column names need to be stated explicitly.


However, I do believe the correct answer to the underlying problem is to return one row per time instead of a kind of partial aggregate (your mileage may vary, of course):

dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
                   else list(na.omit(unique(x)))), by = time]
   time abst    farbe gier           goff  huft mode
1:    1    1    keine  2.5 haus,maus,toll        4.0
2:    2    0    keine  0.0           maus wolle  2.5
3:    3    0    keine  0.0                wolle  NaN
4:    4    3 rot,blau  3.0           maus  holz  3.0

Please, note that list() instead of toString() has been used to aggregate the factor columns. This has the benefit to avoid problems in case one of the factor levels includes a comma , by chance. Furthermore, it is easier to identify cases with non-unique factors per time in a large production dataset:

# compute aggregate as before
dtRes <- dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
                   else list(na.omit(unique(x)))), by = time]
# find cases with non-unique factors per group
# note .SDcols = is.list is available with data.table version 1.12.9
tmp <- dtRes[, which(Reduce(sum, lapply(.SD, function(x) lengths(x) > 1L)) > 0), .SDcols = is.list, by = time]
tmp
   time V1
1:    1  1
2:    4  1
# show affected rows
dtRes[tmp, on = "time"]
   time abst    farbe gier           goff huft mode V1
1:    1    1    keine  2.5 haus,maus,toll         4  1
2:    4    3 rot,blau  3.0           maus holz    3  1
# show not affected rows
dtRes[!tmp, on = "time"]
   time abst farbe gier goff  huft mode
1:    2    0 keine    0 maus wolle  2.5
2:    3    0 keine    0      wolle  NaN
Ismaelisman answered 30/5, 2020 at 8:57 Comment(8)
thank you very much! The first part of your solution is easy to understand and is perfect for me. Your "believed" solution is also great. Maybe it will help me or anyone else later.Perfect
is it possible to aggregate dtRes after the use of your solution, so you take like the mean for every column of every 5 rows (not for the time column)?Perfect
@Bolle, Are you looking for a rolling mean with a window size of 5, i.e., rows 1 to 5, 2 to 6, 3 to 7, etc. or do you want to aggregate rows in chunks of 5, i.e., rows 1 to 5, 6 to 10, 11 to 15, etc.?Ismaelisman
rows in chunks. Is it better to convert the data table into a ts object for this? I don't have any dates or something like that but only the time in seconds like in my posted example, but with missing frequencies like 1, 2, 5, 6, 7, 9 etc. So maybe an aggregation makes sensePerfect
@Bolle, If there are gaps in the time series time, there are two options to aggregate in chunks: (a) using chunks of 5 rows (regardless of any gaps in time), or (b) taking a chunk size of 5 seconds (with a variable number of rows - even 0 if there are no rows within a 5 second period). What is your choice, please?Ismaelisman
(b) is better I think. I don't know if it is important but sometimes there are milliseconds in time like 1.2, 1.4, 2, 2.6 etc.Perfect
OK, I see. Seems to be an irregular time series. Please, can you search SO for matching questions (I believe this problem is not uncommon), or ask a new question with a good minimal reproducible example - Thank you.Ismaelisman
[convert irregular time series] (#62154498)Perfect
O
7

We can do a group by mean

library(data.table)
library(tidyr)
library(dplyr)
dt[, lapply(.SD, function(x) if(is.numeric(x)) mean(x, na.rm = TRUE)
     else toString(unique(x[!is.na(x)]))), .(time)] %>%
     separate_rows(farbe, goff)
# A tibble: 7 x 7
#   time  abst farbe  gier goff   huft     mode
#  <dbl> <dbl> <chr> <dbl> <chr>  <chr>   <dbl>
#1     1     1 keine   2.5 "haus" ""        4  
#2     1     1 keine   2.5 "maus" ""        4  
#3     1     1 keine   2.5 "toll" ""        4  
#4     2     0 keine   0   "maus" "wolle"   2.5
#5     3     0 keine   0   ""     "wolle" NaN  
#6     4     3 rot     3   "maus" "holz"    3  
#7     4     3 blau    3   "maus" "holz"    3  

Or with cSplit

library(splitstackshape)
cSplit(dt[, lapply(.SD, function(x) if(is.numeric(x)) 
    mean(x, na.rm = TRUE) else toString(unique(x[!is.na(x)]))), .(time)], 
    c('farbe', 'goff'), sep= ',\\s*', 'long', fixed = FALSE)
#   time abst farbe gier goff  huft mode
#1:    1    1 keine  2.5 haus        4.0
#2:    1    1  <NA>  2.5 maus        4.0
#3:    1    1  <NA>  2.5 toll        4.0
#4:    2    0 keine  0.0 maus wolle  2.5
#5:    3    0 keine  0.0 <NA> wolle  NaN
#6:    4    3   rot  3.0 maus  holz  3.0
#7:    4    3  blau  3.0 <NA>  holz  3.0
Owe answered 18/5, 2020 at 18:27 Comment(10)
this solution works, but is not really good if you have a lot more columns. I don't want to define every single column.Perfect
@Perfect if youu have more columns, can you update the post. thanksOwe
@Perfect in your expected output, the 'abst' is 1 for 'time' 1 where as in the input, there is not 1 for 'abst' for time' = 1. Can you please checkOwe
you are absolutely right. I updated the example again. Now the result should be right ;)Perfect
@Perfect II checked the example, 'dt' is still having 0 NA 2 NA NA for 'time' 1Owe
yeah, but I don't see where the problem is with that ((0 + 2) / 2 = 1). I updated dtRes instead into the expected output. I tried it with your solution and it worked. But I need something where I don't have to define every single column ;)Perfect
@Perfect sorry, I was thinking that 'abst' is unique. So sorryOwe
it is no problem. In the end I have found a mistake in my expected output because of your comment, but everything should be fine now :)Perfect
@Perfect Updated the post. can you check nowOwe
@Perfect I added two optionsOwe
I
3

The expected result (for the given sample dataset) can also be achieved without a subsequent call to separate_rows() or cSplit():

library(data.table) # version 1.12.9
dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
            else unlist(na.omit(unique(x)))), by = time]

   time abst farbe gier goff  huft mode
1:    1    1 keine  2.5 haus  <NA>  4.0
2:    1    1 keine  2.5 maus  <NA>  4.0
3:    1    1 keine  2.5 toll  <NA>  4.0
4:    2    0 keine  0.0 maus wolle  2.5
5:    3    0 keine  0.0 <NA> wolle  NaN
6:    4    3   rot  3.0 maus  holz  3.0
7:    4    3  blau  3.0 maus  holz  3.0

Please, note that this approach will work for an arbitrary mix of numeric and factor columns; no column names need to be stated explicitly.


However, I do believe the correct answer to the underlying problem is to return one row per time instead of a kind of partial aggregate (your mileage may vary, of course):

dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
                   else list(na.omit(unique(x)))), by = time]
   time abst    farbe gier           goff  huft mode
1:    1    1    keine  2.5 haus,maus,toll        4.0
2:    2    0    keine  0.0           maus wolle  2.5
3:    3    0    keine  0.0                wolle  NaN
4:    4    3 rot,blau  3.0           maus  holz  3.0

Please, note that list() instead of toString() has been used to aggregate the factor columns. This has the benefit to avoid problems in case one of the factor levels includes a comma , by chance. Furthermore, it is easier to identify cases with non-unique factors per time in a large production dataset:

# compute aggregate as before
dtRes <- dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
                   else list(na.omit(unique(x)))), by = time]
# find cases with non-unique factors per group
# note .SDcols = is.list is available with data.table version 1.12.9
tmp <- dtRes[, which(Reduce(sum, lapply(.SD, function(x) lengths(x) > 1L)) > 0), .SDcols = is.list, by = time]
tmp
   time V1
1:    1  1
2:    4  1
# show affected rows
dtRes[tmp, on = "time"]
   time abst    farbe gier           goff huft mode V1
1:    1    1    keine  2.5 haus,maus,toll         4  1
2:    4    3 rot,blau  3.0           maus holz    3  1
# show not affected rows
dtRes[!tmp, on = "time"]
   time abst farbe gier goff  huft mode
1:    2    0 keine    0 maus wolle  2.5
2:    3    0 keine    0      wolle  NaN
Ismaelisman answered 30/5, 2020 at 8:57 Comment(8)
thank you very much! The first part of your solution is easy to understand and is perfect for me. Your "believed" solution is also great. Maybe it will help me or anyone else later.Perfect
is it possible to aggregate dtRes after the use of your solution, so you take like the mean for every column of every 5 rows (not for the time column)?Perfect
@Bolle, Are you looking for a rolling mean with a window size of 5, i.e., rows 1 to 5, 2 to 6, 3 to 7, etc. or do you want to aggregate rows in chunks of 5, i.e., rows 1 to 5, 6 to 10, 11 to 15, etc.?Ismaelisman
rows in chunks. Is it better to convert the data table into a ts object for this? I don't have any dates or something like that but only the time in seconds like in my posted example, but with missing frequencies like 1, 2, 5, 6, 7, 9 etc. So maybe an aggregation makes sensePerfect
@Bolle, If there are gaps in the time series time, there are two options to aggregate in chunks: (a) using chunks of 5 rows (regardless of any gaps in time), or (b) taking a chunk size of 5 seconds (with a variable number of rows - even 0 if there are no rows within a 5 second period). What is your choice, please?Ismaelisman
(b) is better I think. I don't know if it is important but sometimes there are milliseconds in time like 1.2, 1.4, 2, 2.6 etc.Perfect
OK, I see. Seems to be an irregular time series. Please, can you search SO for matching questions (I believe this problem is not uncommon), or ask a new question with a good minimal reproducible example - Thank you.Ismaelisman
[convert irregular time series] (#62154498)Perfect

© 2022 - 2024 — McMap. All rights reserved.