R combining duplicate rows by ID with different column types in a dataframe

Asked 14/5, 2020 at 13:8 Answered 14/5, 2020 at 13:49

r dataframe data.table aggregate tidyverse

I have a dataframe with a column ID as an identifier and some other columns of different types (factors and numerics). It looks like this

df <- data.frame(id    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "rot")),
                 gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

Now I want to combine the duplicate IDs. The numeric columns are defined as the mean value of all identical IDs (without the NAs!). The factor columns are combined into one. The NAs can be omitted.

In the end the result should look like this

dfRes <- data.frame(id    = c(1, 2, 3, 4),
                    abst  = c(1, 0, 0, 3),
                    farbe = as.factor(c("keine", "keine", "keine", "rot")),
                    gier  = c(2.5, 0, 0, 3))

I hope there is a way to do the calculations fast, because I have about a million observations. Thanks in advance!

Edit (additions): "farbe" may not be unique. In this case I think the best idea for my data is to have a duplicate row but only with a different "farbe", so there are 2 identical IDs and all the same but different values for "farbe". This should be just very rare case, but a great addition.

I have a lot more numeric and factor columns in my real data. Is it also possible to create a solution, so I don't have to define every single column?

Ericson answered 14/5, 2020 at 13:8 Comment(4)

Will farbe always be the same for the same id? – Alliance 14/5, 2020 at 13:11

good question, I hope so. Otherwise it is not that big of a problem. But there has to be a decision then, which of the values should be used. – Ericson 14/5, 2020 at 13:51

I think all the solutions posted so far assume unique farbe (at least mine does). If not, you are the only one who knows what to do with different values. After that we could modify the answers. – Glossematics 14/5, 2020 at 14:26

You are right. In this case I think the best idea for my data is to have a duplicate row but only with a different "farbe". This should be just very rare cases and thats fine. – Ericson 14/5, 2020 at 14:39

EDIT:

Just saw your edit about non-unique factor columns and selecting columns by type. This will work but I will think about a cleaner way to do this and report back (I am sure there is a simple way). If you want to manually specify columns like the original example and you have non-unique factors, just use unlist() with unique() in the same fashion as below. Alternatively, you could consider combining all factor levels on one line using paste() with collapse = "; " or something to that effect. If you want to change the column order for the final data.table, use setcolorder() on the data.table

setDT(df)

# For selecting columns later
num_cols <- sapply(df, is.numeric)
num_cols[names(num_cols) == "id"] <- FALSE
fac_cols <- sapply(df, is.factor)

df[, lapply(.SD, mean, na.rm = T), by = id, .SDcols = num_cols][
  df[, lapply(.SD, function(i) unlist(unique(i[!is.na(i)]))), by = id, .SDcols = fac_cols], on = "id"]

   id abst gier farbe
1:  1    1  2.5 keine
2:  2    0  0.0 keine
3:  3    0  0.0 keine
4:  4    3  3.0  rot2
5:  4    3  3.0   rot

How it works: It joins the numeric column summary

df[, lapply(.SD, mean, na.rm = TRUE), by = id, .SDcols = num_cols]

with the factor column summary

df[, lapply(.SD, function(i) unlist(unique(i[!is.na(i)]))), by = id, .SDcols = fac_cols]

Data for edit:

df <- data.frame(id    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, "rot2", "rot", "rot")),
                 gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

Original answer:

Here is one of many data.table solutions. This orders the data.table by the factor column so it can grab the top value while summarizing. I also converted it back to a pure data.frame but you do not have to do that if you do not want to. Hope this helps!

Also, this assumes that farbe will be the same for each id

library(data.table)

setDT(df)

df <- df[order(farbe), .(abst = mean(abst, na.rm = TRUE),
                         farbe = farbe[1],
                         gier = mean(gier, na.rm = TRUE)), by = id]

setDF(df)
df
  id abst farbe gier
1  1    1 keine  2.5
2  2    0 keine  0.0
3  3    0 keine  0.0
4  4    3   rot  3.0

Alliance answered 14/5, 2020 at 13:25 Comment(7)

why is this data.table solution so much faster (like 1 second for 1 mio observations) than the dplyr solutions (over 10 minutes!)? Is it possible to this for all numeric and factor columns? I have a lot more columns, so it would be easier to not define every single column. Is there a way to remove the NaNs? (not NAs) – Ericson 14/5, 2020 at 14:22

Hey @Bolle, hope the edit helps. Not 100% sure what you mean about the NaN's--you would have to post more data or another question. I am guessing it is because you are taking the mean of a group with only NA values (i.e., mean(c(NA, NA), na.rm = T)). Also, main reason the dplyr solution was probably slower is because it called na.omit() for every column after grouping. Just a guess though. – Alliance 14/5, 2020 at 18:48

@ Andrew you new solution looks nice. I tested it for a datatable with a few more columns. It works but I get a warning: "Item 1 of j's result for group 32133 is zero length. This will be filled with 2 NAs to match the longest column in this result." Is that a big problem? What does is it mean exactly and how can I solve this? – Ericson 15/5, 2020 at 14:19

Hey @Bolle, without seeing the data,I really don't know. I would manually check out the group by adding an index using .GRP and see what's going on. Pre/post summary. Hope this helps! – Alliance 15/5, 2020 at 15:16

I like your solution, it is good to read and understandable. Is there a difference if I tell you it is a time series so the IDs are all different? Because you solution works great but it is kind of slow. There is also a another problem. If you don't have factor columns in your datatable all observations will be empty. – Ericson 18/5, 2020 at 7:43

Hey @Bolle, I would probably post a new questions with some sample data that demonstrates what you are talking about. I will be happy to answer that questions if somebody does not answer before me, but that would be my advice. – Alliance 18/5, 2020 at 10:0

thx! I posted a new question ;) I hope it is understandable combining duplicate rows in a time series with different column types in a datatable – Ericson 18/5, 2020 at 18:17

A dplyr solution.

library(dplyr)

df %>% 
  group_by(id) %>% 
  summarise(abst = mean(na.omit(abst)), 
            farbe = na.omit(farbe)[1],
            gier = mean(na.omit(gier)))
#> # A tibble: 4 x 4
#>      id  abst farbe  gier
#>   <dbl> <dbl> <fct> <dbl>
#> 1     1     1 keine   2.5
#> 2     2     0 keine   0  
#> 3     3     0 keine   0  
#> 4     4     3 rot     3

^{Created on 2020-05-14 by the reprex package (v0.3.0)}

Thibaud answered 14/5, 2020 at 13:15 Comment(2)

nice solution. How can I achieve this if I have a lot of numeric and factor columns? Do I have to define every single column? – Ericson 14/5, 2020 at 13:39

I tried this solution on my data with about 1 mio obervation. it takes over 10 minutes! Why is that? I tried a data.table solution and it took like 1 second. – Ericson 14/5, 2020 at 14:19

Also a data.table solution:

    library(data.table)
    df <- data.table( # instead of data.frame
                     id    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4), 
                     abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                     farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "rot")),
                     gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))


    newdf <- df[,
                .(abst=mean(abst,na.rm=T), # perform mean of abst removing NAs
                  farbe=na.omit(unique(farbe)), # assuming farbe is unique for each ID, extract the unique value after removing the NAs
                  gier=mean(gier,na.rm=T)), # perform mean of gier removing NAs
                by=id] # for each ID

    newdf

       id abst farbe gier
    1:  1    1 keine  2.5
    2:  2    0 keine  0.0
    3:  3    0 keine  0.0
    4:  4    3   rot  3.0

Glossematics answered 14/5, 2020 at 13:19 Comment(0)

additional solution

df %>% 
  group_by(id) %>% 
  fill(farbe, .direction = "updown") %>% 
  group_by(id, farbe) %>% 
  summarise_all(~ mean(., na.rm = T))

Tabb answered 14/5, 2020 at 13:49 Comment(0)

Recommended topics

Hot tags