This question is building up on another question R combining duplicate rows by ID with different column types in a dataframe. I have a datatable with a column time
and some other columns of different types (factors and numerics). Here is an example:
dt <- data.table(time = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
abst = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "blau")),
gier = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2),
goff = as.factor(c("haus", "maus", "toll", NA, "haus", NA, "maus", NA, NA, NA, NA, NA, "maus")),
huft = as.factor(c(NA, NA, NA, NA, NA, "wolle", NA, NA, "wolle", NA, NA, "holz", NA)),
mode = c(4, 2, NA, NA, 6, 5, 0, NA, NA, NA, NA, NA, 3))
Now I want to combine the duplicate times in column time
. The numeric columns are defined as the mean value of all identical IDs (without the NAs!). The factor columns are combined into one. The NAs can be omitted.
dtRes <- data.table(time = c(1, 1, 1, 2, 3, 4, 4),
abst = c(1, 1, 1, 0, 0, 3, 3),
farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
gier = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
goff = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
huft = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
mode = c(4, 4, 4, 2.5, NA, 3, 3))
I need some fast calculation for this, because I have about a million observations.
Some extra thoughts to this problem: farbe
may not be unique. In this case I think the best idea for my data is to have a duplicate row but only with a different farbe
, so there are 2 identical times and all the rest stays the same but different values for farbe
. This should be just very rare case, but would be a great addition.
Also: I have a lot more numeric and factor columns in my real data so I don't want to define every single column separately. In some data tables there are no factor columns. So the solution has to work even if there are no numeric (time
is always there and numeric) or factor columns.
Thx in advance!
structure(list(time = 4, abst = 5, farbe = structure(3L, .Label = c("blau", "keine", "rot"), class = "factor"), gier = 5, goff = structure(3L, .Label = c("haus", "maus", "toll"), class = "factor"), huft = structure(2L, .Label = c("holz", "wolle"), class = "factor"), mode = 5), row.names = c(NA, -1L ), class = c("data.table", "data.frame"))
todt
? – Ismaelismantime
(afterNA
s have been removed). What is your expected result if there are two or more non-unique factor columns? – Ismaelisman