I have a dataframe with a column ID as an identifier and some other columns of different types (factors and numerics). It looks like this
df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
abst = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "rot")),
gier = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))
Now I want to combine the duplicate IDs. The numeric columns are defined as the mean value of all identical IDs (without the NAs!). The factor columns are combined into one. The NAs can be omitted.
In the end the result should look like this
dfRes <- data.frame(id = c(1, 2, 3, 4),
abst = c(1, 0, 0, 3),
farbe = as.factor(c("keine", "keine", "keine", "rot")),
gier = c(2.5, 0, 0, 3))
I hope there is a way to do the calculations fast, because I have about a million observations. Thanks in advance!
Edit (additions): "farbe" may not be unique. In this case I think the best idea for my data is to have a duplicate row but only with a different "farbe", so there are 2 identical IDs and all the same but different values for "farbe". This should be just very rare case, but a great addition.
I have a lot more numeric and factor columns in my real data. Is it also possible to create a solution, so I don't have to define every single column?
farbe
always be the same for the sameid
? – Alliancefarbe
(at least mine does). If not, you are the only one who knows what to do with different values. After that we could modify the answers. – Glossematics