How to replace NA with mean by group / subset?

I have a dataframe with the lengths and widths of various arthropods from the guts of salamanders. Because some guts had thousands of certain prey items, I only measured a subset of each prey type. I now want to replace each unmeasured individual with the mean length and width for that prey. I want to keep the dataframe and just add imputed columns (length2, width2). The main reason is that each row also has columns with data on the date and location the salamander was collected. I could fill in the NA with a random selection of the measured individuals but for the sake of argument let's assume I just want to replace each NA with the mean.

For example imagine I have a dataframe that looks something like:

id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA

In reality I have more columns, about 25 different taxa, and ~30,000 prey items in total. It seems like the plyr package might be ideal for this, but I just can't figure out how to do it. I'm not very R or programming savvy, but I'm trying to learn.

Not that I know what I'm doing but I'll try to create a small dataset to play with if it helps.

exampleDF <- data.frame(id = seq(1:100), taxa = c(rep("collembola", 50), rep("mite", 25), 
rep("ant", 25)), length = c(rnorm(40, 1, 0.5), rep("NA", 10), rnorm(20, 0.8, 0.1), rep("NA", 
5), rnorm(20, 2.5, 0.5), rep("NA", 5)), width = c(rnorm(40, 0.5, 0.25), rep("NA", 10), 
rnorm(20, 0.3, 0.01), rep("NA", 5), rnorm(20, 1, 0.1), rep("NA", 5)))

Here are a few things I've tried (that haven't worked):

# mean imputation to recode NA in length and width with means
# (could do random imputation but unnecessary here)
mean.imp <- function(x) { 
  missing <- is.na(x) 
  n.missing <-sum(missing) 
  x.obs <- x[!missing] 
  imputed <- x 
  imputed[missing] <- mean(x.obs) 
  return (imputed) 
  } 

mean.imp(exampleDF[exampleDF$taxa == "collembola", "length"])

n.taxa <- length(unique(exampleDF$taxa))
for(i in 1:n.taxa) {
  mean.imp(exampleDF[exampleDF$taxa == unique(exampleDF$taxa[i]), "length"])
} # no way to get back into dataframe in proper places, try plyr? 

Another attempt:

imp.mean <- function(x) {
  a <- mean(x, na.rm = TRUE)
  return (ifelse (is.na(x) == TRUE , a, x)) 
 } # tried but not sure how to use this in ddply

Diet2 <- ddply(exampleDF, .(taxa), transform, length2 = function(x) {
  a <- mean(exampleDF$length, na.rm = TRUE)
  return (ifelse (is.na(exampleDF$length) == TRUE , a, exampleDF$length)) 
  })

Any suggestions?

Bloodshot answered 17/2, 2012 at 4:10 Comment(2)
You should consider package mice for imputing values. - Gayn
The mi package is also quite good. Amelia is much quicker than either mice or mi, but it does rely on your variables being multivariate normal. - Trantrance

Not my own technique; I saw it on the boards a while back:

dat <- read.table(text = "id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA", header=TRUE)


library(plyr)
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
dat2 <- ddply(dat, ~ taxa, transform, length = impute.mean(length),
     width = impute.mean(width))

dat2[order(dat2$id), ] #plyr orders by group so we have to reorder

Edit: A non-plyr approach with a for loop:

for (i in which(sapply(dat, is.numeric))) {
    for (j in which(is.na(dat[, i]))) {
        dat[j, i] <- mean(dat[dat[, "taxa"] == dat[j, "taxa"], i],  na.rm = TRUE)
    }
}

Edit: Many moons later, here are data.table and dplyr approaches:

data.table

library(data.table)
setDT(dat)

dat[, length := impute.mean(length), by = taxa][,
    width := impute.mean(width), by = taxa]

dplyr

library(dplyr)

dat %>%
    group_by(taxa) %>%
    mutate(
        length = impute.mean(length),
        width = impute.mean(width)  
    )
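
The question asked for new imputed columns (length2, width2) rather than overwriting the originals, and a comment below asks about handling many columns at once. A minimal sketch of one way to do both, assuming dplyr 1.0.0 or later (where across() is available); the name dat_imputed is only illustrative:

```r
library(dplyr)

# Same toy data as at the top of this answer
dat <- read.table(text = "id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header = TRUE)

impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

dat_imputed <- dat %>%
  group_by(taxa) %>%
  # across() applies impute.mean to each listed column within each group;
  # .names = "{.col}2" writes the results to new columns length2 and width2
  mutate(across(c(length, width), impute.mean, .names = "{.col}2")) %>%
  ungroup()
```

The original length and width columns keep their NA values, so it stays visible which rows were measured and which were imputed.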
Sugar answered 17/2, 2012 at 4:38 Comment(3)
@Bloodshot Thank Hadley; I found out where I stole this from: (LINK) - Sugar
How do I impute using mutate if I have too many columns to do it individually? - Alsworth
@JyothsnaHarithsa I'd most likely use mutate_if. Also see mutate_at and mutate_all. - Sugar

Several other options:

1) with data.table's new nafill function

library(data.table)
setDT(dat)

cols <- c("length", "width")

dat[, (cols) := lapply(.SD, function(x) nafill(x, type = "const", fill = mean(x, na.rm = TRUE)))
    , by = taxa
    , .SDcols = cols][]

2) with zoo's na.aggregate function

library(zoo)
library(data.table)
setDT(dat)

cols <- c("length", "width")

dat[, (cols) := lapply(.SD, na.aggregate)
    , by = taxa
    , .SDcols = cols][]

The default function for na.aggregate is mean; if you want to use another function, specify it with the FUN parameter (example: FUN = median). See also the help file with ?na.aggregate.
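
To illustrate the FUN parameter, here is a small sketch on a made-up skewed vector (x is not from the question's data), where the mean and the median give visibly different fill values:

```r
library(zoo)

# Made-up vector with one missing value and one large outlier
x <- c(1, NA, 2, 10)

# Default: the NA is replaced by the mean of the observed values
filled_mean <- na.aggregate(x)

# FUN = median: the NA is replaced by the median of the observed values
filled_median <- na.aggregate(x, FUN = median)
```

The same FUN argument can be passed through lapply in the grouped data.table call, e.g. lapply(.SD, na.aggregate, FUN = median).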

Of course you can also use this in the tidyverse:

library(dplyr)
library(zoo)

dat %>% 
  group_by(taxa) %>% 
  mutate_at(cols, na.aggregate)
Romanticism answered 28/10, 2019 at 14:50 Comment(0)

Before answering this, I want to say that I am a beginner in R. Hence, please let me know if you feel my answer is wrong.

Code:

DF[is.na(DF$length), "length"] <- mean(na.omit(DF$length))

and apply the same for width.

DF stands for the name of the data.frame.
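
Note that the one-liner above fills every NA with the overall mean of the column, not the mean of each taxa group as the question asks. A minimal base R sketch of a group-wise variant, assuming DF has a taxa column; the toy DF and the names group_mean and missing are only for illustration:

```r
# Toy data frame standing in for DF
DF <- data.frame(
  taxa   = c("collembola", "mite", "mite", "collembola", "collembola", "mite"),
  length = c(2.1, 0.9, 1.1, NA, 1.5, NA)
)

# ave() applies FUN within each level of taxa and returns one value per row,
# so group_mean holds each row's own group mean
group_mean <- ave(DF$length, DF$taxa, FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries, leaving measured values untouched
missing <- is.na(DF$length)
DF$length[missing] <- group_mean[missing]
```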

Thanks, Parthi

Goniometer answered 2/9, 2015 at 14:10 Comment(0)

R-base

Another R-base approach relying on vapply() + ave().

Class Coercion

> vapply(X = exampleDF, FUN = class, FUN.VALUE = "integer")
         id        taxa      length       width 
  "integer" "character" "character" "character" 

As the columns on which mean imputation should be performed are of class "character", we coerce them to numeric beforehand:

exampleDF[, c("length", "width")] <- 
  lapply(exampleDF[, c("length", "width")], as.numeric)

Approach

# exampleDF[, c("length", "width")] <- 
vapply(X = exampleDF[, c("length", "width")], 
       FUN = \(x) {
         ave(x = x, 
             exampleDF[, "taxa"], # grouping 
             FUN = \(y) {
               y[is.na(y)] <- mean(y, na.rm = TRUE) 
               y 
               }
             )
       },
       FUN.VALUE = numeric(length = nrow(exampleDF))
       )

OP's data example

exampleDF <- 
  data.frame(id = seq(1:100), 
             taxa = c(rep("collembola", 50), rep("mite", 25), rep("ant", 25)), 
             length = c(rnorm(40, 1, 0.5), rep("NA", 10), 
                        rnorm(20, 0.8, 0.1), rep("NA", 5), 
                        rnorm(20, 2.5, 0.5), rep("NA", 5)), 
             width = c(rnorm(40, 0.5, 0.25), rep("NA", 10), 
                       rnorm(20, 0.3, 0.01), rep("NA", 5), 
                       rnorm(20, 1, 0.1), rep("NA", 5)))

Wrapped in a small function:

# In contrast to aggregate() and ave(), 
# this wrapper does not use non-standard evaluation 
impute = \(x, by, data) {
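  # stop unless data is a data.frame and every column named in x is numeric
  if (!is.data.frame(data) || !all(vapply(data[x], is.numeric, logical(1)))) stop("`data` must be a data.frame and every column in `x` must be numeric")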
  data[x] =
    vapply(data[x], \(i) {
      ave(x = i, data[by], # grouping 
          FUN = \(y) { y[is.na(y)] = mean(y, na.rm = TRUE); y })},
      FUN.VALUE = numeric(length = nrow(data))
    )
  return(data)
}

and applied to the data from @TylerRinker's answer:

> impute(x = c("length", "width"), by = "taxa", data = dat)
   id       taxa length width
1 101 collembola    2.1  0.90
2 102       mite    0.9  0.70
3 103       mite    1.1  0.80
4 104 collembola    1.8  0.70
5 105 collembola    1.5  0.50
6 106       mite    1.0  0.75

Data from @TylerRinker's answer:

dat <- read.table(text = "id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA", header = TRUE)
Foreshank answered 16/11, 2023 at 12:14 Comment(0)

Expanding on @Tyler Rinker's solution, suppose features are the columns to impute. In this case features <- c('length', 'width'). Then using data.table the solution becomes:

library(data.table)
setDT(dat)

dat[, (features) := lapply(.SD, impute.mean), by = taxa, .SDcols = features]
Baro answered 7/1, 2017 at 3:59 Comment(0)

I came across a similar problem; here is a very simple way to fill in the group-wise mean for a column with mutate.

library(dplyr)  # group_by(), mutate() and %>% come from dplyr, not tidyr

dataset <- dataset %>%
  group_by(taxa) %>%
  mutate(length1 = ifelse(is.na(length), mean(length, na.rm = TRUE), length))

View(dataset)

Let me know if I can be of any further help.

Telesis answered 8/12, 2020 at 7:48 Comment(0)
