Calculate rowMeans on a range of column (Variable number)

Asked 27/7, 2016 at 16:5 Answered 27/7, 2016 at 18:47

I want to calculate rowMeans of a range of column but I cannot give the hard-coded value for colnames (e.g c(C1,C3)) or range (e.g. C1:C3) as both names and range are variable. My df looks like:

> df
  chr name age  MGW.1 MGW.2  MGW.3 HEL.1 HEL.2 HEL.3
1 123  abc  12  10.00    19  18.00    12 13.00   -14
2 234  bvf  24 -13.29    13  -3.02    12 -0.12    24
3 376  bxc  17  -6.95    10 -18.00    15  4.00    -4

This is just a sample, in reality I have columns ranging in MGW.1 ... MGW.196 and so. Here Instead of giving the exact colnames or an exact range I want to pass initial of colnames and want to get average of all columns having that initials. Something like: MGW=rowMeans(df[,MGW.*]), HEL=rowMeans(df[,HEL.*])

So my final output should look like:

> df
      chr name age  MGW      Hel
    1 123  abc  12  10.00    19
    2 234  bvf  24  13.29    13
    3 376  bxc  17  -6.95    10

I know these values are not correct but it is just to give you and idea. Secondly I want to remove all those rows from data frame which contains NA in the entire row except the first 3 values.

Here is the dput for sample example:

> dput(df)
structure(list(chr = c(123L, 234L, 376L), name = structure(1:3, .Label = c("abc", 
"bvf", "bxc"), class = "factor"), age = c(12L, 24L, 17L), MGW.1 = c(10, 
-13.29, -6.95), MGW.2 = c(19L, 13L, 10L), MGW.3 = c(18, -3.02, 
-18), HEL.1 = c(12L, 12L, 15L), HEL.2 = c(13, -0.12, 4), HEL.3 = c(-14L, 
24L, -4L)), .Names = c("chr", "name", "age", "MGW.1", "MGW.2", 
"MGW.3", "HEL.1", "HEL.2", "HEL.3"), class = "data.frame", row.names = c(NA, 
-3L))

Viscosity answered 27/7, 2016 at 16:5 Comment(4)

I asked a related question yesterday and the answers might help you out. Here's the link #38595308 – Vaticinal 27/7, 2016 at 16:9

@Vaticinal As I mentioned in my question, I cannot explicitly mention the colnames or indices, because they are variable, sometimes there will be 196 columns for which I want a rowMean and sometimes there will be 198 or so. – Viscosity 27/7, 2016 at 16:15

You can subset df with a "logical" vector of positions where names(df) ?startsWith "MGW" etc. Also, see, ?complete.cases to find rows that contain only NA (after subsetting all but first three columns). – Comparable 27/7, 2016 at 16:15

Can you hardcode the colnames prefix (MGW.*, HEL.*) ? Or do you need to regroup them programmatically ? – Autograph 27/7, 2016 at 18:8

Here's an idea achieving your desired output without hardcoding variable names:

library(dplyr)
library(tidyr)

df %>%
  # remove rows where all values are NA except the first 3 columns
  filter(rowSums(is.na(.[4:length(.)])) != length(.) - 3) %>%
  # gather the data in a tidy format
  gather(key, value, -(chr:age)) %>%
  # separate the key column into label and num allowing 
  # to regroup by variables without hardcoding them
  separate(key, into = c("label", "num")) %>%
  group_by(chr, name, age, label) %>%
  # calculate the mean
  summarise(mean = mean(value, na.rm = TRUE)) %>%
  spread(label, mean)

I took the liberty to modify your initial data to show how the logic would fit special cases. For example, here we have a row (#4) where all values but the first 3 columns are NAs (according to your requirements, this row should be removed) and one where there is a mix of NAs and values (#5). In this case, I assumed we would like to have a result for MGW since there is a value at MGW.1:

#  chr name age  MGW.1 MGW.2  MGW.3 HEL.1 HEL.2 HEL.3
#1 123  abc  12  10.00    19  18.00    12 13.00   -14
#2 234  bvf  24 -13.29    13  -3.02    12 -0.12    24
#3 376  bxc  17  -6.95    10 -18.00    15  4.00    -4
#4 999  zzz  21     NA    NA     NA    NA    NA    NA
#5 888  aaa  12  10.00    NA     NA    NA    NA    NA

Which gives:

#Source: local data frame [4 x 5]
#Groups: chr, name, age [4]
#
#    chr   name   age       HEL       MGW
#* <int> <fctr> <int>     <dbl>     <dbl>
#1   123    abc    12  3.666667 15.666667
#2   234    bvf    24 11.960000 -1.103333
#3   376    bxc    17  5.000000 -4.983333
#4   888    aaa    12       NaN 10.000000

Data

df <- structure(list(chr = c(123L, 234L, 376L, 999L, 888L), name = structure(c(2L, 
3L, 4L, 5L, 1L), .Label = c("aaa", "abc", "bvf", "bxc", "zzz"
), class = "factor"), age = c(12L, 24L, 17L, 21L, 12L), MGW.1 = c(10, 
-13.29, -6.95, NA, 10), MGW.2 = c(19L, 13L, 10L, NA, NA), MGW.3 = c(18, 
-3.02, -18, NA, NA), HEL.1 = c(12L, 12L, 15L, NA, NA), HEL.2 = c(13, 
-0.12, 4, NA, NA), HEL.3 = c(-14L, 24L, -4L, NA, NA)), .Names = c("chr", 
"name", "age", "MGW.1", "MGW.2", "MGW.3", "HEL.1", "HEL.2", "HEL.3"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))

Hypnogenesis answered 27/7, 2016 at 18:47 Comment(0)

Firstly

I think you are looking for this to get mean of rows:

df$mean.Hel <- rowMeans(df[, grep("^HEL.", names(df))])

And to delete the columns afterwards:

df[, grep("^HEL.", names(df))] <- NULL

Secondly

To delete rows which have only NA after the first three elements.

rows.delete <- which(rowSums(!is.na(df)[,4:ncol(df)]) == 0)
df <- df[!(1:nrow(df) %in% rows.delete),]

Candi answered 27/7, 2016 at 16:16 Comment(7)

This will append a new columns Hel.mean I want to delete the individual column from df as well, please refer to my desired output mention above. – Viscosity 27/7, 2016 at 16:19

Added a line to delete the columns. – Candi 27/7, 2016 at 16:23

What do you mean by Secondly... ?? .. I think you are saying that repeat it for MGW and so on all the columns you want, right ? – Viscosity 27/7, 2016 at 16:24

"I want to remove all those rows from data frame which contains NA in the entire row except the first 3 values." I was just going to post a solution to that but it had some problems. – Candi 27/7, 2016 at 16:26

Can you guide me how can I append one more column in df which contains SD of the entire row. I tried to follow ŧhis but it is not working for me. – Viscosity 4/8, 2016 at 8:58

I am trying this

> rowSds_R <- function(x, na.rm=T) { +     suppressWarnings({ +         apply(x, MARGIN=1L, FUN=sd, na.rm=T) +     }) + } > df$HelT.SD <- rowSds_R(df[, grep("^HelT.", names(df))])

and I think it is working. – Viscosity 4/8, 2016 at 9:4

Maybe better to open a new question for this if you don't find solution... – Candi 21/8, 2016 at 17:5

Here's an idea achieving your desired output without hardcoding variable names:

library(dplyr)
library(tidyr)

df %>%
  # remove rows where all values are NA except the first 3 columns
  filter(rowSums(is.na(.[4:length(.)])) != length(.) - 3) %>%
  # gather the data in a tidy format
  gather(key, value, -(chr:age)) %>%
  # separate the key column into label and num allowing 
  # to regroup by variables without hardcoding them
  separate(key, into = c("label", "num")) %>%
  group_by(chr, name, age, label) %>%
  # calculate the mean
  summarise(mean = mean(value, na.rm = TRUE)) %>%
  spread(label, mean)

#  chr name age  MGW.1 MGW.2  MGW.3 HEL.1 HEL.2 HEL.3
#1 123  abc  12  10.00    19  18.00    12 13.00   -14
#2 234  bvf  24 -13.29    13  -3.02    12 -0.12    24
#3 376  bxc  17  -6.95    10 -18.00    15  4.00    -4
#4 999  zzz  21     NA    NA     NA    NA    NA    NA
#5 888  aaa  12  10.00    NA     NA    NA    NA    NA

Which gives:

#Source: local data frame [4 x 5]
#Groups: chr, name, age [4]
#
#    chr   name   age       HEL       MGW
#* <int> <fctr> <int>     <dbl>     <dbl>
#1   123    abc    12  3.666667 15.666667
#2   234    bvf    24 11.960000 -1.103333
#3   376    bxc    17  5.000000 -4.983333
#4   888    aaa    12       NaN 10.000000

Data

df <- structure(list(chr = c(123L, 234L, 376L, 999L, 888L), name = structure(c(2L, 
3L, 4L, 5L, 1L), .Label = c("aaa", "abc", "bvf", "bxc", "zzz"
), class = "factor"), age = c(12L, 24L, 17L, 21L, 12L), MGW.1 = c(10, 
-13.29, -6.95, NA, 10), MGW.2 = c(19L, 13L, 10L, NA, NA), MGW.3 = c(18, 
-3.02, -18, NA, NA), HEL.1 = c(12L, 12L, 15L, NA, NA), HEL.2 = c(13, 
-0.12, 4, NA, NA), HEL.3 = c(-14L, 24L, -4L, NA, NA)), .Names = c("chr", 
"name", "age", "MGW.1", "MGW.2", "MGW.3", "HEL.1", "HEL.2", "HEL.3"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))

Hypnogenesis answered 27/7, 2016 at 18:47 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags