Split dataframe into multiple dataframes by grouping columns
Asked Answered
N

5

7

I have a dataframe of expression data where gene are rows and columns are samples. I also have a dataframe containing metadata for each sample in the expression dataframe. In reality my expr dataframe has 30,000+ rows and 100+ columns. However, below is an example with smaller data.

expr <- data.frame(sample1 = c(1,2,2,0,0), 
                   sample2 = c(5,2,4,4,0), 
                   sample3 = c(1,2,1,0,1), 
                   sample4 = c(6,5,6,6,7), 
                   sample5 = c(0,0,0,1,1))
rownames(expr) <- paste0("gene",1:5)
meta <- data.frame(sample = paste0("sample",1:5),
                   treatment = c("control","control",
                                 "treatment1", 
                                 "treatment2", "treatment2"))

I want to find the mean for each gene per treatment. From the examples I've seen with split() or group_by() people group based on a column already present in the data.frame. However, I have a separate dataframe (meta) that classifies the grouping for the columns in another dataframe (expr).

I would like my output to be a dataframe with genes as rows, treatment as columns, and values as the mean.

#        control   treatment1   treatment2
#  gene1  mean        mean         mean
#  gene2  mean        mean         mean
Nepotism answered 21/12, 2023 at 18:6 Comment(6)
Mean for each gene per treatment? Does this mean for treatment1 simply expr["sample3"] and for treatment2 simply rowMeans(expr[c("sample4", "sample5"])? This can be done with colnames(expr) = with(meta, paste0(sample, "_", treatment)); m1 = expr["sample3_treatment1"]; m2 = rowMeans(expr[c("sample4_treatment2", "sample5_treatment2")])?Plainlaid
@Plainlaid yes! But in reality my problem is much bigger and I have 30,000 rows and 100+ columns. So, I don't want to do this manually.Nepotism
Then I recommend to edit your toy data such that it mimics your real data accuratelyPlainlaid
@Plainlaid strongly disagree. Small sample data that is easy to work with is far preferable. We want minimal reproducible examples. There are exceptions, like if the focus of the question is on computational efficiency then larger data may be needed to get accurate timings. But generally minimal examples are much preferred.Dickerson
@GregorThomas we are talking past each other because I expressed myself badly. To be more precise, I have the feeling that the structure of colnames(expr) and sample names is more complex in the real data at hand compared to the toy data given.Plainlaid
Glad to hear you're not pushing for 30000x100 examples :)Dickerson
D
2

Something like this. It's not entirely clear what you want to group by in the last step, but you can adjust that easily.

library(dplyr)
library(tidyr)

expr |>
  mutate(gene = row.names(expr)) |>
  pivot_longer(-gene, names_to = "sample") |>
  left_join(meta, by = "sample") |>
  summarize(mean = mean(value), .by = c(gene, treatment)) |> 
  pivot_wider(names_from = treatment, values_from = mean)
# # A tibble: 5 × 4
#   gene  control treatment1 treatment2
#   <chr>   <dbl>      <dbl>      <dbl>
# 1 gene1       3          1        3  
# 2 gene2       2          2        2.5
# 3 gene3       3          1        3  
# 4 gene4       2          0        3.5
# 5 gene5       0          1        4  
Dickerson answered 21/12, 2023 at 18:17 Comment(2)
Sure you can ... |> pivot_wider(names_from = treatment, values_from = mean). Row names are bad and I don't think you should use them, but if you really want to after doing the above pivot_wider you can tack |> tibble::column_to_rownames(var = "gene")Dickerson
Thanks! I added pivot_wider(names_from = treatment, values_from = mean) to reformat output (i've edited my post since)Nepotism
P
6

An approach in base R which works for the particular toy data example given:

colnames(expr) = paste0(colnames(expr), "_", 
                        meta$treatment[match(colnames(expr), meta$sample)])
vapply(unique(meta$treatment), 
       \(i) rowMeans(expr[grepl(i, colnames(expr))]), numeric(nrow(expr)))
#>       control treatment1 treatment2
#> gene1       3          1        3.0
#> gene2       2          2        2.5
#> gene3       3          1        3.0
#> gene4       2          0        3.5
#> gene5       0          1        4.0

Data

expr <- data.frame(sample1 = c(1,2,2,0,0), 
                   sample2 = c(5,2,4,4,0), 
                   sample3 = c(1,2,1,0,1), 
                   sample4 = c(6,5,6,6,7), 
                   sample5 = c(0,0,0,1,1))
rownames(expr) <- paste0("gene",1:5)

meta <- data.frame(sample = paste0("sample",1:5),
                   treatment = c("control","control",
                                 "treatment1", 
                                 "treatment2", "treatment2"))
Plainlaid answered 21/12, 2023 at 18:45 Comment(0)
O
4

A base R approach:

expr|>
    split.default(with(meta, treatment[match(names(expr), sample)]))|>
    lapply(rowMeans)|>
    structure(dim=3)|>
    array2DF()

        Var1 gene1 gene2 gene3 gene4 gene5
1    control     3   2.0     3   2.0     0
2 treatment1     1   2.0     1   0.0     1
3 treatment2     3   2.5     3   3.5     4
Ovate answered 21/12, 2023 at 19:34 Comment(0)
D
2

Something like this. It's not entirely clear what you want to group by in the last step, but you can adjust that easily.

library(dplyr)
library(tidyr)

expr |>
  mutate(gene = row.names(expr)) |>
  pivot_longer(-gene, names_to = "sample") |>
  left_join(meta, by = "sample") |>
  summarize(mean = mean(value), .by = c(gene, treatment)) |> 
  pivot_wider(names_from = treatment, values_from = mean)
# # A tibble: 5 × 4
#   gene  control treatment1 treatment2
#   <chr>   <dbl>      <dbl>      <dbl>
# 1 gene1       3          1        3  
# 2 gene2       2          2        2.5
# 3 gene3       3          1        3  
# 4 gene4       2          0        3.5
# 5 gene5       0          1        4  
Dickerson answered 21/12, 2023 at 18:17 Comment(2)
Sure you can ... |> pivot_wider(names_from = treatment, values_from = mean). Row names are bad and I don't think you should use them, but if you really want to after doing the above pivot_wider you can tack |> tibble::column_to_rownames(var = "gene")Dickerson
Thanks! I added pivot_wider(names_from = treatment, values_from = mean) to reformat output (i've edited my post since)Nepotism
Q
2

Here is a data.table approach with the same logic provided by @Gregor Thomas:

library(data.table)

expr_dt <- setDT(expr)
expr_dt[, gene := rownames(expr)]

meta_dt <- setDT(meta)

melt(expr_dt, id.vars = "gene", variable.name = "sample", value.name = "expression")[
  meta_dt, on = .(sample)][
    , .(mean = mean(expression)), by = .(gene, treatment)][
      , dcast(.SD, gene ~ treatment, value.var = "mean")]
   gene control treatment1 treatment2
1:    1       3          1        3.0
2:    2       2          2        2.5
3:    3       3          1        3.0
4:    4       2          0        3.5
5:    5       0          1        4.0
Quackery answered 21/12, 2023 at 19:16 Comment(0)
C
2

Here's another approach in tidyverse, replacing sample numbers with treatments from meta by creating a named vector instead of left_join(), and also using values_fn within pivot_wider() instead of summarise():

library(dplyr)
library(tidyr)

expr %>% 
  tibble::rownames_to_column("gene") %>% 
  pivot_longer(-gene) %>% 
  mutate(name = split(meta$treatment, meta$sample)[name]) %>%
  pivot_wider(values_fn = mean)

#> # A tibble: 5 × 4
#>   gene  control treatment1 treatment2
#>   <chr>   <dbl>      <dbl>      <dbl>
#> 1 gene1       3          1        3  
#> 2 gene2       2          2        2.5
#> 3 gene3       3          1        3  
#> 4 gene4       2          0        3.5
#> 5 gene5       0          1        4

Created on 2023-12-21 with reprex v2.0.2

Cheerless answered 21/12, 2023 at 20:8 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.