Split dataframe into multiple dataframes by grouping columns

Asked 21/12, 2023 at 18:6 Answered 21/12, 2023 at 20:8

I have a dataframe of expression data where gene are rows and columns are samples. I also have a dataframe containing metadata for each sample in the expression dataframe. In reality my expr dataframe has 30,000+ rows and 100+ columns. However, below is an example with smaller data.

expr <- data.frame(sample1 = c(1,2,2,0,0), 
                   sample2 = c(5,2,4,4,0), 
                   sample3 = c(1,2,1,0,1), 
                   sample4 = c(6,5,6,6,7), 
                   sample5 = c(0,0,0,1,1))
rownames(expr) <- paste0("gene",1:5)
meta <- data.frame(sample = paste0("sample",1:5),
                   treatment = c("control","control",
                                 "treatment1", 
                                 "treatment2", "treatment2"))

I want to find the mean for each gene per treatment. From the examples I've seen with split() or group_by() people group based on a column already present in the data.frame. However, I have a separate dataframe (meta) that classifies the grouping for the columns in another dataframe (expr).

I would like my output to be a dataframe with genes as rows, treatment as columns, and values as the mean.

#        control   treatment1   treatment2
#  gene1  mean        mean         mean
#  gene2  mean        mean         mean

Nepotism answered 21/12, 2023 at 18:6 Comment(6)

Mean for each gene per treatment? Does this mean for treatment1 simply expr["sample3"] and for treatment2 simply rowMeans(expr[c("sample4", "sample5"])? This can be done with

colnames(expr) = with(meta, paste0(sample, "_", treatment)); m1 = expr["sample3_treatment1"]; m2 = rowMeans(expr[c("sample4_treatment2", "sample5_treatment2")])

? – Plainlaid 21/12, 2023 at 18:14

@Plainlaid yes! But in reality my problem is much bigger and I have 30,000 rows and 100+ columns. So, I don't want to do this manually. – Nepotism 21/12, 2023 at 18:16

Then I recommend to edit your toy data such that it mimics your real data accurately – Plainlaid 21/12, 2023 at 18:18

@Plainlaid strongly disagree. Small sample data that is easy to work with is far preferable. We want minimal reproducible examples. There are exceptions, like if the focus of the question is on computational efficiency then larger data may be needed to get accurate timings. But generally minimal examples are much preferred. – Dickerson 21/12, 2023 at 18:45

@GregorThomas we are talking past each other because I expressed myself badly. To be more precise, I have the feeling that the structure of colnames(expr) and sample names is more complex in the real data at hand compared to the toy data given. – Plainlaid 21/12, 2023 at 18:48

Glad to hear you're not pushing for 30000x100 examples :) – Dickerson 21/12, 2023 at 18:50

Something like this. It's not entirely clear what you want to group by in the last step, but you can adjust that easily.

library(dplyr)
library(tidyr)

expr |>
  mutate(gene = row.names(expr)) |>
  pivot_longer(-gene, names_to = "sample") |>
  left_join(meta, by = "sample") |>
  summarize(mean = mean(value), .by = c(gene, treatment)) |> 
  pivot_wider(names_from = treatment, values_from = mean)
# # A tibble: 5 × 4
#   gene  control treatment1 treatment2
#   <chr>   <dbl>      <dbl>      <dbl>
# 1 gene1       3          1        3  
# 2 gene2       2          2        2.5
# 3 gene3       3          1        3  
# 4 gene4       2          0        3.5
# 5 gene5       0          1        4

Dickerson answered 21/12, 2023 at 18:17 Comment(2)

Sure you can ... |> pivot_wider(names_from = treatment, values_from = mean). Row names are bad and I don't think you should use them, but if you really want to after doing the above pivot_wider you can tack |> tibble::column_to_rownames(var = "gene") – Dickerson 21/12, 2023 at 18:41

Thanks! I added pivot_wider(names_from = treatment, values_from = mean) to reformat output (i've edited my post since) – Nepotism 21/12, 2023 at 18:46

An approach in base R which works for the particular toy data example given:

colnames(expr) = paste0(colnames(expr), "_", 
                        meta$treatment[match(colnames(expr), meta$sample)])
vapply(unique(meta$treatment), 
       \(i) rowMeans(expr[grepl(i, colnames(expr))]), numeric(nrow(expr)))
#>       control treatment1 treatment2
#> gene1       3          1        3.0
#> gene2       2          2        2.5
#> gene3       3          1        3.0
#> gene4       2          0        3.5
#> gene5       0          1        4.0

Data

expr <- data.frame(sample1 = c(1,2,2,0,0), 
                   sample2 = c(5,2,4,4,0), 
                   sample3 = c(1,2,1,0,1), 
                   sample4 = c(6,5,6,6,7), 
                   sample5 = c(0,0,0,1,1))
rownames(expr) <- paste0("gene",1:5)

meta <- data.frame(sample = paste0("sample",1:5),
                   treatment = c("control","control",
                                 "treatment1", 
                                 "treatment2", "treatment2"))

Plainlaid answered 21/12, 2023 at 18:45 Comment(0)

A base R approach:

expr|>
    split.default(with(meta, treatment[match(names(expr), sample)]))|>
    lapply(rowMeans)|>
    structure(dim=3)|>
    array2DF()

        Var1 gene1 gene2 gene3 gene4 gene5
1    control     3   2.0     3   2.0     0
2 treatment1     1   2.0     1   0.0     1
3 treatment2     3   2.5     3   3.5     4

Ovate answered 21/12, 2023 at 19:34 Comment(0)

Something like this. It's not entirely clear what you want to group by in the last step, but you can adjust that easily.

library(dplyr)
library(tidyr)

expr |>
  mutate(gene = row.names(expr)) |>
  pivot_longer(-gene, names_to = "sample") |>
  left_join(meta, by = "sample") |>
  summarize(mean = mean(value), .by = c(gene, treatment)) |> 
  pivot_wider(names_from = treatment, values_from = mean)
# # A tibble: 5 × 4
#   gene  control treatment1 treatment2
#   <chr>   <dbl>      <dbl>      <dbl>
# 1 gene1       3          1        3  
# 2 gene2       2          2        2.5
# 3 gene3       3          1        3  
# 4 gene4       2          0        3.5
# 5 gene5       0          1        4

Dickerson answered 21/12, 2023 at 18:17 Comment(2)

Thanks! I added pivot_wider(names_from = treatment, values_from = mean) to reformat output (i've edited my post since) – Nepotism 21/12, 2023 at 18:46

Here is a data.table approach with the same logic provided by @Gregor Thomas:

library(data.table)

expr_dt <- setDT(expr)
expr_dt[, gene := rownames(expr)]

meta_dt <- setDT(meta)

melt(expr_dt, id.vars = "gene", variable.name = "sample", value.name = "expression")[
  meta_dt, on = .(sample)][
    , .(mean = mean(expression)), by = .(gene, treatment)][
      , dcast(.SD, gene ~ treatment, value.var = "mean")]

   gene control treatment1 treatment2
1:    1       3          1        3.0
2:    2       2          2        2.5
3:    3       3          1        3.0
4:    4       2          0        3.5
5:    5       0          1        4.0

Quackery answered 21/12, 2023 at 19:16 Comment(0)

Here's another approach in tidyverse, replacing sample numbers with treatments from meta by creating a named vector instead of left_join(), and also using values_fn within pivot_wider() instead of summarise():

library(dplyr)
library(tidyr)

expr %>% 
  tibble::rownames_to_column("gene") %>% 
  pivot_longer(-gene) %>% 
  mutate(name = split(meta$treatment, meta$sample)[name]) %>%
  pivot_wider(values_fn = mean)

#> # A tibble: 5 × 4
#>   gene  control treatment1 treatment2
#>   <chr>   <dbl>      <dbl>      <dbl>
#> 1 gene1       3          1        3  
#> 2 gene2       2          2        2.5
#> 3 gene3       3          1        3  
#> 4 gene4       2          0        3.5
#> 5 gene5       0          1        4

^{Created on 2023-12-21 with reprex v2.0.2}

Cheerless answered 21/12, 2023 at 20:8 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags