Convert a list of character vectors into a binary matrix

I am working in R and have a named list of character vectors. Each vector describes the genes present in a biological pathway.

Please see example below:

gene_sets <- list(pathwayX = c("Gene3"),
                  pathwayY = c("Gene2", "Gene3"),
                  pathwayz = c("Gene1", "Gene2", "Gene3"))

> gene_sets
$pathwayX
[1] "Gene3"

$pathwayY
[1] "Gene2" "Gene3"

$pathwayz
[1] "Gene1" "Gene2" "Gene3"

I want to convert the list into a binary matrix with pathways as rows and genes as columns, where 1 means the gene is present in the pathway and 0 means it is not.

         Gene1 Gene2 Gene3
pathwayX     0     0     1
pathwayY     0     1     1
pathwayz     1     1     1
Congratulant answered 18/7, 2022 at 6:55 Comment(0)

We can use

t(+sapply(gene_sets, "%in%", x = c("Gene1", "Gene2", "Gene3")))

To derive c("Gene1", "Gene2", "Gene3") from the data instead of hard-coding it, we can do

GeneID <- sort(unique(unlist(gene_sets)))

mat <- t(+sapply(gene_sets, "%in%", x = GeneID))  ## matrix output
colnames(mat) <- GeneID
#         Gene1 Gene2 Gene3
#pathwayX     0     0     1
#pathwayY     0     1     1
#pathwayz     1     1     1

data.frame(mat)  ## data.frame output
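
As a hedged sketch, the same %in% idea can be wrapped in a small helper. The name sets_to_matrix and its ids argument are hypothetical and not part of the answer above. vapply is used in place of sapply so that the shape of the result does not depend on the input, and the values are stored as integers.

```r
# Example data from the question
gene_sets <- list(pathwayX = "Gene3",
                  pathwayY = c("Gene2", "Gene3"),
                  pathwayz = c("Gene1", "Gene2", "Gene3"))

# Hypothetical helper (name and arguments are illustrative only):
# returns an integer 0/1 matrix with pathways as rows and genes as columns.
sets_to_matrix <- function(sets,
                           ids = sort(unique(unlist(sets, use.names = FALSE)))) {
  # One 0/1 vector of length(ids) per pathway; vapply enforces that shape
  hits <- vapply(sets, function(genes) as.integer(ids %in% genes),
                 integer(length(ids)))
  # vapply stores one pathway per column, so fill the result row by row
  matrix(hits, nrow = length(sets), ncol = length(ids), byrow = TRUE,
         dimnames = list(names(sets), ids))
}

mat <- sets_to_matrix(gene_sets)
mat
```

Passing ids explicitly fixes the column set and order, which may help when several lists have to share the same gene universe.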

My impression is that gene-set problems are usually large and sparse. If you really have hundreds of thousands of genes and pathways, a dense matrix wastes memory on zeros, and the following sparse matrix solution is likely the better option.

pathwayID <- names(gene_sets)
n1 <- lengths(gene_sets, use.names = FALSE)  ## number of genes in each pathway
genesVec <- unlist(gene_sets, use.names = FALSE)
GeneID <- sort(unique(genesVec))
i <- rep(seq_along(n1), n1)   ## row (pathway) index of each gene entry
j <- match(genesVec, GeneID)  ## column (gene) index of each gene entry
Matrix::sparseMatrix(i = i, j = j, x = rep.int(1, length(i)),
                     dimnames = list(pathwayID, GeneID))
#3 x 3 sparse Matrix of class "dgCMatrix"
#         Gene1 Gene2 Gene3
#pathwayX     .     .     1
#pathwayY     .     1     1
#pathwayz     1     1     1
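
Two edge cases are worth checking before relying on the sparse construction. sparseMatrix adds up the x values of repeated (i, j) pairs, so a gene listed twice in one pathway would yield 2. The number of rows is also inferred from the largest row index unless dims is supplied, which matters if the last pathway is empty. A minimal sketch, assuming the Matrix package is installed; the example data and the names sets, gene_id and m are made up for illustration:

```r
library(Matrix)

# Illustrative data: a duplicated gene and an empty pathway
gene_sets <- list(pathwayX     = "Gene3",
                  pathwayY     = c("Gene2", "Gene3", "Gene3"),
                  pathwayEmpty = character(0))

sets    <- lapply(gene_sets, unique)           # drop repeats within a pathway
n       <- lengths(sets, use.names = FALSE)    # genes per pathway (may be 0)
genes   <- unlist(sets, use.names = FALSE)
gene_id <- sort(unique(genes))

m <- sparseMatrix(i = rep(seq_along(n), n),    # pathway index per entry
                  j = match(genes, gene_id),   # gene index per entry
                  x = 1,                       # recycled for every entry
                  dims = c(length(sets), length(gene_id)),
                  dimnames = list(names(sets), gene_id))
m
```

Deduplicating within each pathway keeps the entries strictly 0/1, and the explicit dims keeps a row for the empty pathway.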
Singularize answered 18/7, 2022 at 6:58 Comment(1)
Tracing mtabulate in Quinten's answer, I found the Q&A that motivated this function: create a data frame of tag frequencies from a list of tag vectors. I have voted up that question. I shall update my answer with a benchmark when I am available. - Singularize

One idea is to combine stack, which flattens the list into a two-column data frame of genes and pathway names, with table, i.e.

t(table(stack(gene_sets)))

          values
ind        Gene1 Gene2 Gene3
  pathwayX     0     0     1
  pathwayY     0     1     1
  pathwayz     1     1     1
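
One caveat, shown as a hedged sketch: table counts occurrences, so a gene that appears twice in a pathway gives 2 rather than 1. Thresholding the counts restores a strictly binary result. The duplicated Gene3 in the example data and the names counts and binary are assumptions for illustration.

```r
# Illustrative data: Gene3 is listed twice in pathwayY
gene_sets <- list(pathwayX = "Gene3",
                  pathwayY = c("Gene2", "Gene3", "Gene3"),
                  pathwayz = c("Gene1", "Gene2", "Gene3"))

# Occurrence counts, pathways as rows and genes as columns
counts <- t(table(stack(gene_sets)))

# Any positive count becomes 1; zero stays 0
binary <- (counts > 0) + 0L
binary
```

If the input is known to contain no repeated genes within a pathway, the thresholding step is unnecessary.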
Julee answered 18/7, 2022 at 7:12 Comment(0)

You could also use mtabulate from qdapTools like this:

gene_sets = list(pathwayX= c("gene3"),pathwayY= c("gene2", "gene3"),pathwayz= c("gene1", "gene2","gene3"))

library(qdapTools)
mtabulate(gene_sets)
#>          gene1 gene2 gene3
#> pathwayX     0     0     1
#> pathwayY     0     1     1
#> pathwayz     1     1     1

Created on 2022-07-18 by the reprex package (v2.0.1)

Numbat answered 18/7, 2022 at 7:10 Comment(0)

Here is an attempt with the tidyverse:

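library(dplyr)
library(tibble)
library(stringr)  # needed for str_remove() and str_replace_all()

as.data.frame(unlist(gene_sets)) %>% 
  transmute(gene = as.factor(`unlist(gene_sets)`)) %>% 
  rownames_to_column() %>% 
  # unlist() appends 1, 2, ... to the names of multi-gene pathways; strip that suffix
  # (this also strips digits that end an original pathway name)
  mutate(rowname = str_remove(rowname, "[0-9]+$")) %>% 
  cbind((model.matrix(~ gene + 0, .) == 1)*1) %>% 
  rename_with(., ~str_replace_all(., "geneGene", "Gene")) %>% 
  group_by(rowname) %>% 
  summarise(across(-gene, ~sum(.)))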
  rowname  Gene1 Gene2 Gene3
  <chr>    <dbl> <dbl> <dbl>
1 pathwayX     0     0     1
2 pathwayY     0     1     1
3 pathwayz     1     1     1
Antiar answered 18/7, 2022 at 7:52 Comment(0)

Here is a somewhat concise tidyverse way if one were compelled to go down that path.

library(dplyr)
library(purrr)
library(tibble)
library(tidyr)

map_dfr(gene_sets,
        ~ as_tibble_row(set_names(rep(1L, length(.x)), .x)), .id = "row") %>% 
  relocate(row, order(colnames(.))) %>% 
  mutate(across(-1, ~ replace_na(.x, 0L)))  ## genes absent from a pathway become 0
# A tibble: 3 x 4
  row      Gene1 Gene2 Gene3
  <chr>    <int> <int> <int>
1 pathwayX     0     0     1
2 pathwayY     0     1     1
3 pathwayz     1     1     1
Welter answered 18/7, 2022 at 12:40 Comment(0)
