Extract intersection list from upset object

Asked 26/11, 2020 at 17:57 Answered 3/11, 2022 at 22:8

Solved r set-intersection set-operations upsetr

I'm making some comparisons with UpSetR, and I'd like to save the lists of elements that fall into each intersection. Is this possible? I can't find it anywhere...

It would be pretty tedious to do it manually (many lists), and since they're calculated anyway not being able to save them is frustrating

Milkwhite answered 26/11, 2020 at 17:57 Comment(5)

Maybe this post can help - https://mcmap.net/q/573746/-intersection-of-lists-in-r – Mckinley 26/11, 2020 at 18:12

aren't you the one that defines the plots? – Talaria 26/11, 2020 at 18:17

@Mckinley thanks, it seems helpful, I was getting tangled trying to do it in dplyr:) I'm just surprised that there isn't an option to do it from the upset() output, I wouldn't expect it to be complicated implement, all the intersections are calculated anyway.. just need a handle to fetch them, I guess.. @ Abdessabour Mtk I'm not sure I got your point, but in upsetR you define the sets you want to compare and all the intersection are done under the hood and directly plotted.. that's why it's handy :) – Milkwhite 26/11, 2020 at 20:52

If you do x <- upset(...) it returns the data used for plotting. I couldn't easily see where the info you are after in that x object. – Mckinley 26/11, 2020 at 20:56

You could always ask for a "feature request" at: github.com/hms-dbmi/UpSetR/issues – Mckinley 26/11, 2020 at 20:58

There is no ready upSetR function for this (yet). But, it is possible to extract it:

library(UpSetR)

# Example input as list, expected output is 1 and 5:
listInput <- list(one = c(1, 2, 3, 5, 7, 8, 11, 12, 13), 
                  two = c(1, 2, 4, 5, 10),
                  three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))

When assigned upset returns a value, which also includes the data:

x <- upset(fromList(listInput))
x$New_data
#    one two three
# 1    1   1     1
# 2    1   1     0
# 3    1   0     0
# 4    1   1     1
# 5    1   0     1
# 6    1   0     1
# 7    1   0     0
# 8    1   0     1
# 9    1   0     1
# 10   0   1     0
# 11   0   1     1
# 12   0   0     1
# 13   0   0     1

From here we can see it is 1st and the 4th rows are found in all three sets. The order of items are defined based on the order they appear in the list, see:

x1 <- unlist(listInput, use.names = FALSE)
x1 <- x1[ !duplicated(x1) ]
x1
# [1]  1  2  3  5  7  8 11 12 13  4 10  6  9

Now we know the rownumbers from "New_data" refer to in our list. So, as we have 3 columns, filter rows where sum is 3:

x1[ rowSums(x$New_data) == 3 ]
# [1] 1 5

Or we could just use Reduce:

Reduce(intersect, listInput)
# [1] 1 5

Mckinley answered 26/11, 2020 at 21:23 Comment(0)

Here is my solution. It is bio-related, but should be easy to translate to other fields. I start with a list of vectors. In my case a list of genes (list of characters) belonging to different signatures (different sets).

str(list_filter)
List of 9
 $ CellAge_Induces         : chr [1:153] "AAK1" "ABI3" "ADCK5" "AGT" ...
 $ CellAge_Inhibits        : chr [1:121] "ACLY" "AKR1B1" "ASPH" "ATF7IP" ...
 $ CLASSICAL_SASP          : chr [1:38] "BGN" "CCL2" "CCL20" "COL1A1" ...
 $ FRIDMAN_SENESCENCE_UP   : chr [1:77] "ALDH1A3" "CCND1" "CD44" "CDKN1A" ...
 $ ISM_SCORE               : chr [1:128] "HSH2D" "OTOF" "TRIM69" "PSME1" ...
 $ MOSERLE_IFNA_RESPONSE   : chr [1:31] "CD274" "CMPK2" "CXCL10" "DDX58" ...
 $ REACTOME_SENESCENCE_SASP: chr [1:110] "ANAPC1" "ANAPC10" "ANAPC11" "ANAPC15" ...
 $ SAEPHIA_CURATED_SASP    : chr [1:38] "IL1A" "IL1B" "CXCL10" "CXCL1" ...
 $ senmayo                 : chr [1:125] "ACVR1B" "ANG" "ANGPT1" "ANGPTL4" ...

From this list, I generate two tables: One with the unique gene names

df2 <- data.frame(gene=unique(unlist(list_filter)))

head(df2)
     gene
1    AAK1
2    ABI3
3   ADCK5
4     AGT
5    AKT1
6 ALOX15B

dim(df2)
[1] 671   1

One is simply a "dataframe" version of the list. With every gene in the signature and the name of every signature (set).

df1 <- lapply(list_filter,function(x){
  data.frame(gene = x)
}) %>% 
  bind_rows(.id = "path")

head(df1)
             path    gene
1 CellAge_Induces    AAK1
2 CellAge_Induces    ABI3
3 CellAge_Induces   ADCK5
4 CellAge_Induces     AGT
5 CellAge_Induces    AKT1
6 CellAge_Induces ALOX15B

dim(df1)
[1] 821   2

now I iterate the search of each unique gene name and save the identity of the signatures in a column.

df_int <- lapply(df2$gene,function(x){
  # pull the name of the intersections
  intersection <- df1 %>% 
    dplyr::filter(gene==x) %>% 
    arrange(path) %>% 
    pull("path") %>% 
    paste0(collapse = "|")
  
  # build the dataframe
  data.frame(gene = x,int = intersection)
}) %>% 
  bind_rows()

head(df_int,n=20)
      gene                                                            int
1     AAK1                                                CellAge_Induces
2     ABI3                                                CellAge_Induces
3    ADCK5                                                CellAge_Induces
4      AGT                                                CellAge_Induces
5     AKT1                                                CellAge_Induces
6  ALOX15B                                                CellAge_Induces
7       AR                                                CellAge_Induces
8   ARPC1B                                                CellAge_Induces
9    ASF1A                                                CellAge_Induces
10     AXL                                        CellAge_Induces|senmayo
11 BHLHE40                                                CellAge_Induces
12     BLK                                                CellAge_Induces
13    BRAF                                                CellAge_Induces
14    BRD7                                                CellAge_Induces
15    CAV1                                                CellAge_Induces
16   CCND1                          CellAge_Induces|FRIDMAN_SENESCENCE_UP
17   CDK18                                                CellAge_Induces
18  CDKN1A CellAge_Induces|FRIDMAN_SENESCENCE_UP|REACTOME_SENESCENCE_SASP
19  CDKN1C                          CellAge_Induces|FRIDMAN_SENESCENCE_UP
20  CDKN1B                       CellAge_Induces|REACTOME_SENESCENCE_SASP

dim(df_int)
[1] 671   2

the dataframe can be summarised and compared to the output provided by calling

df_int %>% 
  group_by(int) %>% 
  summarise(n=n()) %>% 
  arrange(desc(n))
# A tibble: 47 × 2
   int                                 n
   <chr>                           <int>
 1 CellAge_Induces                   126
 2 CellAge_Inhibits                  110
 3 REACTOME_SENESCENCE_SASP           95
 4 ISM_SCORE                          93
 5 senmayo                            77
 6 FRIDMAN_SENESCENCE_UP              44
 7 ISM_SCORE|MOSERLE_IFNA_RESPONSE    27
 8 CLASSICAL_SASP|senmayo             12
 9 CLASSICAL_SASP                      8
10 SAEPHIA_CURATED_SASP                8
# … with 37 more rows
# ℹ Use `print(n = ...)` to see more rows


upset(fromList(list_filter),nsets = 10)

Canella answered 3/11, 2022 at 22:8 Comment(2)

I was looking for this so urgently, thank you very much! – Leverrier 10/5, 2023 at 13:32

This one works perfectly. As a note it will only work on characters, without saces or other special characters, at least in my case. – Quiche 29/5 at 15:52

Here is my take at extracting the different intersections together with the list of elements in them.

The main idea is to paste all the 0's and 1's from the binary table to create unique identifiers for each intersection and them use the dplyr::group_by function to extract the information

data <- data.frame(
  entry = paste0("Entry.", 1:10),
  "A" = c(0,0,0,0,1,0,1,1,0,0),
  "B" = c(1,0,0,0,1,1,1,1,1,0),
  "C" = c(1,1,1,1,0,0,1,0,1,1)
)

# NOT REQUIRED. Only to confirm that upset works with these data
upset(data)

You can then identify the intersections by pasting all the binary columns. I use the unite convenience function for this:

NB: you may have to change this depending on whether your data has row names or a column with names

data_with_intersection <- data %>%
  unite(col = "intersection", -c("entry"), sep = "")

From here, you can easily calculate the size of each intersection:

# Table of intersections and the number of entries
data_with_intersection %>%
  group_by(intersection) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

Or even extract the list of entries/elements in each intersection:

# List of intersections and their entries
data_with_intersection %>%
  group_by(intersection) %>%
  summarise(list = list(entry)) %>%
  mutate(list = setNames(list, intersection)) %>%
  pull(list)

Meteorite answered 19/1, 2022 at 18:31 Comment(1)

fantastic solution can I get here names instead of 0 and 1 such as ABC in place of 0 and 1? – Caddaric 29/8, 2022 at 11:18