How to calculate a (co-)occurrence matrix from a data frame with several columns using R?
Asked Answered
B

3

7

I'm a rookie in R and currently working with collaboration data in the form of an edge list with 32 columns and around 200.000 rows. I want to create a (co-)occurrence matrix based on the interaction between countries. However, I want to count the number of interactions by the total number of an object.

Basic Example of Aspired Outcome

If in one row "England" occurs three times and "China" only one time, the result should be the following matrix.

         England  China
England    3        3
China      3        1

Reproducible example

df <- data.frame(ID = c(1,2,3,4), 
 V1 = c("England", "England", "China", "England"),
 V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))

Accordingly, an example data frame currently looks like this:

ID  V1       V2       ...   V32
1   England  Greece         USA
2   England  England        China
3   China    Greece         Greece
4   England  England        England
.
.
.

Aspired outcome

I want to count (co-)occurrences row-wise and independent of order to get a (co-)occurrence matrix that accounts for low frequencies of edge loops (e.g. England-England), which leads to the following result:

         China   England   Greece   USA

China    2        2         2        0

England  2        6         1        1

Greece   2        1         3        1

USA      0        1         1        1

What has been tried so far

I've used igraph to get an adjacency matrix with co-occurrences. However, it calculates - as supposed to - not more than two interactions of the same two objects, leaving me with values far below actual frequency of objects by row/publication in some cases.

df <- data.frame(ID = c(1,2,3,4), 
 V1 = c("England", "England", "China", "England"),
 V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))

# remove ID column

df[1] <- list(NULL)

# calculate co-occurrences and return as dataframe

library(igraph)
library(Matrix)

countrydf <- graph.data.frame(df)
countrydf2 <- as_adjacency_matrix(countrydf, type = "both", edges = FALSE)
countrydf3 <- as.data.frame(as.matrix(forceSymmetric(countrydf2)))

         China   England   Greece   USA

China    0        0         1        0

England  0        2         1        0

Greece   1        1         0        0

USA      0        0         0        0

I assume there has to be an easy solution using base and/or dplyr and /or table and/or reshape2 similar to [1], [2], [3], [4] or [5] but nothing has done the trick so far and I was not able to adjust the code to my needs. I've also tried to use [6] as a basis, however, the same issue applies here, too.

library(tidry)
library(dplyr)
library(stringr)


# collapse observations into one column

df2 <- df %>% unite(concat, V1:V32, sep = ",")

# calculate weights

df3 <- df2$concat %>%
  str_split(",") %>%
  lapply(function(x){
    expand.grid(x,x,x,x, w = length(x), stringsAsFactors = FALSE)
  }) %>%
  bind_rows

df4 <- apply(df3[, -5], 1, sort) %>%
  t %>%
  data.frame(stringsAsFactors = FALSE) %>%
  mutate(w = df3$w)

I'd be glad if someone could point me in the right direction.

Bash answered 7/1, 2020 at 16:54 Comment(8)
Can you explain your input dataframe more, it is not an edgelist to me. An edge list should be 2 columns with node IDs (countries in your case) and an edge attributes. Does that mean that for each row of your df you draw an edge between all pairs of countries?Capelin
Your data look more like a transaction list see the arules packages for functions that may help you work with your data without too much extra manipulation.Capelin
Thanks, I'll take a look! One row represents one publication. Each column represent a country affiliation of the participating authors' institution. Originally one publication (and the related information) was distributed by ID over several rows. I've summarized to one row by ID, created separate columns for each piece of information (author, country, ...) separating observations by comma, created different data sets containing ID and information of one column and transformed from comma separation to separation by column.Bash
I want to draw edges between pairs of countries. The goal is to calculate co-occurrences between pairs of countries reflecting occurrences/frequencies of each object. From my understanding England-England-USA-Greece would usually be regarded in a matrix in the way that England-England, England-USA etc. get a weight of 1 while e.g. USA-USA would get a weight of 0. However, I want to count by occurrence so England-England results in a weight of 2 and USA-USA results in a weight of 1. In this case a weight of 0 would only be possible if two countries do not appear in one row together.Bash
In your "basic example", why the entry for China-England or England-China is 3? Could you explain a bit?Binghi
Is my understanding correct that there seems to be two processes going in for your final outcome; that is the off-diagonals are when countries co-occur / there is an edge between them but the diagonals are just raw counts?Transition
@Transition - Yes, that's already been very helpful! Thank you! :)Bash
@Binghi - Well, the main idea is to count pairwise. I want to allocate the occurrence value of a string/observation of two strings/observations that appears the most to the other string it is connected to.Bash
W
4

There may be better ways to do this, but try:

library(tidyverse)

df1 <- df %>%
pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
xtabs(~ID + Country, data = ., sparse = FALSE) %>% 
crossprod(., .) 

df_diag <- df %>% 
pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
mutate(Country2 = Country) %>%
xtabs(~Country + Country2, data = ., sparse = FALSE) %>% 
diag()

diag(df1) <- df_diag 

df1

Country   China England Greece USA
  China       2       2      2   0
  England     2       6      1   1
  Greece      2       1      3   1
  USA         0       1      1   1
Wenoa answered 7/1, 2020 at 21:47 Comment(2)
Thank you very much! I'll need some time to figure out what the code is doing but it's working fine.Bash
Unless I'm missing something, the same approach in less steps could be: tab = table(df$ID[row(df[-1])], as.matrix(df[-1])); df1 = crossprod(tab); diag(df1) = colSums(tab); df1Frith
O
1

Here is a way using dplyr and tidyr packages, the whole idea lies in creating a dataframe with row-wise occurrence of each country then joining it on itself.

library(dplyr)

# Create dataframe sammple
df <- data.frame(ID = c(1,2,3,4), 
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"),
                 stringsAsFactors = FALSE)

# Get the occurance of each country in every row.
row_occurance <- 
  df %>%
  tidyr::gather(key = "identifier", value = "country", -ID) %>%
  group_by(ID, country) %>%
  count()

row_occurance %>%
  # Join row_occurance on itself to simulate the matrix
  left_join(row_occurance, by = "ID") %>%
  # Get the highest occurance row wise, this to handle when country
  # name is repeated within same row
  mutate(Occurance = pmax(n.x, n.y)) %>%
  # Group by 2 countries
  group_by(country.x, country.y) %>%
  # Sum the occurance of 2 countries together
  summarise(Occurance = sum(Occurance)) %>%
  # Spread the data to make it in matrix format
  tidyr::spread(key = "country.y", value = "Occurance", fill = 0)

# # A tibble: 4 x 5
# # Groups:   country.x [4]
# country.x China England Greece   USA
# <chr>     <dbl>   <dbl>  <dbl> <dbl>
# China         2       2      2     0
# England       2       6      1     1
# Greece        2       1      3     1
# USA           0       1      1     1
Odaodab answered 7/1, 2020 at 22:4 Comment(1)
Thank you very much! In particular, for the explanations.Bash
S
1

An option using base::table:

df <- data.frame(ID = c(1,2,3,4), 
    V1 = c("England", "England", "China", "England"),
    V2 = c("Greece", "England", "Greece", "England"),
    V3 = c("USA", "China", "Greece", "England"))

#get paired combi and remove those from same country
pairs <- as.data.frame(do.call(rbind, 
    by(df, df$ID, function(x) t(combn(as.character(x[-1L]), 2L)))))
pairs <- pairs[pairs$V1!=pairs$V2, ]

#repeat data frame with columns swap so that 
#upper and lower tri have same numbers and all countries are shown
pairs <- rbind(pairs, data.frame(V1=pairs$V2, V2=pairs$V1))

#tabulate pairs
tab <- table(pairs)

#set diagonals to be the count of countries
cnt <- c(table(unlist(df[-1L])))
diag(tab) <- cnt[names(diag(tab))]

tab

output:

         V2
V1        China England Greece USA
  China       2       2      2   0
  England     2       6      1   1
  Greece      2       1      3   1
  USA         0       1      1   1
Satyr answered 8/1, 2020 at 2:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.