I'm a rookie in R and currently working with collaboration data in the form of an edge list with 32 columns and around 200,000 rows. I want to create a (co-)occurrence matrix based on the interactions between countries. However, the counts should reflect how often each country occurs in a row, not merely whether it occurs.
Basic Example of Aspired Outcome
If in one row "England" occurs three times and "China" only one time, the result should be the following matrix.
        England China
England       3     3
China         3     1
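To make the rule behind this small matrix explicit, here is a minimal base R sketch for that single row. It assumes (this is a reading of the example, not an established method) that each off-diagonal cell is the product of the two countries' counts and that the diagonal holds each country's own count. The names `row_values`, `counts`, and `m` are illustrative only.

```r
# One row in which England occurs three times and China once
row_values <- c("England", "England", "England", "China")

# Occurrences per country within the row; factor() fixes the display order
counts <- table(factor(row_values, levels = c("England", "China")))
counts <- setNames(as.vector(counts), names(counts))

# Off-diagonal cells: product of the two counts (3 * 1 = 3)
m <- outer(counts, counts)

# Diagonal cells: the country's own count instead of its square (3, not 9)
diag(m) <- counts
m
```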
Reproducible example
df <- data.frame(ID = c(1,2,3,4),
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"))
Accordingly, an example data frame currently looks like this:
ID  V1       V2       ...  V32
1   England  Greece        USA
2   England  England       China
3   China    Greece        Greece
4   England  England       England
.
.
.
Aspired outcome
I want to count (co-)occurrences row-wise and independent of order, so that the (co-)occurrence matrix also accounts for the low frequency of edge loops (e.g. England-England). This leads to the following result:
        China England Greece USA
China       2       2      2   0
England     2       6      1   1
Greece      2       1      3   1
USA         0       1      1   1
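One way to state this counting rule precisely is the base R sketch below. It reflects a reading of the target matrix rather than a tested solution for the full data: count each country per row, take the cross-product of those counts for the off-diagonal cells, and put each country's total number of occurrences on the diagonal. The names `values`, `row_id`, `counts`, and `co` are illustrative.

```r
df <- data.frame(ID = c(1, 2, 3, 4),
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"),
                 stringsAsFactors = FALSE)

# Stack all country columns; unlist() goes column by column,
# so the row index has to be repeated once per column
values <- unlist(df[-1], use.names = FALSE)
row_id <- rep(seq_len(nrow(df)), times = ncol(df) - 1)

# Rows x countries matrix of per-row occurrence counts
# (table() drops NA by default, so missing cells are ignored)
counts <- unclass(table(row_id, values))

# Off-diagonal cells: sum over rows of count_i * count_j
co <- crossprod(counts)

# Diagonal cells: total occurrences per country instead of the sum of squares
diag(co) <- colSums(counts)
co
```

On the four example rows, `co` matches the matrix above. For the real data, empty strings (if any) would be counted as a country of their own and would need to be set to `NA` first. With 32 columns and about 200,000 rows, `counts` is mostly zeros, so a sparse representation may be worth considering.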
What has been tried so far
I've used igraph to get an adjacency matrix with co-occurrences. However, it counts (as it is supposed to) no more than two interactions between the same two objects, leaving me with values far below the actual frequency of objects per row/publication in some cases.
df <- data.frame(ID = c(1,2,3,4),
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"))
# remove ID column
df[1] <- list(NULL)
# calculate co-occurrences and return as dataframe
library(igraph)
library(Matrix)
countrydf <- graph_from_data_frame(df)
countrydf2 <- as_adjacency_matrix(countrydf, type = "both", edges = FALSE)
countrydf3 <- as.data.frame(as.matrix(forceSymmetric(countrydf2)))
        China England Greece USA
China       0       0      1   0
England     0       2      1   0
Greece      1       1      0   0
USA         0       0      0   0
I assume there has to be an easy solution using base and/or dplyr and/or table and/or reshape2, similar to [1], [2], [3], [4] or [5], but nothing has done the trick so far and I was not able to adjust the code to my needs. I've also tried to use [6] as a basis; however, the same issue applies here, too.
library(tidyr)
library(dplyr)
library(stringr)
# collapse observations into one column
df2 <- df %>% unite(concat, V1:V32, sep = ",")
# calculate weights
df3 <- df2$concat %>%
  str_split(",") %>%
  lapply(function(x){
    expand.grid(x,x,x,x, w = length(x), stringsAsFactors = FALSE)
  }) %>%
  bind_rows
df4 <- apply(df3[, -5], 1, sort) %>%
  t %>%
  data.frame(stringsAsFactors = FALSE) %>%
  mutate(w = df3$w)
I'd be glad if someone could point me in the right direction.
arules packages for functions that may help you work with your data without too much extra manipulation. – Capelin