I have data that looks like this: https://i.stack.imgur.com/HmpRl.jpg
The first dataset is a standard format dataset which contains a list of people and their financial properties.
The second dataset contains "relationships" between these people - how much they paid to each other, and how much they owe each other.
I am interested learning more about network and graph based clustering - but I am trying to better understand what type of situations require network based clustering, i.e. I don't want to use graph clustering where its not required (avoid a "square peg round hole" type situation).
Using R, first I created some fake data:
library(corrr)
library(dplyr)
library(igraph)
library(visNetwork)
library(stats)
# create first data set
Personal_Information <- data.frame(
"name" = c("John", "Jack", "Jason", "Jim", "Julian", "Jack", "Jake", "Joseph"),
"age" = c("41","33","24","66","21","66","29", "50"),
"salary" = c("50000","20000","18000","66000","77000","0","55000","40000"),
"debt" = c("10000","5000","4000","0","20000","5000","0","1000"
)
Personal_Information$age = as.numeric(Personal_Information$age)
Personal_Information$salary = as.numeric(Personal_Information$salary)
Personal_Information$debt = as.numeric(Personal_Information$debt)
create second data set
Relationship_Information <-data.frame(
"name_a" = c("John","John","John","Jack","Jack","Jack","Jason","Jason","Jim","Jim","Jim","Julian","Jake","Joseph","Joseph"),
"name_b" = c("Jack", "Jason", "Joseph", "John", "Julian","Jim","Jim", "Joseph", "Jack", "Julian", "John", "Joseph", "John", "Jim", "John"),
"how_much_they_owe_each_other" = c("10000","20000","60000","10000","40000","8000","0","50000","6000","2000","10000","10000","50000","12000","0"),
"how_much_they_paid_each_other" = c("5000","40000","120000","20000","20000","8000","0","20000","12000","0","0","0","50000","0","0")
)
Relationship_Information$how_much_they_owe_each_other = as.numeric(Relationship_Information$how_much_they_owe_each_other)
Relationship_Information$how_much_they_paid_each_other = as.numeric(Relationship_Information$how_much_they_paid_each_other)
Then, I ran a standard K-Means Clustering algorithm (on the first dataset) and plotted the results:
# Method 1 : simple k means analysis with 2 clusters on Personal Information dataset
cl <- kmeans(Personal_Information[,c(2:4)], 2)
plot(Personal_Information, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
This is how I normally would have treated this problem. Now, I want to see if I can use graph clustering with this type of problem.
First, I created a weighted correlation network (http://www.sthda.com/english/articles/33-social-network-analysis/136-network-analysis-and-manipulation-using-r/)
First, I created the weighted correlation network (using the first dataset):
res.cor <- Personal_Information[, c(2:4)] %>%
t() %>% correlate() %>%
shave(upper = TRUE) %>%
stretch(na.rm = TRUE) %>%
filter(r >= 0.8)
graph <- graph.data.frame(res.cor, directed=F)
graph <- simplify(graph)
plot(graph)
Then, I ran the graph clustering algorithm:
#run graph clustering (also called communiy dectection) on the correlation network
fc <- fastgreedy.community(graph)
V(graph)$community <- fc$membership
nodes <- data.frame(id = V(graph)$name, title = V(graph)$name, group = V(graph)$community)
nodes <- nodes[order(nodes$id, decreasing = F),]
edges <- get.data.frame(graph, what="edges")[1:2]
visNetwork(nodes, edges) %>%
visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
This seems to work - but I am not sure if it is the optimal way to approach this porblem.
Can someone provide some advice? Have I overcomplicated this problem?
Thanks