R: K Means Clustering vs Community Detection Algorithms (Weighted Correlation Network) - Have I overcomplicated this question?
I have data that looks like this: https://i.stack.imgur.com/HmpRl.jpg

The first dataset is a standard-format table containing a list of people and their financial properties.

The second dataset contains "relationships" between these people - how much they paid to each other, and how much they owe each other.

I am interested in learning more about network- and graph-based clustering, but I am trying to better understand what types of situations require network-based clustering, i.e. I don't want to use graph clustering where it's not required (to avoid a "square peg, round hole" situation).

Using R, first I created some fake data:

library(corrr)
library(dplyr)
library(igraph)
library(visNetwork)
library(stats)

# create first data set

Personal_Information <- data.frame(
  "name" = c("John", "Jack", "Jason", "Jim", "Julian", "Jack", "Jake", "Joseph"),
  "age" = c("41", "33", "24", "66", "21", "66", "29", "50"),
  "salary" = c("50000", "20000", "18000", "66000", "77000", "0", "55000", "40000"),
  "debt" = c("10000", "5000", "4000", "0", "20000", "5000", "0", "1000")
)

# the values were entered as strings, so convert them to numeric
Personal_Information$age = as.numeric(Personal_Information$age)
Personal_Information$salary = as.numeric(Personal_Information$salary)
Personal_Information$debt = as.numeric(Personal_Information$debt)
# create second data set
Relationship_Information <- data.frame(
  "name_a" = c("John", "John", "John", "Jack", "Jack", "Jack", "Jason", "Jason", "Jim", "Jim", "Jim", "Julian", "Jake", "Joseph", "Joseph"),
  "name_b" = c("Jack", "Jason", "Joseph", "John", "Julian", "Jim", "Jim", "Joseph", "Jack", "Julian", "John", "Joseph", "John", "Jim", "John"),
  "how_much_they_owe_each_other" = c("10000", "20000", "60000", "10000", "40000", "8000", "0", "50000", "6000", "2000", "10000", "10000", "50000", "12000", "0"),
  "how_much_they_paid_each_other" = c("5000", "40000", "120000", "20000", "20000", "8000", "0", "20000", "12000", "0", "0", "0", "50000", "0", "0")
)

Relationship_Information$how_much_they_owe_each_other = as.numeric(Relationship_Information$how_much_they_owe_each_other)
Relationship_Information$how_much_they_paid_each_other = as.numeric(Relationship_Information$how_much_they_paid_each_other)

Then, I ran a standard K-Means Clustering algorithm (on the first dataset) and plotted the results:

# Method 1: simple k-means analysis with 2 clusters on the Personal_Information dataset
cl <- kmeans(Personal_Information[, c(2:4)], 2)
plot(Personal_Information[, c(2:4)], col = cl$cluster)  # pairs plot of the numeric columns
points(cl$centers, col = 1:2, pch = 8, cex = 2)

This is how I normally would have treated this problem. Now, I want to see if I can use graph clustering with this type of problem.

First, I created a weighted correlation network from the first dataset (following http://www.sthda.com/english/articles/33-social-network-analysis/136-network-analysis-and-manipulation-using-r/):

res.cor <- Personal_Information[, c(2:4)] %>%
  t() %>%
  correlate() %>%
  shave(upper = TRUE) %>%
  stretch(na.rm = TRUE) %>%
  filter(r >= 0.8)

graph <- graph.data.frame(res.cor, directed = FALSE)
graph <- simplify(graph)
plot(graph)

Then, I ran the graph clustering algorithm:

# run graph clustering (also called community detection) on the correlation network
fc <- fastgreedy.community(graph)
V(graph)$community <- fc$membership
nodes <- data.frame(id = V(graph)$name, title = V(graph)$name, group = V(graph)$community)
nodes <- nodes[order(nodes$id, decreasing = FALSE), ]
edges <- get.data.frame(graph, what = "edges")[1:2]

visNetwork(nodes, edges) %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)

This seems to work, but I am not sure if it is the optimal way to approach this problem.

Can someone provide some advice? Have I overcomplicated this problem?

Thanks

Cyclopedia answered 15/11, 2020 at 21:21 Comment(1)
If you are asking about "math and logic" and not programming, then your question is not appropriate for Stack Overflow. Instead you should ask someplace like Cross Validated or Data Science. – Gomer
(First, some background to frame the problem from what you describe.) You have two datasets and therefore two data structures: Personal_Information and Relationship_Information. You have a set of entities that you can treat as unique (note that the sample data actually repeats the name "Jack", so in practice you would want a unique ID per person). Since you also have connectivity information between these entities, you can refer to them as nodes in a network, and their interconnectivity can produce a network containing communities that a community detection algorithm can uncover/allocate/detect. So,

  • Personal_Information, describes each person (node)
  • Relationship_Information, describes their connectivity/relationship (edges)

In your example code, you appear to be using only graph data built from Personal_Information (res.cor <- Personal_Information[, c(2:4)] %>% ...) and not Relationship_Information. This means you are building relationships between the variables intrinsic to each person as a node, rather than from the data produced by their connected interactions. To understand what you are doing here: your direction is like saying, "I am going to produce a network between the personality traits of people and ignore the associations between the people themselves, even though I have that data. I am going to look at how those personality features correlate and then see which groups of feature values follow each other (correlate in groups)."

Finding the correlations between the features of nodes (persons) over multiple people is fine, producing a matrix of that information is also fine, and producing a graph/network from that matrix is fine as well. What you obtain from the resulting graph (which you call graph) via fc <- fastgreedy.community(graph) is which groups of variables are co-correlated across people. E.g. var1 and var2 have a strong positive correlation between themselves, but var2 and var3 have a strong negative correlation; the edge between var2 and var3 will push them into separate communities, and will also push var1 into a separate community from var3, because var1 is tied strongly to var2 (a close friend). How is this information useful? It can help you understand how the variables exist as groups: if a new person has a low value of var2 and you don't know their var1 or var3, you would expect var1 to be low as well and var3 to be high. If you took the covariance of the person data, you could take the eigenvectors and effectively do PCA, which gives you vectors carrying information of this nature.
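(As a minimal sketch of that last point, using the Personal_Information data frame from the question: prcomp() diagonalises the covariance matrix, or the correlation matrix when scale. = TRUE, and the loadings show which variables move together.)

# Minimal PCA sketch on the numeric person features (age, salary, debt).
# With scale. = TRUE this is an eigendecomposition of the correlation matrix.
pca <- prcomp(Personal_Information[, c("age", "salary", "debt")], scale. = TRUE)
pca$rotation   # loadings: variables that co-vary load on the same component
summary(pca)   # proportion of variance explained by each component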

But this is not producing information about the network edges you actually observed/measured in your Relationship_Information data, which describes the community information rather than the node information. That dataset looks like an adjacency list: a data structure in which the first two columns give the source node (col1) and the destination node (col2), with an edge weight in col3. If the same pairs of names appear with col1 and col2 swapped and with the same edge weight, the network has symmetric edges (undirected); otherwise it is directed. Since your data has two edge-weight columns (col3 and col4), you can either produce one network from col1, col2, col3 and another from col1, col2, col4, or you can produce a single network with

  • adj_list1 = name_a, name_b, (how_much_they_paid_each_other - how_much_they_owe_each_other), or
  • adj_list2 = name_a, name_b, (how_much_they_paid_each_other / how_much_they_owe_each_other)

and how you define an edge from those values is up to you. You then want to produce a network from adj_list1 or adj_list2 and apply community detection to that network (a sketch follows below). Think of the payments in that dataset as interactions connecting people together, similar to likes and mentions on social media. The community results are the labels for the groups that are economically tied according to which edges you use, and you can apply an algorithm like the Louvain algorithm to obtain them.
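For what it's worth, a minimal sketch of that pipeline in R, assuming the Relationship_Information data frame from the question; the choice of weight column (or the difference/ratio definitions above) is yours:

library(igraph)

# Sketch: build the payment network and run Louvain community detection.
# The edge weight here is simply the payments column; zero-weight rows are
# dropped so that every edge represents a real interaction.
rel <- Relationship_Information
rel$weight <- rel$how_much_they_paid_each_other   # or paid - owed, etc.
rel <- rel[rel$weight > 0, ]

# the first two columns are taken as the edge list; 'weight' becomes an edge attribute
pay_graph <- graph_from_data_frame(rel[, c("name_a", "name_b", "weight")],
                                   directed = FALSE)

# cluster_louvain() works on undirected graphs and picks up the 'weight'
# edge attribute automatically
louvain <- cluster_louvain(pay_graph)
membership(louvain)   # community label for each person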

But neither approach uses both the node data and the edge data (person data and exchange data) together; they are answering different questions.

Applying K-Means to the node feature data is answering a different question from the community detection algorithm.

  • K-Means: the variable values for each person are not evenly distributed; they are concentrated in K dense regions, with sparse samples in the regions in between. So we have "types" of people.
  • Community detection: ignoring what the people's features are, group people by their interactions to see how many groups there are; if people exchange money among themselves, they do so with a focus on a sub-group.

So those questions are independent: the clustering and the community detection use independently collected datasets. The spreadsheets do not depend upon each other, nor does the data. That does not mean the data has no crossing information; you can have those features affect the edges. But when presenting it, you have two separate investigations.

(Another answer here mentions fusion-based approaches for analyzing the node data and the edge data together, but it does not appear that that is your question. Are you trying to use both datasets together? If so, the easiest route is an approach with good implementations available, such as 'graph neural networks'; the SGC (simple graph convolutional network) is a good recommendation. Although it sounds intimidating, you would feed it the adjacency matrix made from the payment network together with the node attributes/features. Python's DGL library is great for this, and you can do it unsupervised with scaled data if you want.)

Killigrew answered 22/2, 2021 at 16:36 Comment(0)
Perhaps you might be interested in reading about "Fusion-Based Approaches for Community Detection" (https://link.springer.com/chapter/10.1007/978-3-030-44584-3_24). These fusion-based methods have apparently been designed specifically to take node attributes into consideration.

This might be able to help as well: https://www.nature.com/articles/srep30750

Olympian answered 27/11, 2020 at 20:8 Comment(0)
"I am trying to better understand what type of situations require network based clustering"

This depends entirely on your problem domain and the questions you are asking; you really need to have focused questions about the data that you are trying to answer. That being said, there is a set of clustering techniques you can apply that can use both edge weights and node attributes: Hierarchical Clustering.

Edge and node attributes come into play in how you determine the similarity/dissimilarity matrix that drives the clustering. Note that there are many, many implementations of this; take your time and find one that you can apply to your data and problem set. A rough sketch of the idea follows below.
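As a rough sketch only, under stated assumptions: unique names per person (the question's data repeats "Jack", so it is de-duplicated here), shortest-path hop count as the network distance, and an arbitrary 50/50 blend of attribute and network distance:

library(igraph)

# Sketch: hierarchical clustering on a dissimilarity that blends scaled node
# attributes with shortest-path distance on the payment network.
people <- Personal_Information[!duplicated(Personal_Information$name), ]
attrs  <- scale(people[, c("age", "salary", "debt")])
rownames(attrs) <- people$name
d_attr <- dist(attrs)                                             # attribute dissimilarity

g <- graph_from_data_frame(Relationship_Information[, 1:2], directed = FALSE)
d_net <- as.dist(distances(g)[rownames(attrs), rownames(attrs)])  # hop counts

# normalise both distances to [0, 1] and blend them (the 0.5/0.5 weighting
# is an arbitrary illustration, not a recommendation)
d_all <- 0.5 * (d_attr / max(d_attr)) + 0.5 * (d_net / max(d_net))
hc <- hclust(d_all, method = "average")
plot(hc)            # dendrogram
cutree(hc, k = 2)   # cluster labels for k = 2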

Navar answered 27/11, 2020 at 15:29 Comment(0)
