How to return top n biggest cluster in Neo4j?
Asked Answered
I

1

5

in my database, the graph looks somehow like this:

cluster

I want to find the top 3 biggest cluster in my data. A cluster is a collection of nodes connected to each other, the direction of the connection is not important. As can be seen from the picture, the expected result should have 3 clusters with size 3 2 2 respectively.

Here is what I came up with so far:

MATCH (n)
RETURN n, size((n)-[*]-()) AS cluster_size
ORDER BY cluster_size  DESC
LIMIT 100

However, it has 2 problems:

  1. I think the query is wrong because the size() function does not return the number of nodes in a cluster as I want, but the number of sub-graph matching the pattern instead.
  2. The LIMIT clause limits the number of nodes to return, not taking the top result. That's why I put 100 there.

What should I do now? I'm stuck :( Thank you for your help.


UPDATE

Thanks to Bruno Peres' answer, I'm able to try algo.unionFind query in Neo4j Graph Algorithm. I can find the size of my connected components using this query:

CALL algo.unionFind.stream()
YIELD nodeId,setId
RETURN setId,count(*) as size_of_component
ORDER BY size_of_component DESC LIMIT 20;

And here is the result: Connected Components

But that's all I know. I cannot get any information about the nodes in each component to visualize them. The collect(nodeId) takes forever because the top 2 components are too large. And I know it doesn't make sense to visualize those large components, but how about the third one? 235 nodes are fine to render.

Interblend answered 6/3, 2018 at 15:42 Comment(0)
G
6

I think you are looking for Connected Componentes. The section about connected components of Neo4j Graph Algorithms User Guide says:

Connected Components or UnionFind basically finds sets of connected nodes where each node is reachable from any other node in the same set. In graph theory, a connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the graph.

If this is your case you can install Neo4j Graph Algorithms and use algo.unionFind. I reproduced your scenario with this sample data set:

create (x), (y),
(a), (b), (c),
(d), (e),
(f), (g),
(a)-[:type]->(b), (b)-[:type]->(c), (c)-[:type]->(a),
(d)-[:type]->(e),
(f)-[:type]->(g)

Then running algo.unionFind:

// call unionFind procedure
CALL algo.unionFind.stream('', ':type', {})
YIELD nodeId,setId
// groupBy setId, storing all node ids of the same set id into a list
WITH setId, collect(nodeId) as nodes
// order by the size of nodes list descending
ORDER BY size(nodes) DESC
LIMIT 3 // limiting to 3
RETURN setId, nodes

The result will be:

╒═══════╤══════════╕
│"setId"│"nodes"   │
╞═══════╪══════════╡
│2      │[11,12,13]│
├───────┼──────────┤
│5      │[14,15]   │
├───────┼──────────┤
│7      │[16,17]   │
└───────┴──────────┘

EDIT

From comments:

how can I get all nodeId of a specific setId? For example, from my screenshot above, how can I get all nodeId of the setId 17506? That setId has 235 nodes and I want to visualize them.

  1. Run call CALL algo.unionFind('', ':type', {write:true, partitionProperty:"partition"}) YIELD nodes RETURN *. This statement will create apartition` property for each node, containing the partition ID the node is part of.
  2. Run this statement to get the top 3 partitions: match (node) with node.partition as partition, count(node) as ct order by ct desc limit 3 return partition, ct.
  3. Now you can get all nodes of each top 3 partitions individually with match (node {partition : 17506}) return node, using the partition ID returned in the second query.
Greenberg answered 6/3, 2018 at 16:3 Comment(13)
Thank you for your answer! It's very helpful. Sorry for my late reply because I didn't have any chance to test it. I will let you know as soon as possible :)Interblend
I tried your query with a small modification. I use algo.unionFind.stream() because I don't care about the node label or relationship type. However, the query took so long that I could not get the result. My database has around 860k nodes and 2.5 million relationships. Is it too big for that kind of query? :(Interblend
I think the collect(nodeId) freezes the query because as I inspected, the largest component has around 833k nodes.Interblend
Hello @AnhTriet. Yes, maybe collect is degrading your query performance. Have you tried removing collect to check performance?Greenberg
Yes, without collect, it works fine. Please check my edited question :)Interblend
@AnhTriet Maybe you can use algo.unionFind instead of algo.unionFind.stream. This way you can specify a property name to store the partition id in each node. Then you can query your dataset to visualize the partitions by id.Greenberg
@AnhTriet something like CALL algo.unionFind('', ':type', {write:true, partitionProperty:"partition"}) YIELD nodes RETURN *. Then you can query the dataset using match(:Node {partition : 1000}) return node. Maybe a index for partition property can improve performance. @AnhTrietGreenberg
Is there any way to get all (or at least 1) nodeId from a setId?Interblend
@AnhTriet sorry, I did not understand. Can you explain with more details? Thanks.Greenberg
Sorry, I mean: how can I get all nodeId of a specific setId? For example, from my screenshot above, how can I get all nodeId of the setId 17506? That setId has 235 nodes and I want to visualize them.Interblend
Hi @AnhTriet, sorry for my late answer. I updated my answer, please take a look.Greenberg
Hi @Bruno, yes, thank you so much for your helpful support! :)Interblend
How can I return node labels, instead of nodeIds? Each node has one label.Jubal

© 2022 - 2024 — McMap. All rights reserved.