What is the significance of the semi clustering formula in the Google Pregel paper?

Asked 2/7, 2012 at 12:42 Answered 7/7, 2012 at 23:30

Solved algorithm cluster-analysis graph-theory clique

Semi clustering algorithm is mentioned in the Google Pregel paper. The score of a semi cluster is calculated using the below formula

enter image description here

where

Ic is sum of the weights of all the internal edges
Bc is the sum of the weights of all the boundary edges
Vc is the number of vertices in the semi cluster and
fb is the boundary edge score factor (user defined between 0 and 1)

The algorithm was pretty straight forward, but I could not understand how the above formula was arrived. Note that the denominator is the number of edges possible between Vc number of vertices.

Could someone please explain it?

Riggs answered 2/7, 2012 at 12:42 Comment(5)

FWIW, you might (also) consider posting on cs.stackexchange.com. You might get better / different results there... – Triable 5/7, 2012 at 2:5

Thanks Reuben - I was not aware of that group. – Riggs 5/7, 2012 at 2:43

It seems a fairly standard optimisation criterion, of which there are hundreds of variants. In this case it rewards intra-cluster edges, punishes inter-cluster edges, and normalises by cluster size squared, so that clusters can probably only grow to a size commensurate with node degree locally in the cluster. It does not seem intrinsically tied to the algorithm itself as far as I can see and simply favours simplicity (in a good way I would say). – Oligarch 5/7, 2012 at 8:37

Could it be that Vc is actually the number of vertices in the semi cluster? If it really is the number of edges, then I don't understand how your later remark about the denominator makes sense. – Pros 8/7, 2012 at 11:37

@j_random_hacker: your interpretation is correct. – Oligarch 9/7, 2012 at 13:42

The score makes sense if you think about the quantity that it is meant to capture.

The problem being addressed here is figuring out what's the best way to place vertices of a graph into semi-clusters (simply a group of vertices where each vertex can be in more than one semi-cluster) with some upper bound on the total number of semi-cluster. So one method of finding the "best" way is to assign a score to any potential semi-cluster (in other words, to any arbitrary group of vertices). Then the problem becomes that of maximizing the total score.

So, semi-cluster are meant to capture cliques in a graph. For example in a social graph, a semi-cluster might be the members of a high school basketball team.

Thus more interior edges equates to a "better" semi-cluster. This explains the I_c in the numerator. Similarly, you want to have very few boundary edges since if there are a lot of boundary edges, then that means there is likely to be a better semi-group containing the one you are examining. This gives the -f_b * B_c in the numerator. f_b is simply a scaling factor so that you can tune how much penalty you want to assign the boundary edges.

The denominator is also kind of a scaling factor. It is used to normalize the semi-cluster scores so that small clusters don't get completely dominated by larger ones. An extreme example of this is if you consider the semi-group of everyone in the world. Clearly there are no boundary edges and tons of interior edges, but it is undoubtably a less useful semi-group than the high school basketball team.

Lactobacillus answered 5/7, 2012 at 8:31 Comment(0)

It is related to cliques.

V_c * (V_c - 1) is the number of edges in a clique of size V_c.

So if you take the sum over all edges in the group I_c, this is the appropriate normalization to get an arithmetic mean.

I.e. I_c / (V_c * (V_c - 1)) is the mean weight inside the clique.

Now the - f_B * B_c term is a penalty for outgoing edges. IMHO, it should only be divided by V_c, but that is personal taste, as I'd assume the expected outgoing edges to scale with the number of clique members, not with the square of this.

Upkeep answered 7/7, 2012 at 23:30 Comment(0)

Recommended topics

Hot tags