How to traverse a tree from sklearn AgglomerativeClustering?

I have a numpy text file array at: https://github.com/alvations/anythingyouwant/blob/master/WN_food.matrix

It's a distance matrix between terms and each other, my list of terms are as such: http://pastebin.com/2xGt7Xjh

I used the following code to generate a hierarchical clustering:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

matrix = np.loadtxt('WN_food.matrix')
n_clusters = 518
model = AgglomerativeClustering(n_clusters=n_clusters,
                                linkage="average", affinity="cosine")
model.fit(matrix)

To get the clusters for each term, I could have done:

for term, clusterid in enumerate(model.labels_):
    print(term, clusterid)

But how do I traverse the tree that the AgglomerativeClustering outputs?

Is it possible to convert it into a scipy dendrogram (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html)? And after that how do I traverse the dendrogram?

Spiritualist answered 9/12, 2014 at 18:54 Comment(6)
The documentation would suggest looking at the children_ attribute of model.Hoberthobey
I used children_ and it gives me pairs of node IDs, so it's not a traversal, it just returns children. I have no idea what those pairs are, and the node numbers in children_ go beyond my number of samples...Spiritualist
A full hierarchical clustering of n objects produces a tree with 2n - 1 nodes. As the documentation says: "Values less than n_samples refer to leaves of the tree. A greater value i indicates a node with children children_[i - n_samples]". That should be sufficient information to traverse the tree.Hoberthobey
@jme, pardon the noobie-ness, but what does the documentation mean?Spiritualist
Each node in the tree is assigned an ID, i. If the ID is less than the number of input objects n_samples, then the node is a leaf. Otherwise it's an internal node, and it joins two other nodes. The two nodes joined by node i are found in children_[i - n_samples]. As an aside, if your goal is to convert this to a scipy dendrogram, why not just use scipy.cluster.hierarchy.linkage rather than sklearn?Hoberthobey
I'll answer this last question since I found this page because of the problem. scipy.cluster.hierarchy.linkage has a problem with throwing seg faults with large distance matrices. Apparently it is a known problem since 2012 that no one is interested in fixing. github.com/scipy/scipy/issues/2089Ezana

I've answered a similar question for sklearn.cluster.ward_tree: How do you visualize a ward tree from sklearn.cluster.ward_tree?

AgglomerativeClustering outputs the tree in the same way, in the children_ attribute. Here's an adaptation of the code in the ward tree question for AgglomerativeClustering. It outputs the structure of the tree in the form (node_id, left_child, right_child) for each node of the tree.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
import itertools

X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])
model = AgglomerativeClustering(linkage="average", affinity="cosine")
model.fit(X)

ii = itertools.count(X.shape[0])
[{'node_id': next(ii), 'left': x[0], 'right':x[1]} for x in model.children_]

https://stackoverflow.com/a/26152118
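As for converting the tree into a scipy dendrogram: newer scikit-learn than this answer used (0.21+ for distance_threshold, which populates model.distances_) makes this possible by assembling a scipy linkage matrix from children_, distances_, and per-merge cluster sizes. A hedged sketch on small random data:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# distance_threshold=0 with n_clusters=None makes sklearn compute the full
# tree and fill in model.distances_ (scikit-learn >= 0.21)
model = AgglomerativeClustering(n_clusters=None, distance_threshold=0,
                                linkage="average").fit(X)

# A scipy linkage matrix has one row per merge: [left, right, distance, size]
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, (left, right) in enumerate(model.children_):
    size = 0
    for child in (left, right):
        size += 1 if child < n_samples else counts[child - n_samples]
    counts[i] = size

linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]).astype(float)

# no_plot=True returns the tree layout without drawing anything
tree = dendrogram(linkage_matrix, no_plot=True)
print(sorted(tree["leaves"]))   # [0, 1, 2, 3, 4, 5]
```

Drop no_plot=True (with matplotlib available) to draw the dendrogram; the returned dict also gives the leaf ordering and link coordinates for manual traversal.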

Constipation answered 15/12, 2014 at 19:58 Comment(4)
is there a way to know which item is in the node?Spiritualist
how does it relate to the cluster labels model.labels_?Spiritualist
The node number is also an index for the data vector for each leaf of the tree. For example {'left': 1, 'right': 2, 'node_id': 10} node 10 has leaves 1 and 2 as children. X[1] is the data vector for leaf 1.Constipation
You can also do dict(enumerate(model.children_, model.n_leaves_)), which will give you a dictionary where the each key is the ID of a node and the value is the pair of IDs of its children.Jaquesdalcroze

Adding to A.P.'s answer, here is code that will give you a dictionary of membership: members[node_id] lists all the data point indices (0 to n_points - 1) that fall under that node.

on_split is a simple reformat of A.P.'s clusters that gives the two child nodes that form when node_id is split.

up_merge tells which node a given node_id merges into, and which sibling node it is combined with in that merge.

import copy
import itertools

# data_x is the input array and fit_cluster the fitted AgglomerativeClustering
ii = itertools.count(data_x.shape[0])
clusters = [{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in fit_cluster.children_]

n_points = data_x.shape[0]
members = {i: [i] for i in range(n_points)}
for cluster in clusters:
    node_id = cluster["node_id"]
    members[node_id] = copy.deepcopy(members[cluster["left"]])
    members[node_id].extend(copy.deepcopy(members[cluster["right"]]))

on_split = {c["node_id"]: [c["left"], c["right"]] for c in clusters}
up_merge = {c["left"]: {"into": c["node_id"], "with": c["right"]} for c in clusters}
up_merge.update({c["right"]: {"into": c["node_id"], "with": c["left"]} for c in clusters})
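For concreteness, a self-contained sketch of the same membership computation on tiny random data (the array data_x, the fitted model fit_cluster, and the 4-point dataset are assumptions of this sketch, not the question's matrix):

```python
import copy
import itertools

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
data_x = rng.normal(size=(4, 3))                 # 4 points -> leaf nodes 0..3
fit_cluster = AgglomerativeClustering(n_clusters=2).fit(data_x)

ii = itertools.count(data_x.shape[0])
clusters = [{'node_id': next(ii), 'left': x[0], 'right': x[1]}
            for x in fit_cluster.children_]

members = {i: [i] for i in range(data_x.shape[0])}
for cluster in clusters:
    members[cluster['node_id']] = (copy.deepcopy(members[cluster['left']])
                                   + copy.deepcopy(members[cluster['right']]))

# The last merge (highest node_id) is the root and contains every point
root = data_x.shape[0] + len(clusters) - 1       # node 6 for 4 points
print(sorted(members[root]))                     # [0, 1, 2, 3]
```

Each internal node's member list is just the concatenation of its children's lists, so building the dictionary in merge order (ascending node_id) guarantees both children are already present.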
Agathy answered 10/2, 2019 at 16:20 Comment(0)
