How to compare clusters?

Asked 25/7, 2013 at 19:19 Answered 22/11, 2019 at 9:2

Solved python algorithm cluster-analysis

Hopefully this can be done with python! I used two clustering programs on the same data and now have a cluster file from both. I reformatted the files so that they look like this:

Cluster 0:
Brucellaceae(10)
    Brucella(10)
        abortus(1)
        canis(1)
        ceti(1)
        inopinata(1)
        melitensis(1)
        microti(1)
        neotomae(1)
        ovis(1)
        pinnipedialis(1)
        suis(1)
Cluster 1:
    Streptomycetaceae(28)
        Streptomyces(28)
            achromogenes(1)
            albaduncus(1)
            anthocyanicus(1)

etc.

These files contain bacterial species info. So I have the cluster number (Cluster 0), then right below it 'family' (Brucellaceae) and the number of bacteria in that family (10). Under that is the genera found in that family (name followed by number, Brucella(10)) and finally the species in each genera (abortus(1), etc.).

My question: I have 2 files formatted in this way and want to write a program that will look for differences between the two. The only problem is that the two programs cluster in different ways, so two cluster may be the same, even if the actual "Cluster Number" is different (so the contents of Cluster 1 in one file might match Cluster 43 in the other file, the only different being the actual cluster number). So I need something to ignore the cluster number and focus on the cluster contents.

Is there any way I could compare these 2 files to examine the differences? Is it even possible? Any ideas would be greatly appreciated!

Lisk answered 25/7, 2013 at 19:19 Comment(13)

Parse into classes and compare objects of that class – Oglethorpe 25/7, 2013 at 19:22

I suppose you can have several families in one cluster, right? – Embrey 25/7, 2013 at 19:24

Using diff still might be the easiest solution. – Platinumblond 25/7, 2013 at 19:26

@TimPietzcker Yes, there could be more than one family per cluster. – Lisk 25/7, 2013 at 19:41

@MarkusUnterwaditzer I looked into diff but I'm worried it won't have the flexibility I need for ignoring the cluster numbers – Lisk 25/7, 2013 at 19:42

To find difference of two, first consider the common. With that as a guideline, create a parser in any language with string processing capability (and Python is indeed a good choice) to extract values into instance fields or simply parse the content into built-in data types like dict and let the tools dict offers to do the job for you. – Southwestwardly 25/7, 2013 at 19:43

@limelights not to sound stupid, but what are classes? is it like creating large groups? – Lisk 25/7, 2013 at 19:44

They are containers of behavior and data. – Oglethorpe 25/7, 2013 at 19:46

@wooztking Ideally I'd like to be able to say something like "70% of species were in the same genera across both files, but only 30% of genera were in the same families" (I made those stats up) or something like that. Do you think dict would be able to help me with that? – Lisk 25/7, 2013 at 19:48

@Lisk dict may or may not be able to. But the idea here is that if you're able to parse the data and represent them as dict, your first step is solid. Whether dict would satisfy your needs or not, really depends on the actual use cases you have. From the ones you gave, though vague, I'm confident that dict is capable to do it for you. Btw, @limelights I can't help but to +1 for your definition of classes. – Southwestwardly 25/7, 2013 at 19:50

lol, it was the most simple explanation i could muster. (Normally I explain them as blueprints as the objects created from the classes are the real container of data). – Oglethorpe 25/7, 2013 at 19:53

1. Reduce clusters to canonical format. 2. Sort the clusters in each file. 3. diff – Saleable 25/7, 2013 at 19:59

Have you read the Wikipedia page on cluster analysis? It discusses external evaluation measures, which compare two clusterings. – Crooks 26/7, 2013 at 12:30

Given:

file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')

Is this what you need?

def parse_file(open_file):
    result = []

    for line in open_file:
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            levels = ['','','']
        item = line.lstrip().split('(', 1)[0]
        levels[indent_level - 1] = item
        if indent_level == 3:
            result.append('.'.join(levels))
    return result

data1 = set(parse_file(file1))
data2 = set(parse_file(file2))

differences = [
    ('common elements', data1 & data2),
    ('missing from file2', data1 - data2),
    ('missing from file1', data2 - data1) ]

To see the differences:

for desc, items in differences:
    print desc
    print 
    for item in items:
        print '\t' + item
    print

prints

common elements

    giant.red.brick
    tiny.blue.candy
    tiny.blue.flower

missing from file2

    tiny.green.dot
    giant.red.apple

missing from file1

    giant.red.tomato

Pomiferous answered 25/7, 2013 at 20:24 Comment(2)

+1 for a good solution even though I, myself, would have preferred to have seen a bit more of a challenge for the OP ;) – Oglethorpe 25/7, 2013 at 20:31

Thanks for posting this answer! This is close to what I need. The problem though is that when I run it, everything shows up under common elements because both files have the exact same content, it's just HOW that content is sorted that matters (make sense?). For example, in my first file, Cluster 4 contains the bacterium Sphingomonadaceae.Blastomonas.natatoria and Sphingomonadaceae.Sphingomonas.ursincola. In my second file, these two bacterium are in different clusters (11 and 23). So I'd want to be notified of this change. Hopefully that made sense! – Lisk 17/8, 2013 at 4:58

So just for help, as I see lots of different answers in the comment, I'll give you a very, very simple implementation of a script that you can start from.

Note that this does not answer your full question but points you in one of the directions in the comments.

Normally if you have no experience I'd argue to go a head and read up on Python (which i'll do anyways, and i'll throw in a few links in the bottom of the answer)

On to the fun stuffs! :)

class Cluster(object):
  '''
  This is a class that will contain your information about the Clusters.
  '''
  def __init__(self, number):
    '''
    This is what some languages call a constructor, but it's not.
    This method initializes the properties with values from the method call.
    '''
    self.cluster_number = number
    self.family_name = None
    self.bacteria_name = None
    self.bacteria = []

#This part below isn't a part of the class, this is the actual script.
with open('bacteria.txt', 'r') as file:
  cluster = None
  clusters = []
  for index, line in enumerate(file):
    if line.startswith('Cluster'):
      cluster = Cluster(index)
      clusters.append(cluster)
    else:
      if not cluster.family_name:
        cluster.family_name = line
      elif not cluster.bacteria_name:
        cluster.bacteria_name = line
      else:
        cluster.bacteria.append(line)

I wrote this as dumb and overly simple as I could without any fancy stuff and for Python 2.7.2 You could copy this file into a .py file and run it directly from command line python bacteria.py for example.

Hope this helps a bit and don't hesitate to come by our Python chat room if you have any questions! :)

Oglethorpe answered 25/7, 2013 at 20:21 Comment(0)

Given:

file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')

Is this what you need?

def parse_file(open_file):
    result = []

    for line in open_file:
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            levels = ['','','']
        item = line.lstrip().split('(', 1)[0]
        levels[indent_level - 1] = item
        if indent_level == 3:
            result.append('.'.join(levels))
    return result

data1 = set(parse_file(file1))
data2 = set(parse_file(file2))

differences = [
    ('common elements', data1 & data2),
    ('missing from file2', data1 - data2),
    ('missing from file1', data2 - data1) ]

To see the differences:

for desc, items in differences:
    print desc
    print 
    for item in items:
        print '\t' + item
    print

prints

common elements

    giant.red.brick
    tiny.blue.candy
    tiny.blue.flower

missing from file2

    tiny.green.dot
    giant.red.apple

missing from file1

    giant.red.tomato

Pomiferous answered 25/7, 2013 at 20:24 Comment(2)

+1 for a good solution even though I, myself, would have preferred to have seen a bit more of a challenge for the OP ;) – Oglethorpe 25/7, 2013 at 20:31

You have to write some code to parse the file. If you ignore the cluster, you should be able to distinguish between family, genera and species based on indentation.

The easiest way it to define a named tuple:

import collections
Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])

You can make in instance of this object like this:

b = Bacterium('Brucellaceae', 'Brucella', 'canis')

Your parser should read a file line by line, and set the family and genera. If it then finds a species, it should add a Bacterium to a list;

with open('cluster0.txt', 'r') as infile:
    lines = infile.readlines()
family = None
genera = None
bacteria = []
for line in lines:
    # set family and genera.
    # if you detect a bacterium:
    bacteria.append(Bacterium(family, genera, species))

Once you have a list of all bacteria in each file or cluster, you can select from all the bacteria like this:

s = [b for b in bacteria if b.genera == 'Streptomycetaceae']

Alderney answered 25/7, 2013 at 20:33 Comment(0)

Comparing two clusterings is not trivial task and reinventing the wheel is unlikely to be successful. Check out this package which has lots of different cluster similarity metrics and can compare dendrograms (the data structure you have).

The library is called CluSim and can be found here: https://github.com/Hoosier-Clusters/clusim/

Alterant answered 22/11, 2019 at 9:2 Comment(1)

Please provide some essential details from link, because links are prone to get expired/redirect in future. – Suspensory 22/11, 2019 at 9:23

After learning so much from Stackoverflow, finally I have an opportunity to give back! A different approach from those offered so far is to relabel clusters to maximize alignment, and then comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1=[0,0,1,1,2,2] and another assigns L2=[2,2,0,0,1,1], you want these two labelings to be equivalent since L1 and L2 are essentially segmenting items into clusters identically. This approach relabels L2 to maximize alignment, and in the example above, will result in L2==L1.

I found a soution to this problem in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012." and below is an implementation in Python using numpy. I'm relatively new to Python, so there may be better implementations, but I think this gets the job done:

def alignClusters(clstr1,clstr2):
"""Given 2 cluster assignments, this funciton will rename the second to 
   maximize alignment of elements within each cluster. This method is 
   described in in Menéndez, Héctor D. A genetic approach to the graph and 
   spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
   are consecutive integers starting with zero)

   INPUTS:
   clstr1 - The first clustering assignment
   clstr2 - The second clustering assignment

   OUTPUTS:
   clstr2_temp - The second clustering assignment with clusters renumbered to
   maximize alignment with the first clustering assignment """
K = np.max(clstr1)+1
simdist = np.zeros((K,K))

for i in range(K):
    for j in range(K):
        dcix = clstr1==i
        dcjx = clstr2==j
        dd = np.dot(dcix.astype(int),dcjx.astype(int))
        simdist[i,j] = (dd/np.sum(dcix!=0) + dd/np.sum(dcjx!=0))/2
mask = np.zeros((K,K))
for i in range(K):
    simdist_vec = np.reshape(simdist.T,(K**2,1))
    I = np.argmax(simdist_vec)
    xy = np.unravel_index(I,simdist.shape,order='F')
    x = xy[0]
    y = xy[1]
    mask[x,y] = 1
    simdist[x,:] = 0
    simdist[:,y] = 0
swapIJ = np.unravel_index(np.where(mask.T),simdist.shape,order='F')
swapI = swapIJ[0][1,:]
swapJ = swapIJ[0][0,:]
clstr2_temp = np.copy(clstr2)
for k in range(swapI.shape[0]):
    swapj = [swapJ[k]==i for i in clstr2]
    clstr2_temp[swapj] = swapI[k]
return clstr2_temp

Opec answered 25/2, 2018 at 3:52 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags