Which programming structure for clustering algorithm
I am trying to implement the following (divisive) clustering algorithm (a short form of the algorithm is presented below; the full description is available here):

Start with a sample x_i, i = 1, ..., n, regarded as a single cluster of n data points, and a dissimilarity matrix D defined for all pairs of points. Fix a threshold T for deciding whether or not to split a cluster.

  1. First determine the distance between all pairs of data points and choose a pair with the largest distance (Dmax) between them.

  2. Compare Dmax to T. If Dmax > T, then divide the single cluster in two by using the selected pair as the first elements of two new clusters. The remaining n - 2 data points are put into one of the two new clusters: x_l is added to the new cluster containing x_i if D(x_i, x_l) < D(x_j, x_l); otherwise it is added to the new cluster containing x_j.

  3. At the second stage, the values D(x_i, x_j) are found within one of the two new clusters to find the pair in that cluster with the largest distance Dmax between them. If Dmax < T, the division of the cluster stops and the other cluster is considered. The procedure then repeats on the clusters generated by this iteration.

The output is a hierarchy of clustered data records. I kindly ask for advice on how to implement this clustering algorithm.
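To make the question concrete, the structure I have in mind is roughly the following recursive sketch (the helper name divisive and the plain-list representation are placeholders of mine; it assumes a symmetric distance function d(a, b) and returns only the final partition, not the full hierarchy):

def divisive(points, d, T):
    '''Recursively split points until no cluster contains a pair farther apart than T.'''
    if len(points) < 2:
        return [points]
    # Step 1: find the pair with the largest distance
    best, dmax = None, -1.0
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            dab = d(points[a], points[b])
            if dab > dmax:
                best, dmax = (a, b), dab
    # Step 2: if the largest distance does not exceed the threshold, keep the cluster
    if dmax <= T:
        return [points]
    i, j = best
    c1, c2 = [points[i]], [points[j]]
    # Assign every other point to the closer of the two seed points
    for k, p in enumerate(points):
        if k in (i, j):
            continue
        (c1 if d(p, points[i]) < d(p, points[j]) else c2).append(p)
    # Step 3: recurse on both new clusters
    return divisive(c1, d, T) + divisive(c2, d, T)

print(divisive([1, 2, 2.5, 4, 5], lambda a, b: abs(a - b), T=1.5))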

EDIT 1: I attach a Python function that defines the distance (a correlation coefficient) and a snippet that finds the maximal distance in the data matrix.

# Read data from GitHub
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/nico/collectiveintelligence-book/master/blogdata.txt', sep = '\t', index_col = 0)
data = df.values.tolist()
data = data[1:10]

# Define correlation coefficient as distance of choice
from math import sqrt

def pearson(v1, v2):
  # Simple sums
  sum1 = sum(v1)
  sum2 = sum(v2)
  # Sums of the squares
  sum1Sq = sum([pow(v, 2) for v in v1])
  sum2Sq = sum([pow(v, 2) for v in v2]) 
  # Sum of the products
  pSum=sum([v1[i] * v2[i] for i in range(len(v1))])
  # Calculate r (Pearson score)
  num = pSum - (sum1 * sum2 / len(v1))
  den = sqrt((sum1Sq - pow(sum1,2) / len(v1)) * (sum2Sq - pow(sum2, 2) / len(v1)))
  if den == 0: return 0
  return num / den


# Find largest distance
dist = {}
max_dist = float('-inf')  # do not seed with pearson(data[0], data[0]): a point's self-correlation is already the maximum possible value
# Loop over upper triangle of data matrix
for i in range(len(data)):
  for j in range(i + 1, len(data)):
     # Compute distance for each pair
     dist_curr = pearson(data[i], data[j])
     # Store distance in dict
     dist[(i, j)] = dist_curr
     # Store max distance
     if dist_curr > max_dist:
       max_dist = dist_curr
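Two side notes on the snippet above (both are assumptions about intent, not part of the algorithm description): pearson() returns the correlation r itself, so if a dissimilarity is wanted the usual choice is 1 - r; and the pair attaining the maximum can be read straight off the dict, for example:

# Convert correlations to dissimilarities (1 - r) and find the most distant pair
dissim = {pair: 1 - r for pair, r in dist.items()}
max_pair = max(dissim, key=dissim.get)
print(max_pair, dissim[max_pair])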

EDIT 2: Pasted below are functions from Dschoni's answer.

# Euclidean distance
import numpy

def euclidean(x, y):
  x = numpy.array(x)
  y = numpy.array(y)
  return numpy.sqrt(numpy.sum((x - y)**2))

# Create distance matrix (upper triangle stored as a dict keyed by index pairs)
def dist_mat(data):
  dist = {}
  for i in range(len(data)):
    for j in range(i + 1, len(data)):
      dist[(i, j)] = euclidean(data[i], data[j])
  return dist


# Returns the (i, k) key of the pair with the largest distance
def my_max(d):
    return max(d, key=d.get)

# Sort function: split the records in rcd around the seed points i and k
list1 = []
list2 = []
def sort(rcd, i, k):
  list1.append(i)
  list2.append(k)
  for j in range(len(rcd)):
    if j in (i, k):
      continue  # the seeds are already placed
    if euclidean(rcd[j], rcd[i]) < euclidean(rcd[j], rcd[k]):
      list1.append(j)
    else:
      list2.append(j)
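For completeness, a quick usage sketch of these helpers (the sample points are made up; list1 and list2 end up holding the indices of the two subclusters):

rcd = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]]
d = dist_mat(rcd)
i, k = my_max(d)      # pair of indices with the largest distance
sort(rcd, i, k)       # fills the global list1 and list2
print(list1, list2)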

EDIT 3: When I run the code provided by @Dschoni the algorithm works as expected. I then modified the create_distance_list function so it can compute distances between multivariate data points, using the Euclidean distance. As a toy example I load the iris data and cluster only the first 50 instances of the dataset.

import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None, sep = ',')
df = df.drop(4, axis = 1)   # drop the species column
df = df[1:50]               # note: this keeps rows 1-49 (49 rows), not the first 50
data = df.values.tolist()

idl = list(range(len(data)))
dist = create_distance_list(data)
print(sort(dist, idl))

The result is as follows:

[[24], [17], [4], [7], [40], [13], [14], [15], [26, 27, 38], [3, 16, 39], [25], [42], [18, 20, 45], [43], [1, 2, 11, 46], [12, 37, 41], [5], [21], [22], [10, 23, 28, 29], [6, 34, 48], [0, 8, 33, 36, 44], [31], [32], [19], [30], [35], [9, 47]]

Some data points are still clustered together. I work around this by adding a small amount of random noise to the actual dictionary in the sort function:

# Add small random noise to every pairwise distance (this goes inside the sort
# function, right after the actual dictionary is built; numpy is imported as np)
for key in actual:    
  actual[key] +=  np.random.normal(0, 0.005)

Any idea how to solve this problem properly?
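One alternative I am considering instead of the noise hack (just a sketch; it assumes the multivariate create_distance_list from this edit and the sort function from the answer below): collapse exactly identical rows into one representative before clustering and expand them again afterwards, so the algorithm never has to separate points at distance zero.

from collections import defaultdict

# Group indices of exactly identical rows
groups = defaultdict(list)
for idx, row in enumerate(data):
    groups[tuple(row)].append(idx)

unique_rows = [list(key) for key in groups]   # one representative per group
clusters = sort(create_distance_list(unique_rows),
                list(range(len(unique_rows))), threshold=0)
# Map representative indices back to the original row indices
expanded = [[orig for u in cluster for orig in groups[tuple(unique_rows[u])]]
            for cluster in clusters]
print(expanded)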

Seicento answered 20/8, 2015 at 6:49 Comment(14)
I'm not entirely sure how to answer the question as asked. Could you expand on what you mean by "programming structure"? If you're referring to data structures, what part of your algorithm needs structures, and what are the requirements for them? If all you want is to repeatedly split a collection of elements without caring about how the data is ultimately represented, an array or linked list might work. Alternatively, you could consider some form of tree (like a binary search tree). – Epanaphora
I just need to repeatedly split the data points. I suspect that an array should be enough. Do you have any idea where/how to start? – Seicento
Do you have a programming language in mind? You have an algorithm, so you just need to convert the steps of the algorithm into some code. I'd say start by writing a function. – Epanaphora
I program in Python, but any language should be OK. I just need help with steps 2 and 3. – Seicento
Start by writing step 1 in Python, and include that in your question. This will make your question more complete and answerable. – Epanaphora
I attach the code from my initial attempts. – Seicento
Not an answer, but sum([pow(v, 2) for v in v1]) creates a list first and then sums it; sum(pow(v, 2) for v in v1) is sufficient. – Bozo
Is your goal to have a hierarchy of clusters with 2 data points in each fork? – Percussive
@Percussive yes, that is the goal. – Seicento
Some more logic questions: could you explain step 2 a little more? You want to sort the data into subclusters in such a way that each point lies closer to the first element. What if the distance is equal? What if there is more than one maximum distance? Let's use the Euclidean distance for a sample: data = [1, 2, 2.5, 4, 5], output1 = [1, 2, 2.5], output2 = [4, 5], output1.1 = [1], output1.2 = [2, 2.5]. Yes? No? – Percussive
If the distance is equal then we decide by random choice (same if there is more than one maximum distance). Your toy example is plausible. – Seicento
You don't have to calculate the distance more than once for each pair. Especially if the calculation is more complex, always work with the distance map. I'm updating my answer below in a few minutes. – Percussive
The question is not a programming question, but more an algorithm question: how should data points be separated when they have no distance between them? In my example below, the dataset [1, 1, 1] returns the output [[0, 1, 2]] BY DESIGN, as points with no distance always meet the threshold == 0 constraint. If you want an ultimate separation, it's no longer a question of clustering. – Percussive
Just curious: are you trying to learn to implement the algorithm, or just using it for your data? If just using, why not try scikit-learn.org/stable? They have all the clustering algorithms implemented. – Cauda
A proper working example for the euclidean distance:

import numpy as np  # for random number generation


def create_distance_list(l):
    '''Create a distance list for every
    unique tuple of pairs.'''
    dist = {}
    for i in range(len(l)):
        for k in range(i + 1, len(l)):
            dist[(i, k)] = abs(l[i] - l[k])
    return dist

def maximum(distance_dict):
    '''Return the key of the maximum value if unique,
    or a random key with the maximum value.'''
    maximum = max(distance_dict.values())
    max_key = [key for key, value in distance_dict.items() if value == maximum]
    if len(max_key) > 1:
        # randint's upper bound is exclusive (replaces the deprecated random_integers)
        random_key = np.random.randint(0, len(max_key))
        return (max_key[random_key],)
    else:
        return max_key

def construct_new_dict(distance_dict, index_list):
    '''Helper function to create a distance map for a subset
    of data points.'''
    new = {}
    for i in range(len(index_list)):
        for k in range(i + 1, len(index_list)):
            m = index_list[i]
            n = index_list[k]
            new[(m, n)] = distance_dict[(m, n)]
    return new

def sort(distance_dict, idl, threshold=4):
    result = [idl]
    i = 0
    try:
        while True:
            if len(result[i]) >= 2:
                # Distance map restricted to the points of the current cluster
                actual = construct_new_dict(distance_dict, result[i])
                act_max = maximum(actual)
                if distance_dict[act_max[0]] > threshold:
                    # Split: the two most distant points seed the new clusters
                    j = act_max[0][0]
                    k = act_max[0][1]
                    result[i].remove(j)
                    result[i].remove(k)
                    l1 = [j]
                    l2 = [k]
                    # Assign every remaining point to the closer seed
                    # (what to do if the distances are equal?)
                    for iterr in range(len(result[i])):
                        s = result[i][iterr]
                        if s > j:
                            c1 = (j, s)
                        else:
                            c1 = (s, j)
                        if s > k:
                            c2 = (k, s)
                        else:
                            c2 = (s, k)
                        if actual[c1] < actual[c2]:
                            l1.append(s)
                        else:
                            l2.append(s)
                    result.remove(result[i])
                    l1.sort()
                    l2.sort()
                    result.append(l1)
                    result.append(l2)
                else:
                    i += 1
            else:
                i += 1
    except IndexError:
        # i has run past the last cluster: every remaining cluster meets the threshold
        return result


# This is the dataset
a = [1, 2, 2.5, 5]
# Give each entry a unique ID
idl = list(range(len(a)))
dist = create_distance_list(a)
print(sort(dist, idl))
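
Note (my reading of the code above; worth double-checking): with the default threshold of 4 the largest distance in this toy set is |1 - 5| = 4, which is not strictly greater than the threshold, so nothing gets split. To actually see a division, pass a lower threshold, for example:

print(sort(create_distance_list(a), list(range(len(a))), threshold=3))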

I wrote the code for readability; there is a lot that could be made faster, more reliable, and prettier. This is just to give you an idea of how it can be done.
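If the goal is just flat clusters at a distance threshold rather than this specific divisive scheme, SciPy's (agglomerative) hierarchical clustering gets something similar with far less code; a rough sketch (agglomerative, not divisive, so it only approximates the algorithm above):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.array([[1.0], [2.0], [2.5], [5.0]])   # same toy data as above, as a column
Z = linkage(pdist(points), method='complete')      # complete linkage works with cluster diameters
labels = fcluster(Z, t=3, criterion='distance')    # cut the tree at distance threshold 3
print(labels)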

Percussive answered 24/8, 2015 at 9:42 Comment(12)
I pasted the appropriate functions above. Could you please help me with the recursion in the sort function? – Seicento
Do you need the clusters in hierarchical order? Or just the clusters? – Percussive
The result is the indices of the input data list, clustered together. If you need the data values, simply replace each index in the result with the corresponding value of the input list. – Percussive
Can you let me know if this is what you're looking for? – Percussive
Yes, this is exactly what I'm looking for. Thanks. – Seicento
Also have a look at docs.scipy.org/doc/scipy-0.15.1/reference/cluster.html – Percussive
I load the iris dataset. When the threshold is high enough, all the data points end up in the same (big) cluster. But when I lower the threshold I can't achieve that each data point is in its own cluster. Any idea? – Seicento
My only idea so far is that a lot of distances are equal in real datasets and the code above doesn't know what to do (see my comment in the sort function), so it sorts together points that don't belong in the same cluster. If you set the threshold to 0, every point should ultimately be in its own cluster. – Percussive
Maybe we should add a "diameter checking step" as described in the 4th paragraph here? – Seicento
If you need further help beyond the questions here, you can more easily contact me via the e-mail in my profile. – Percussive
Just for further input: a test dataset of a = [1, 1, 1] seems to have the exact same problem. So maybe identical values should be treated as the same data point or distributed randomly. – Percussive
Let us continue this discussion in chat. – Percussive
Some data points are still clustered together. I work around this by adding a small amount of random noise to the actual dictionary in the sort function.

If Dmax > T, then divide the single cluster in two

Your description doesn't necessarily create n clusters.
If a cluster contains two records whose distance is less than T,
they will stay clustered together (am I missing something?)

Paronym answered 1/9, 2015 at 1:56 Comment(0)
