Python Implementation of OPTICS (Clustering) Algorithm
I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs).

I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster in the list contains a list of (x, y) pairs belonging to that cluster.

Decastere answered 1/4, 2011 at 15:43 Comment(10)
did you look at SciPy: docs.scipy.org/doc/scipy/reference/cluster.html?Needlewoman
@Needlewoman - Yes, I did. Actually, I was using the hierarchical clustering methods provided there (fcluster). But now, I'd like to switch to OPTICS.Aretta
Is OPTICS such an unfamiliar/unknown algorithm that my question gets no attention? =(Aretta
Have you considered using K-Means?Kruger
@pyrony - I'm looking for a density-based clustering approach. Furthermore, specifying the 'k' value in K-Means is ill-suited for the kind of task I'm working on.Aretta
There is some pseudocode in the linked Wikipedia article, have you tried translating this? If so, what problems did you encounter?Sigismond
No, I haven't tried it yet. I'm considering it as the last option :)Aretta
For me it would be the first option - it's the best way to learn more about the implementation you eventually end up using...Sigismond
Here is an implementation: gist.github.com/ryangomba/1724881Edifice
See my answer: There now exists an implementation of OPTICS in the pyclustering library.Expense
EDIT: the following is known to not be a complete implementation of OPTICS.

I did a quick search and found the following (Optics). I can't vouch for its quality; however, the algorithm seems pretty simple, so you should be able to validate/adapt it quickly.

Here is a quick example of how to build clusters on the output of the optics algorithm:

def cluster(order, distance, points, threshold):
    ''' Given the output of the optics algorithm,
    compute the clusters:

    @param order The order of the points
    @param distance The reachability distances of the points
    @param points The actual points
    @param threshold The threshold value to cluster on
    @returns A list of cluster groups
    '''
    clusters = [[]]
    points   = sorted(zip(order, distance, points))
    splits   = ((v > threshold, p) for i, v, p in points)
    for iscluster, point in splits:
        if iscluster: clusters[-1].append(point)
        elif len(clusters[-1]) > 0: clusters.append([])
    return [c for c in clusters if c]  # drop any empty trailing group

rd, cd, order = optics(points, 4)
print(cluster(order, rd, points, 38.0))
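As a self-contained toy of the valley idea discussed in the comments below (the reachability values here are made up, and points are labeled by their position in the cluster order), walking the ordered reachability distances with a threshold and treating below-threshold valleys as clusters looks like this:

```python
# Toy reachability distances in cluster order (made-up values); per the
# Wikipedia extraction method, below-threshold valleys form the clusters.
reach  = [5.0, 4.0, 6.0, 40.0, 3.0, 2.0, 4.0, 50.0, 5.0]
labels = list(range(len(reach)))
threshold = 10.0

clusters, current = [], []
for label, r in zip(labels, reach):
    if r <= threshold:          # still inside a valley
        current.append(label)
    elif current:               # a peak ends the current valley
        clusters.append(current)
        current = []
if current:                     # close the last valley
    clusters.append(current)

print(clusters)  # [[0, 1, 2], [4, 5, 6], [8]]
```

The two peaks (40.0 and 50.0) split the sequence into three valleys, i.e. three clusters; raising either peak's neighbors above the threshold would merge or drop clusters accordingly.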
Ashling answered 28/4, 2011 at 17:8 Comment(5)
Thanks Bashwork, but that seems to be the exact same code as what vartec suggested. The problem with that is, I couldn't figure out how I would extract the clustering structure (which elements belong to which cluster) from the output of that algorithm. Please have a look at the 'Note' at the very bottom of my question.Aretta
So the code gives you the output you need to extract clusters (the order and the reachability distance). If you look at the wikipedia section for extracting clusters, you simply need to walk the ordered results with a threshold on the distances (lower threshold means more clusters). (en.wikipedia.org/wiki/OPTICS_algorithm). If this doesn't make sense I can give some example code.Ashling
I just ran the code you have posted and the result I get with a threshold of 38 is [[31.0, 87.0], [73.0, 9.0]] [[5.0, 8.0]] [[97.0, 9.0]] (3 clusters). I lower the threshold to 10 and there is only 1 cluster. The test data I used is the same as the one used in the link you gave (testX). I would appreciate it if you can correct the code and I'll award your bounty.Aretta
Based on my understanding of the algorithm, those results are correct as a cluster is created every time the ordered collection descends below the given threshold. In the case of 38, there are three valleys while in the case of 10 there is only one (the zero result). The threshold basically controls what should be considered a valley.Ashling
I would take all this here with a grain of salt. It diverges considerably from my understanding of this algorithm. You might want to get the official implementation (not in Python) and compare the results.Broeder
I'm not aware of a complete and exact Python implementation of OPTICS. The links posted here seem to be just rough approximations of the OPTICS idea. They also do not use an index for acceleration, so they will run in O(n^2) or more likely even O(n^3).

OPTICS has a number of tricky things besides the obvious idea. In particular, the thresholding is proposed to be done with relative thresholds ("xi") instead of absolute thresholds as posted here (at which point the result will be approximately that of DBSCAN!).
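As a drastically simplified, made-up illustration of the difference (the real xi method in the paper works on steep *areas* of the reachability plot, not single steps), a relative threshold flags a boundary wherever reachability jumps or drops by at least a factor of 1 - xi between consecutive points, instead of comparing each value against one fixed cutoff:

```python
# Hypothetical reachability values in cluster order; xi chosen for illustration.
xi = 0.5
reach = [10.0, 2.0, 2.5, 2.0, 9.0, 1.0, 1.5, 8.0]

# A boundary between positions i and i+1 where the value changes by
# at least the relative factor (1 - xi), in either direction.
boundaries = [
    i for i in range(len(reach) - 1)
    if reach[i + 1] <= reach[i] * (1 - xi)    # steep downward step
    or reach[i] <= reach[i + 1] * (1 - xi)    # steep upward step
]
print(boundaries)  # [0, 3, 4, 6]
```

Note how the small absolute wiggles between 2.0 and 2.5 are ignored: only relative changes of 50% or more register, which is the point of xi-style thresholding.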

The original OPTICS paper contains a suggested approach to converting the algorithm's output into actual clusters:

http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/OPTICS.pdf

The OPTICS implementation in Weka is essentially unmaintained and just as incomplete. It doesn't actually produce clusters; it only computes the cluster order. For this it makes a duplicate of the database, and it isn't really Weka code anyway.

There seems to be a rather extensive implementation available in ELKI in Java by the group that published OPTICS in the first place. You might want to test any other implementation against this "official" version.

Broeder answered 8/2, 2012 at 17:49 Comment(4)
Indeed, there are lots of incomplete OPTICS implementations and clones of the Weka version around. You should take the ELKI version as a reference.Phylis
The relative thresholding is to my mind the point where a relatively clear exposition and approach transition into something much more cloudy, with additional heuristics and hidden parameters. There is probably no way around this, but I definitely feel that the intermediate ordered reachability values constitute a nice result that stands on its own. What happens afterwards is open to different approaches, and the one chosen in this paper is not so self-evident as to be the only one worth considering.Serene
There are at least two more methods proposed for how to extract clusters from the plot. Yet, without such a cluster extraction method, is it actually a clustering algorithm? At some point, you do want to get clusters out of it, not just a plot.Broeder
Note that absolute thresholding means you are doing DBSCAN, not OPTICS. Just more complicated and thus slower.Broeder
While not technically OPTICS, there is an HDBSCAN* implementation for Python available at https://github.com/lmcinnes/hdbscan . This is equivalent to OPTICS with an infinite maximal epsilon and a different cluster extraction method. Since the implementation provides access to the generated cluster hierarchy, you can extract clusters from it via more traditional OPTICS methods as well if you prefer.

Note that despite not limiting the epsilon parameter, this implementation still achieves O(n log(n)) performance using kd-tree and ball-tree based minimum spanning tree algorithms, and can handle quite large datasets.

Worried answered 15/4, 2016 at 16:47 Comment(0)
There now exists the library pyclustering that contains, amongst others, a Python and a C++ implementation of OPTICS.

Expense answered 11/6, 2018 at 8:19 Comment(0)
It is now implemented in the development version (scikit-learn v0.21.dev0) of sklearn (a clustering and machine learning module for Python)

here is the link: https://scikit-learn.org/dev/modules/generated/sklearn.cluster.OPTICS.html
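A minimal sketch of how it can be used on (x, y) pairs, assuming a scikit-learn version that includes sklearn.cluster.OPTICS (the toy data and the eps/min_samples values here are made up for illustration):

```python
import numpy as np
from sklearn.cluster import OPTICS

# Two well-separated blobs of (x, y) points (toy data).
rng = np.random.RandomState(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(20, 2)),
])

# cluster_method="dbscan" extracts DBSCAN-like clusters from the
# reachability plot at a fixed eps; the default is the xi method.
clustering = OPTICS(min_samples=5, cluster_method="dbscan", eps=1.0).fit(points)

# Group the original (x, y) pairs by cluster label (-1 marks noise),
# which yields the list-of-clusters output the question asks for.
clusters = {}
for label, point in zip(clustering.labels_, points):
    if label != -1:
        clusters.setdefault(label, []).append(tuple(point))

print(len(clusters))
```

The fitted estimator also exposes `reachability_` and `ordering_`, so the manual threshold-walking approaches from the other answers can be applied to its output as well.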

Plicate answered 13/3, 2019 at 18:6 Comment(0)
See "Density-based clustering approaches" on http://www.chemometria.us.edu.pl/index.php?goto=downloads

Needlewoman answered 1/4, 2011 at 16:4 Comment(1)
Thanks for the answer vartec, but the implementation seemed incomplete to me. I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster in the list contains a list of (x,y) pairs belonging to that cluster.Aretta
You want to look at a space-filling curve (sfc) or a spatial index. An sfc reduces the 2-d problem to a 1-d one. You want to look at Nick's hilbert curve quadtree spatial index blog. You can also download my implementation of an sfc at phpclasses.org (hilbert-curve).
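To make the reordering idea concrete, here is a sketch of the standard iterative Hilbert-curve index (adapted from the well-known xy2d algorithm; this is not the phpclasses implementation mentioned above). Sorting points by this index tends to keep spatial neighbors adjacent in the 1-d order:

```python
def xy2d(n, x, y):
    """Hilbert-curve index of cell (x, y) on an n-by-n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the pattern repeats at the next level.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Reorder (x, y) pairs along the curve: nearby cells get nearby indices.
pts = [(0, 0), (3, 0), (1, 1), (0, 1)]
ordered = sorted(pts, key=lambda p: xy2d(4, p[0], p[1]))
print(ordered)  # [(0, 0), (1, 1), (0, 1), (3, 0)]
```

As the OPTICS-focused comments below note, this is a locality-preserving reordering rather than a density-based clustering; an R*-tree is usually the better accelerator for OPTICS itself.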

Shalne answered 23/4, 2011 at 21:27 Comment(5)
Thanks epitaph but how exactly does this answer my question? Can you clarify your answer a bit more?Aretta
A sfc is a clustering algorithm using a fractal. A hilbert curve has a fractal dimension of 2. If you have 2d data you can easily subdivide this data into smaller tiles. Basically it's a reordering. It's like storing them in a quadtree. You can also use an adaptive sfc where empty regions are skipped or have a lower granularity of the sfc. Sfc is often used in maps, like google maps.Shalne
Sounds good and worth trying. Thank you. But I'm still looking for an OPTICS implementation in Python.Aretta
No problem, I didn't know about OPTICS, I didn't understand it either. It seems to me it is using a spatial index but I don't understand why.Shalne
With a spatial index, OPTICS can run in O(n log n) instead of O(n^2). So for large, indexable data sets, this makes quite a difference. An R*-Tree is probably a better choice than a sfc though.Broeder
