Python Implementation of OPTICS (Clustering) Algorithm
I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs).

I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster in the list contains a list of (x, y) pairs belonging to that cluster.

Decastere answered 1/4, 2011 at 15:43 Comment(10)
did you look at SciPy: docs.scipy.org/doc/scipy/reference/cluster.html?Needlewoman
@Needlewoman - Yes, I did. Actually, I was using the hierarchical clustering methods provided there (fcluster). But now, I'd like to switch to OPTICS.Aretta
Is OPTICS such an unfamiliar/unknown algorithm that my question gets no attention? =(Aretta
Have you considered using K-Means?Kruger
@pyrony - I'm looking for a density-based clustering approach. Furthermore, specifying the 'k' value in K-Means is ill-suited for the kind of task I'm working on.Aretta
There is some pseudocode in the linked Wikipedia article, have you tried translating this? If so, what problems did you encounter?Sigismond
No, I haven't tried it yet. I'm considering it as the last option :)Aretta
For me it would be the first option - it's the best way to learn more about the implementation you eventually end up using...Sigismond
Here is an implementation: gist.github.com/ryangomba/1724881Edifice
See my answer: There now exists an implementation of OPTICS in the pyclustering library.Expense
EDIT: the following is known to not be a complete implementation of OPTICS.

I did a quick search and found the following (Optics). I can't vouch for its quality; however, the algorithm seems pretty simple, so you should be able to validate/adapt it quickly.

Here is a quick example of how to build clusters on the output of the optics algorithm:

def cluster(order, distance, points, threshold):
    ''' Given the output of the optics algorithm,
    compute the clusters:

    @param order The order of the points
    @param distance The reachability distances of the points
    @param points The actual points
    @param threshold The threshold value to cluster on
    @returns A list of cluster groups
    '''
    clusters = [[]]
    points   = sorted(zip(order, distance, points))
    splits   = ((v > threshold, p) for i, v, p in points)
    for iscluster, point in splits:
        if iscluster: clusters[-1].append(point)
        elif len(clusters[-1]) > 0: clusters.append([])
    return [c for c in clusters if c]  # drop any empty trailing group

rd, cd, order = optics(points, 4)
print(cluster(order, rd, points, 38.0))
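As a self-contained toy of the valley idea discussed in the comments below (the reachability values here are made up, and points are labeled by their position in the cluster order), walking the ordered reachability distances with a threshold and treating below-threshold valleys as clusters looks like this:

```python
# Toy reachability distances in cluster order (made-up values); per the
# Wikipedia extraction method, below-threshold valleys form the clusters.
reach  = [5.0, 4.0, 6.0, 40.0, 3.0, 2.0, 4.0, 50.0, 5.0]
labels = list(range(len(reach)))
threshold = 10.0

clusters, current = [], []
for label, r in zip(labels, reach):
    if r <= threshold:          # still inside a valley
        current.append(label)
    elif current:               # a peak ends the current valley
        clusters.append(current)
        current = []
if current:                     # close the last valley
    clusters.append(current)

print(clusters)  # [[0, 1, 2], [4, 5, 6], [8]]
```

The two peaks (40.0 and 50.0) split the sequence into three valleys, i.e. three clusters; raising either peak's neighbors above the threshold would merge or drop clusters accordingly.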
Ashling answered 28/4, 2011 at 17:8 Comment(5)
Thanks Bashwork, but that seems to be the exact same code as what vartec suggested. The problem with that is, I couldn't figure out how I would extract the clustering structure (which elements belong to which cluster) from the output of that algorithm. Please have a look at the 'Note' at the very bottom of my question.Aretta
So the code gives you the output you need to extract clusters (the order and the reachability distance). If you look at the wikipedia section for extracting clusters, you simply need to walk the ordered results with a threshold on the distances (lower threshold means more clusters). (en.wikipedia.org/wiki/OPTICS_algorithm). If this doesn't make sense I can give some example code.Ashling
I just ran the code you have posted and the result I get with a threshold of 38 is [[31.0, 87.0], [73.0, 9.0]] [[5.0, 8.0]] [[97.0, 9.0]] (3 clusters). I lower the threshold to 10 and there is only 1 cluster. The test data I used is the same as the one used in the link you gave (testX). I would appreciate it if you can correct the code and I'll award your bounty.Aretta
Based on my understanding of the algorithm, those results are correct as a cluster is created every time the ordered collection descends below the given threshold. In the case of 38, there are three valleys while in the case of 10 there is only one (the zero result). The threshold basically controls what should be considered a valley.Ashling
I would take all this here with a grain of salt. It diverges considerably from my understanding of this algorithm. You might want to get the official implementation (not in Python) and compare the results.Broeder
I'm not aware of a complete and exact Python implementation of OPTICS. The links posted here seem to be just rough approximations of the OPTICS idea. They also do not use an index for acceleration, so they will run in O(n^2) or more likely even O(n^3).

OPTICS has a number of tricky things besides the obvious idea. In particular, the thresholding is proposed to be done with relative thresholds ("xi") instead of absolute thresholds as posted here (at which point the result will be approximately that of DBSCAN!).
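As a drastically simplified, made-up illustration of the difference (the real xi method in the paper works on steep *areas* of the reachability plot, not single steps), a relative threshold flags a boundary wherever reachability jumps or drops by at least a factor of 1 - xi between consecutive points, instead of comparing each value against one fixed cutoff:

```python
# Hypothetical reachability values in cluster order; xi chosen for illustration.
xi = 0.5
reach = [10.0, 2.0, 2.5, 2.0, 9.0, 1.0, 1.5, 8.0]

# A boundary between positions i and i+1 where the value changes by
# at least the relative factor (1 - xi), in either direction.
boundaries = [
    i for i in range(len(reach) - 1)
    if reach[i + 1] <= reach[i] * (1 - xi)    # steep downward step
    or reach[i] <= reach[i + 1] * (1 - xi)    # steep upward step
]
print(boundaries)  # [0, 3, 4, 6]
```

Note how the small absolute wiggles between 2.0 and 2.5 are ignored: only relative changes of 50% or more register, which is the point of xi-style thresholding.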

The original OPTICS paper contains a suggested approach to converting the algorithm's output into actual clusters:

http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/OPTICS.pdf

The OPTICS implementation in Weka is essentially unmaintained and just as incomplete. It doesn't actually produce clusters; it only computes the cluster order. For this it makes a duplicate of the database, and it isn't really Weka code anyway.

There seems to be a rather extensive implementation available in ELKI in Java by the group that published OPTICS in the first place. You might want to test any other implementation against this "official" version.

Broeder answered 8/2, 2012 at 17:49 Comment(4)
Indeed, there are lots of incomplete OPTICS implementations and clones of the Weka version around. You should take the ELKI version as a reference.Phylis
The relative thresholding is to my mind the point where a relatively clear exposition and approach transition into something much more cloudy, with additional heuristics and hidden parameters. There is probably no way around this, but I definitely feel that the intermediate ordered reachability values constitute a nice result that stands on its own. What happens afterwards is open to different approaches, and the one chosen in this paper is not so self-evident as to be the only one worth considering.Serene
There are at least two more methods proposed for how to extract clusters from the plot. Yet, without such a cluster extraction method, is it actually a clustering algorithm? At some point, you do want to get clusters out of it, not just a plot.Broeder
Note that absolute thresholding means you are doing DBSCAN, not OPTICS. Just more complicated and thus slower.Broeder
While not technically OPTICS, there is an HDBSCAN* implementation for Python available at https://github.com/lmcinnes/hdbscan . This is equivalent to OPTICS with an infinite maximal epsilon and a different cluster extraction method. Since the implementation provides access to the generated cluster hierarchy, you can extract clusters from it via more traditional OPTICS methods as well if you prefer.

Note that despite not limiting the epsilon parameter, this implementation still achieves O(n log(n)) performance using kd-tree and ball-tree based minimum spanning tree algorithms, and can handle quite large datasets.

Worried answered 15/4, 2016 at 16:47 Comment(0)
There now exists the library pyclustering that contains, amongst others, a Python and a C++ implementation of OPTICS.

Expense answered 11/6, 2018 at 8:19 Comment(0)
It is now implemented in the development version (scikit-learn v0.21.dev0) of sklearn (a clustering and machine learning module for Python)

here is the link: https://scikit-learn.org/dev/modules/generated/sklearn.cluster.OPTICS.html
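A minimal sketch of how it can be used on (x, y) pairs, assuming a scikit-learn version that includes sklearn.cluster.OPTICS (the toy data and the eps/min_samples values here are made up for illustration):

```python
import numpy as np
from sklearn.cluster import OPTICS

# Two well-separated blobs of (x, y) points (toy data).
rng = np.random.RandomState(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(20, 2)),
])

# cluster_method="dbscan" extracts DBSCAN-like clusters from the
# reachability plot at a fixed eps; the default is the xi method.
clustering = OPTICS(min_samples=5, cluster_method="dbscan", eps=1.0).fit(points)

# Group the original (x, y) pairs by cluster label (-1 marks noise),
# which yields the list-of-clusters output the question asks for.
clusters = {}
for label, point in zip(clustering.labels_, points):
    if label != -1:
        clusters.setdefault(label, []).append(tuple(point))

print(len(clusters))
```

The fitted estimator also exposes `reachability_` and `ordering_`, so the manual threshold-walking approaches from the other answers can be applied to its output as well.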

Plicate answered 13/3, 2019 at 18:6 Comment(0)
See "Density-based clustering approaches" on http://www.chemometria.us.edu.pl/index.php?goto=downloads

Needlewoman answered 1/4, 2011 at 16:4 Comment(1)
Thanks for the answer vartec, but the implementation seemed incomplete to me. I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster in the list contains a list of (x,y) pairs belonging to that cluster.Aretta
You want to look at a space-filling curve (sfc) or a spatial index. An sfc reduces the 2-d problem to a 1-d one. You want to look at Nick's hilbert curve quadtree spatial index blog. You can also download my implementation of an sfc at phpclasses.org (hilbert-curve).
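To make the reordering idea concrete, here is a sketch of the standard iterative Hilbert-curve index (adapted from the well-known xy2d algorithm; this is not the phpclasses implementation mentioned above). Sorting points by this index tends to keep spatial neighbors adjacent in the 1-d order:

```python
def xy2d(n, x, y):
    """Hilbert-curve index of cell (x, y) on an n-by-n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the pattern repeats at the next level.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Reorder (x, y) pairs along the curve: nearby cells get nearby indices.
pts = [(0, 0), (3, 0), (1, 1), (0, 1)]
ordered = sorted(pts, key=lambda p: xy2d(4, p[0], p[1]))
print(ordered)  # [(0, 0), (1, 1), (0, 1), (3, 0)]
```

As the OPTICS-focused comments below note, this is a locality-preserving reordering rather than a density-based clustering; an R*-tree is usually the better accelerator for OPTICS itself.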

Shalne answered 23/4, 2011 at 21:27 Comment(5)
Thanks epitaph but how exactly does this answer my question? Can you clarify your answer a bit more?Aretta
A sfc is a clustering algorithm using a fractal. A hilbert curve has a fractal dimension of 2. If you have 2d data you can easily subdivide this data into smaller tiles. Basically it's a reordering. It's like storing them in a quadtree. You can also use an adaptive sfc where empty regions are skipped or have a lower granularity of the sfc. Sfc is often used in maps, like google maps.Shalne
Sounds good and worth trying. Thank you. But I'm still looking for an OPTICS implementation in Python.Aretta
No problem, I didn't know about OPTICS, I didn't understand it either. It seems to me it is using a spatial index but I don't understand why.Shalne
With a spatial index, OPTICS can run in O(n log n) instead of O(n^2). So for large, indexable data sets, this makes quite a difference. An R*-Tree is probably a better choice than a sfc though.Broeder
