I want to run some experiments on semi-supervised (constrained) clustering, in particular with background knowledge provided as instance-level pairwise constraints (must-link or cannot-link constraints). Are there any good open-source packages that implement semi-supervised clustering? I looked at PyBrain, mlpy, scikit-learn and Orange, and I couldn't find any constrained clustering algorithms. In particular, I'm interested in constrained k-means or constrained density-based clustering algorithms (like C-DBSCAN). Packages in MATLAB, Python, Java or C++ would be preferred, but need not be limited to these languages.
The Python package scikit-learn now implements Ward hierarchical clustering (since 0.15) and agglomerative clustering (since 0.14) with support for connectivity constraints.
I also have a real-world application in mind: identifying tracks from cell positions, where each track may contain only one position per time point.
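Note that these are connectivity (neighborhood) constraints rather than explicit must-link/cannot-link pairs. A minimal sketch of how they are passed in, using a sparse k-nearest-neighbors graph on toy two-blob data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Toy data: two well-separated noisy blobs of 20 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

# Connectivity constraint: each sample may only be merged
# with its 5 nearest neighbors during agglomeration
connectivity = kneighbors_graph(X, n_neighbors=5, include_self=False)

model = AgglomerativeClustering(n_clusters=2, linkage="ward",
                                connectivity=connectivity)
labels = model.fit_predict(X)
```

With a disconnected neighbor graph scikit-learn will warn and complete the connectivity itself; the clustering still runs.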
The R package conclust implements a number of algorithms:
It provides four main functions: ckmeans(), lcvqe(), mpckm() and ccls(). Each takes an unlabeled dataset and two lists of must-link and cannot-link constraints as input, and produces a clustering as output.
There's also an implementation of COP-KMeans in python.
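The idea behind COP-KMeans is easy to sketch: standard k-means, except each point is assigned to the nearest centroid whose choice violates none of its must-link/cannot-link constraints. Below is a minimal illustration of a single constrained assignment pass, not the package's actual code; the names `violates` and `cop_kmeans_step` are purely illustrative:

```python
import numpy as np

def violates(i, cluster, assign, ml, cl):
    """True if putting point i into `cluster` breaks a constraint,
    given the partial assignment `assign` (-1 = unassigned)."""
    for a, b in ml:
        other = b if a == i else (a if b == i else None)
        if other is not None and assign[other] != -1 and assign[other] != cluster:
            return True
    for a, b in cl:
        other = b if a == i else (a if b == i else None)
        if other is not None and assign[other] == cluster:
            return True
    return False

def cop_kmeans_step(X, centers, ml, cl):
    """One constrained assignment pass; returns labels, or None if infeasible."""
    assign = np.full(len(X), -1, dtype=int)
    for i, x in enumerate(X):
        # try centroids from nearest to farthest
        for c in np.argsort(np.linalg.norm(centers - x, axis=1)):
            if not violates(i, c, assign, ml, cl):
                assign[i] = c
                break
        else:
            return None  # no feasible cluster for this point
    return assign

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
# tie point 1 to point 0, and keep points 0 and 2 apart
labels = cop_kmeans_step(X, centers, ml=[(0, 1)], cl=[(0, 2)])
# labels -> array([0, 0, 1, 1])
```

A full implementation repeats this assignment step and the usual centroid update until convergence; note the result can depend on the order in which points are assigned.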
Maybe it's a bit late, but have a look at the following:
An extension of Weka (in Java) that implements PKM, MKM and PKMKM
A Gaussian mixture model using EM with constraints, in MATLAB
I hope that this helps.
Full disclosure: I am the author of k-means-constrained.
Here is a Python implementation of k-means clustering where you can specify the minimum and maximum cluster sizes. It uses the same API as scikit-learn, so it is fairly easy to use. It is also backed by a fast C++ package, so performance is good.
You can pip install it:
pip install k-means-constrained
Example use:
>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,
...     size_min=2,
...     size_max=5,
...     random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
The semisupervised package on GitHub has a usage similar to the scikit-learn API.
pip install semisupervised
Step 1. Label the unlabeled samples as -1.
Step 2. model.fit(X, y)
Step 3. model.predict(X_test)
Example:
import numpy as np
from sklearn import metrics
from semisupervised.TSVM import S3VM

model = S3VM()
# label_X_train / label_y_train are the labeled samples;
# unlabel_X_train are the unlabeled samples, whose labels must all be -1
unlabel_y = -1 * np.ones(len(unlabel_X_train))
model.fit(np.vstack((label_X_train, unlabel_X_train)),
          np.append(label_y_train, unlabel_y))
# predict
predict = model.predict(X_test)
# metric
acc = metrics.accuracy_score(y_test, predict)
print("accuracy", acc)
Check out the Python package active-semi-supervised-clustering:
Github https://github.com/datamole-ai/active-semi-supervised-clustering