I want to run some experiments on semi-supervised (constrained) clustering, in particular with background knowledge provided as instance-level pairwise constraints (must-link or cannot-link constraints). Are there any good open-source packages that implement semi-supervised clustering? I looked at PyBrain, mlpy, scikit-learn and Orange, and I couldn't find any constrained clustering algorithms. In particular, I'm interested in constrained k-means or constrained density-based clustering algorithms (like C-DBSCAN). Packages in MATLAB, Python, Java or C++ would be preferred, but need not be limited to these languages.
The Python package scikit-learn now implements Ward hierarchical clustering (since 0.15) and agglomerative clustering (since 0.14) with support for connectivity constraints.
I also have a real-world application in mind: identifying tracks from cell positions, where each track may contain only one position per time point.
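Note that these are connectivity (neighborhood) constraints rather than explicit must-link/cannot-link pairs. A minimal sketch of how they are passed in, using a sparse k-nearest-neighbors graph on toy two-blob data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Toy data: two well-separated noisy blobs of 20 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

# Connectivity constraint: each sample may only be merged
# with its 5 nearest neighbors during agglomeration
connectivity = kneighbors_graph(X, n_neighbors=5, include_self=False)

model = AgglomerativeClustering(n_clusters=2, linkage="ward",
                                connectivity=connectivity)
labels = model.fit_predict(X)
```

With a disconnected neighbor graph scikit-learn will warn and complete the connectivity itself; the clustering still runs.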
The R package conclust implements a number of algorithms:
It provides four main functions: ckmeans(), lcvqe(), mpckm() and ccls(). Each takes an unlabeled dataset and two lists of must-link and cannot-link constraints as input, and produces a clustering as output.
There's also an implementation of COP-KMeans in python.
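The idea behind COP-KMeans is easy to sketch: standard k-means, except each point is assigned to the nearest centroid whose choice violates none of its must-link/cannot-link constraints. Below is a minimal illustration of a single constrained assignment pass, not the package's actual code; the names `violates` and `cop_kmeans_step` are purely illustrative:

```python
import numpy as np

def violates(i, cluster, assign, ml, cl):
    """True if putting point i into `cluster` breaks a constraint,
    given the partial assignment `assign` (-1 = unassigned)."""
    for a, b in ml:
        other = b if a == i else (a if b == i else None)
        if other is not None and assign[other] != -1 and assign[other] != cluster:
            return True
    for a, b in cl:
        other = b if a == i else (a if b == i else None)
        if other is not None and assign[other] == cluster:
            return True
    return False

def cop_kmeans_step(X, centers, ml, cl):
    """One constrained assignment pass; returns labels, or None if infeasible."""
    assign = np.full(len(X), -1, dtype=int)
    for i, x in enumerate(X):
        # try centroids from nearest to farthest
        for c in np.argsort(np.linalg.norm(centers - x, axis=1)):
            if not violates(i, c, assign, ml, cl):
                assign[i] = c
                break
        else:
            return None  # no feasible cluster for this point
    return assign

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
# tie point 1 to point 0, and keep points 0 and 2 apart
labels = cop_kmeans_step(X, centers, ml=[(0, 1)], cl=[(0, 2)])
# labels -> array([0, 0, 1, 1])
```

A full implementation repeats this assignment step and the usual centroid update until convergence; note the result can depend on the order in which points are assigned.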
Maybe it's a bit late, but have a look at the following:
An extension of Weka (in Java) that implements PKM, MKM and PKMKM
A Gaussian mixture model using EM with constraints, in MATLAB
I hope that this helps.
Full disclosure: I am the author of k-means-constrained.
Here is a Python implementation of k-means clustering where you can specify the minimum and maximum cluster sizes. It uses the same API as scikit-learn, so it is fairly easy to use. It is also backed by a fast C++ package, so performance is good.
You can pip install it:
pip install k-means-constrained
Example use:
>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,
...     size_min=2,
...     size_max=5,
...     random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
The semisupervised package on GitHub has a usage similar to the scikit-learn API.
pip install semisupervised
Step 1. Label the unlabeled samples as -1.
Step 2. model.fit(X, y)
Step 3. model.predict(X_test)
Example:
import numpy as np
from sklearn import metrics
from semisupervised.TSVM import S3VM

model = S3VM()
# label_X_train / label_y_train are the labeled samples;
# unlabel_X_train are the unlabeled samples, whose labels must all be -1
unlabel_y = -1 * np.ones(len(unlabel_X_train))
model.fit(np.vstack((label_X_train, unlabel_X_train)),
          np.append(label_y_train, unlabel_y))
# predict
predict = model.predict(X_test)
# metric
acc = metrics.accuracy_score(y_test, predict)
print("accuracy", acc)
Check out the Python package active-semi-supervised-clustering:
Github https://github.com/datamole-ai/active-semi-supervised-clustering