I'm doing some tests clustering a large number of very large sparse vectors representing the term frequency-inverse document frequency (TF-IDF) of various hypertext documents. What algorithm would you suggest for clustering this data, taking into account the proportions of the data set? The dimension of the vectors would be > 3·10^5 and the number of vectors could be around 10^9. I've taken a look at the DBSCAN and OPTICS algorithms. The number of clusters is not known a priori, and a spatial index with such high dimensionality seems complicated.
I've had almost as good results with simple K-means clustering as with almost anything else, and it's definitely faster than most of the alternatives. I've also gotten good results with pair-wise agglomeration, but it's quite a bit slower. For K-means, you do have to start with some estimated number of clusters, but you can adjust it algorithmically as you go along. If you find two clusters with means that are too close together, you reduce the number of clusters. If you find clusters with too large a range of variation, you try more clusters. I've found sqrt(N) to be a reasonable starting point -- but I'm usually starting with more like 10^7 documents rather than 10^9. For 10^9, it might make sense to reduce that somewhat.
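A minimal sketch of that sqrt(N) starting point (my own illustration, not the answerer's exact procedure), using scikit-learn's MiniBatchKMeans since plain Lloyd's K-means is impractical at 10^9 documents; `X` here is a random stand-in for the sparse TF-IDF matrix:

```python
# Sketch: K-means on a sparse TF-IDF-like matrix, starting from k ~ sqrt(N).
import numpy as np
from scipy import sparse
from scipy.spatial.distance import pdist
from sklearn.cluster import MiniBatchKMeans

X = sparse.random(10_000, 50_000, density=1e-3, format="csr", random_state=0)  # stand-in for TF-IDF

k = int(np.sqrt(X.shape[0]))                       # sqrt(N) starting point from the answer
km = MiniBatchKMeans(n_clusters=k, batch_size=1_000, n_init=3, random_state=0)
labels = km.fit_predict(X)

# One way to drive the "adjust k as you go" idea: if the closest pair of cluster
# means falls under some threshold, merge them and refit with a smaller k.
closest_pair = pdist(km.cluster_centers_).min()
print(f"k = {k}, closest pair of centroids: {closest_pair:.3f}")
```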
If it were up to me, however, I'd think very hard about starting by reducing the dimensionality with something like Landmark MDS, then doing the clustering.
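Landmark MDS itself isn't in the common Python libraries, so as a hedged stand-in for the "reduce first, then cluster" step, TruncatedSVD (LSA) is a typical choice for sparse TF-IDF input; the reduction method is my substitution, not the answerer's:

```python
# Sketch of "reduce dimensionality, then cluster" with TruncatedSVD standing in
# for Landmark MDS; the TF-IDF matrix X is again a random stand-in.
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

X = sparse.random(10_000, 50_000, density=1e-3, format="csr", random_state=0)

lsa = make_pipeline(TruncatedSVD(n_components=200, random_state=0), Normalizer(copy=False))
X_reduced = lsa.fit_transform(X)        # dense (10_000, 200) matrix

labels = MiniBatchKMeans(n_clusters=100, random_state=0).fit_predict(X_reduced)
```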
I hear that semantic hashing achieves excellent results. However, deep belief nets are pretty hard to implement. You might want to try MinHash (that's a probabilistic approach, though) or locality-sensitive hashing for Euclidean spaces.
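MinHash targets Jaccard similarity on sets; for cosine similarity on TF-IDF vectors, one common flavour of locality-sensitive hashing is random-hyperplane (sign) hashing. A minimal sketch, assuming a sparse CSR input matrix (my illustration, not a reference implementation):

```python
# Random-hyperplane LSH for cosine similarity: documents whose bit signatures
# agree on many bits are likely to have a small angle between them.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_docs, n_terms, n_bits = 10_000, 50_000, 64

X = sparse.random(n_docs, n_terms, density=1e-3, format="csr", random_state=0)  # stand-in TF-IDF
planes = rng.standard_normal((n_terms, n_bits))     # random hyperplanes

bits = (X @ planes) > 0                             # (n_docs, n_bits) boolean signature
signatures = np.packbits(bits, axis=1)              # 8 bytes per document

# Documents with identical signatures land in the same bucket and become
# cheap candidate pairs for the actual clustering step.
buckets = {}
for i, sig in enumerate(map(bytes, signatures)):
    buckets.setdefault(sig, []).append(i)
```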
In general, clustering in such high-dimensional spaces is difficult due to the curse of dimensionality and the fact that most items have similar distances to each other. Standard approaches like K-means might work if you reduce the dimensionality beforehand via SOMs or PCA.
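A quick numpy experiment (purely illustrative, with random data) shows the distance-concentration effect the answer refers to: as the dimension grows, the nearest and farthest points from a query become almost equally far away.

```python
# Distance concentration: the max/min distance ratio from one query point
# shrinks toward 1 as the dimension grows, which hurts distance-based clustering.
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1_000, 10_000):
    points = rng.random((1_000, dim))
    d = np.linalg.norm(points - points[0], axis=1)[1:]   # distances to one query point
    print(f"dim={dim:>6}  max/min distance ratio = {d.max() / d.min():.2f}")
```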
When clustering data I would always try at least these two algorithms in this order:
K-means: try tweaking the results as much as possible. If you can get K-means to work for you and provide decent results, you will almost surely not do better with any other algorithm.
Expectation Maximization: the K-means algorithm was actually developed as a cheap and good alternative to the EM algorithm. The EM algorithm is more complex to understand and more expensive to compute, but its results are excellent. You can learn more about EM at http://en.wikipedia.org/wiki/Expectation-maximization_algorithm . There is an OpenCV implementation of EM: http://opencv.willowgarage.com/documentation/expectation-maximization.html (see the sketch after this list for a Python alternative).
If the results of neither of these two are satisfactory, I would start looking elsewhere, but not until you have tried both.
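As an alternative to the OpenCV implementation linked above, here is a hedged sketch of the K-means vs. EM comparison using scikit-learn's GaussianMixture (which fits a Gaussian mixture by EM). It needs dense, low-dimensional input, so a TruncatedSVD reduction step is assumed; none of these choices come from the answer itself.

```python
# K-means vs. EM (Gaussian mixture) on dimensionality-reduced TF-IDF-like data.
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

X = sparse.random(5_000, 50_000, density=1e-3, format="csr", random_state=0)  # stand-in TF-IDF
X_red = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

k = 20
km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_red)
gm = GaussianMixture(n_components=k, covariance_type="diag", random_state=0).fit(X_red)
em_labels = gm.predict(X_red)

print("EM log-likelihood per sample:", gm.score(X_red))
```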
Decision trees are popular for working efficiently in high-dimensional spaces. Check out Clustering Via Decision Tree Construction.
Also, Randomized Forests are extremely efficient learners and an OpenCV implementation exists if you want to play with it.
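The CLTree method from that paper isn't in common libraries; a related randomized-trees idea that is readily available is scikit-learn's RandomTreesEmbedding, which maps each document to a one-hot encoding of the leaves it falls into in an ensemble of totally random trees. A minimal sketch of using that embedding as a preprocessing step before clustering (my substitution, not the paper's algorithm):

```python
# Randomized-trees embedding followed by cheap clustering of the leaf indicators.
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomTreesEmbedding

X = sparse.random(10_000, 50_000, density=1e-3, format="csr", random_state=0)  # stand-in TF-IDF

embedder = RandomTreesEmbedding(n_estimators=50, max_depth=5, random_state=0)
X_leaves = embedder.fit_transform(X)            # sparse (n_docs, n_leaves) indicator matrix

labels = MiniBatchKMeans(n_clusters=100, random_state=0).fit_predict(X_leaves)
```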
Computing K-means or the Expectation-Maximization algorithm is very expensive on large, high-dimensional data sets. The article below describes efficient pre-clustering with canopy clustering:
Efficient clustering of high-dimensional data sets with application to reference matching
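The core idea of that paper (McCallum, Nigam & Ungar) is that a cheap distance and two thresholds T1 > T2 produce overlapping "canopies", and the expensive clusterer then only compares points that share a canopy. A minimal sketch of that idea, using cosine distance on the sparse rows (the implementation is mine, not the article's):

```python
# Canopy pre-clustering sketch: loose threshold t1 admits points to a canopy,
# tight threshold t2 removes them from further consideration as centres.
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

def canopies(X, t1=0.6, t2=0.3, seed=0):
    """Return a list of index arrays, one per canopy (cosine-distance thresholds)."""
    Xn = normalize(X)                            # unit rows -> cosine distance = 1 - dot product
    rng = np.random.default_rng(seed)
    remaining = set(range(X.shape[0]))
    out = []
    while remaining:
        c = rng.choice(list(remaining))          # pick a random remaining point as centre
        dist = 1.0 - (Xn @ Xn[c].T).toarray().ravel()
        out.append(np.flatnonzero(dist < t1))    # loosely close points join the canopy
        remaining -= set(np.flatnonzero(dist < t2).tolist())  # tightly close points are retired
        remaining.discard(c)
    return out

X = sparse.random(2_000, 50_000, density=1e-3, format="csr", random_state=0)  # stand-in TF-IDF
print(len(canopies(X)), "canopies")
```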
Generally, standard K-means or EM is not efficient for big data: the cost grows with both the number of points and their dimensionality, and neither scales well as the data set grows. Instead, I would look for algorithms with time complexity O(n) or O(n·log n).
Also, K-means doesn't guarantee global convergence; it converges to a local minimum. The same is true of EM, which only reaches a local optimum of the likelihood. K-means can in fact be derived as a limiting special case of EM on a Gaussian mixture, i.e. EM is its probabilistic generalization.
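A tiny experiment consistent with that point (my own illustration, on synthetic blobs): both K-means (measured by inertia) and EM (measured by log-likelihood) can land in different local optima depending on initialization.

```python
# Both K-means and EM reach only local optima: different random initializations
# can give different objective values on the same data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=2_000, centers=10, n_features=20, random_state=0)

for seed in range(3):
    km = KMeans(n_clusters=10, n_init=1, init="random", random_state=seed).fit(X)
    gm = GaussianMixture(n_components=10, n_init=1, random_state=seed).fit(X)
    print(f"seed {seed}: K-means inertia = {km.inertia_:.1f}, "
          f"EM log-likelihood = {gm.score(X):.3f}")
```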