Clustering huge vector space

I'm doing some tests clustering a large number of very large sparse vectors representing the term frequency-inverse document frequency (TF-IDF) of various hypertextual documents. What algorithm would you suggest for clustering this data, taking into account the scale of the data set? The dimension of the vectors would be > 3·10^5 and the number of vectors could be around 10^9. I've taken a look at the DBSCAN and OPTICS algorithms. The number of clusters is not known a priori, and a spatial index with such high dimensionality seems complicated.

Kollwitz answered 8/10, 2009 at 18:51 Comment(0)

I've had nearly as good results with simple K-means clustering as with almost anything else, and it's definitely faster than most of the alternatives. I've also gotten good results with pairwise agglomeration, but it's quite a bit slower. For K-means, you do have to start with some estimated number of clusters, but you can adjust it algorithmically as you go along: if you find two clusters with means that are too close together, you reduce the number of clusters; if you find clusters with too large a range of variation, you try more clusters. I've found sqrt(N) to be a reasonable starting point, but I'm usually starting with more like 10^7 documents rather than 10^9. For 10^9, it might make sense to reduce that somewhat.
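As a rough illustration (not the answerer's actual code), here is a minimal sketch of the sqrt(N) starting point and the merge-when-too-close adjustment, assuming scikit-learn and a sparse TF-IDF matrix; the toy data and the 0.05 threshold are made up:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.metrics import pairwise_distances

    # Toy stand-in for the real TF-IDF matrix (rows = documents).
    X = sparse_random(10_000, 1_000, density=0.01, format="csr", random_state=0)

    k = int(np.sqrt(X.shape[0]))  # sqrt(N) starting point, ~100 here

    # MiniBatchKMeans accepts sparse input and scales far better than
    # plain Lloyd's iterations at this data size.
    km = MiniBatchKMeans(n_clusters=k, batch_size=1024, random_state=0).fit(X)

    # Merge heuristic: if two cluster means are too close, refit with fewer
    # clusters (symmetrically, split clusters whose variance is too large).
    d = pairwise_distances(km.cluster_centers_)
    np.fill_diagonal(d, np.inf)
    if d.min() < 0.05:
        km = MiniBatchKMeans(n_clusters=k - 1, batch_size=1024,
                             random_state=0).fit(X)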

If it were up to me, however, I'd think very hard about starting by reducing the dimensionality with something like Landmark MDS, then doing the clustering.
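Landmark MDS has no standard implementation in the usual Python libraries, so here is a hedged sketch of the same reduce-then-cluster idea with TruncatedSVD (a.k.a. LSA, a common stand-in for sparse TF-IDF) playing the reduction role:

    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import MiniBatchKMeans

    # Toy stand-in for the corpus: 10^4 docs x 3*10^5 terms, very sparse.
    X = sparse_random(10_000, 300_000, density=1e-4, format="csr",
                      random_state=0)

    # Reduce to a few hundred dense dimensions first (Landmark MDS would
    # play the same role), then cluster in the small space.
    X_low = TruncatedSVD(n_components=200, random_state=0).fit_transform(X)
    labels = MiniBatchKMeans(n_clusters=100, random_state=0).fit_predict(X_low)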

Mulligan answered 8/10, 2009 at 19:12 Comment(1)
K-Means should always be the first segmentation technique you try when trying to cluster anything. It's simple and efficient and provides excellent results most of the time. The only downside is having to choose an appropriate value of K. You can always try an increasing sequence of K's, calculating the inter-cluster variance as a criterion for the quality of the clustering. This, however, does not work that well in practice.Cidevant
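A minimal sketch of that increasing-K sweep, using scikit-learn's inertia_ (the within-cluster counterpart of the variance criterion mentioned) on made-up blob data:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=2_000, centers=5, random_state=0)

    # Sweep K upward and watch where the within-cluster variance (inertia)
    # stops improving sharply -- the classic "elbow" criterion.
    for k in range(2, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 1))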

I hear that semantic hashing achieves excellent results. However, deep belief nets are pretty hard to implement. You might want to try min-hashing (that's a probabilistic approach, though) or locality-sensitive hashing for Euclidean spaces.
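For flavor, here is a toy sketch of the random-hyperplane (cosine) variant of LSH; this is a simplification rather than the Euclidean p-stable scheme, and every size and name in it is made up:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_bits = 300, 16

    # Each vector hashes to the sign pattern of its projections onto random
    # hyperplanes, so vectors with high cosine similarity often share buckets.
    planes = rng.standard_normal((n_bits, dim))

    def bucket_key(v):
        bits = (planes @ v) > 0
        return int(bits.dot(1 << np.arange(n_bits)))  # pack bits into an int

    docs = rng.random((1_000, dim))
    buckets = {}
    for i, v in enumerate(docs):
        buckets.setdefault(bucket_key(v), []).append(i)
    # Pairs landing in the same bucket are candidate near neighbours.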

In general, clustering in such high-dimensional spaces is difficult due to the curse of dimensionality and the fact that most items have similar distances to each other. Standard approaches like K-means might work if you reduce the dimensionality beforehand via SOMs or PCA.
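A quick numerical illustration of the distance-concentration claim: for uniform random points, the spread of pairwise distances relative to their mean shrinks as the dimension grows, which is what makes naive distance-based clustering hard up there.

    import numpy as np
    from sklearn.metrics import pairwise_distances

    rng = np.random.default_rng(0)
    for dim in (10, 1_000, 100_000):
        X = rng.random((200, dim))
        d = pairwise_distances(X)[np.triu_indices(200, k=1)]
        print(dim, round(d.std() / d.mean(), 4))  # relative spread shrinks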

Woodson answered 8/10, 2009 at 19:17 Comment(1)
Thanks for the interesting links.Kollwitz

When clustering data I would always try at least these two algorithms in this order:

  1. K-Means: try tweaking the results as much as possible. If you can get K-Means to work for you and provide decent results, you will almost surely not do better with any other algorithm.

  2. Expectation Maximization: the K-means algorithm was actually developed as a cheap and good alternative to the EM algorithm. The EM algorithm is more complex to understand and more expensive to compute, but its results are excellent. You can learn more about EM at http://en.wikipedia.org/wiki/Expectation-maximization_algorithm . There is an OpenCV implementation of EM: http://opencv.willowgarage.com/documentation/expectation-maximization.html (see the sketch after this list for a quick side-by-side).

If the results of neither of these two are satisfactory, I would start looking elsewhere, but not until you have tried both.
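For that side-by-side, here is a sketch using scikit-learn stand-ins on made-up blob data (rather than the OpenCV implementation linked above):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=2_000, centers=4, random_state=0)

    # 1. K-Means: hard assignments, effectively spherical clusters, cheap.
    km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    # 2. EM on a Gaussian mixture: soft assignments and full covariances,
    #    costlier to fit but more expressive.
    gm = GaussianMixture(n_components=4, covariance_type="full",
                         random_state=0).fit(X)
    em_labels = gm.predict(X)
    em_probs = gm.predict_proba(X)  # per-point membership probabilities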

Cidevant answered 8/10, 2009 at 22:10 Comment(2)
Isn't K-Means an instance of EM?Woodson
@bayer: No, they are most certainly not the same algorithm, if that's what you mean. K-Means is non-parametric but EM is parametric (meaning EM asserts that there is an underlying multivariate Gaussian distribution for the data, which is not a very stringent assumption if you consider the central limit theorem). From what I understand, the EM algorithm sometimes gets grouped as a meta-algorithm under which other algorithms fall. It can actually be implemented standalone from what I've seen.Cidevant

Decision trees are popular for working efficiently in high-dimensional spaces. Check out Clustering Via Decision Tree Construction.

Also, Randomized Forests are extremely efficient learners and an OpenCV implementation exists if you want to play with it.

Schistosome answered 8/10, 2009 at 19:13 Comment(0)

Computing K-means or the expectation-maximization algorithm is very expensive on large, high-dimensional data sets. The article below describes efficient pre-clustering with canopy clustering:

Efficient clustering of high-dimensional data sets with application to reference matching
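A compact sketch of the canopy idea from that paper (my own simplification: Euclidean distance stands in for the cheap metric, the thresholds are illustrative, and T1 > T2 is required):

    import numpy as np

    def canopies(X, t1, t2, seed=0):
        # Greedy canopy pre-clustering with a cheap metric; requires t1 > t2.
        rng = np.random.default_rng(seed)
        remaining = np.arange(len(X))
        out = []
        while len(remaining):
            center = remaining[rng.integers(len(remaining))]
            d = np.linalg.norm(X[remaining] - X[center], axis=1)
            out.append(remaining[d < t1])   # loose threshold: canopy members
            remaining = remaining[d >= t2]  # tight threshold: leave the pool
        return out

    X = np.random.default_rng(1).random((1_000, 20))
    print(len(canopies(X, t1=1.7, t2=1.4)))  # number of overlapping canopies

The expensive clusterer (K-means, EM, agglomeration) then only needs to run within each canopy.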

Generally, the quadratic time complexity of standard K-means or EM is not efficient for big data, nor does it scale well as the data set grows. Instead, I would look for algorithms with O(n) or O(n·log n) time complexity.

Also, K-means doesn't guarantee global convergence; it converges to a local minimum. The same caveat applies to EM, which converges to a local (not global) maximum of the likelihood. EM can be seen as a probabilistic generalization of K-means, and K-means can be derived mathematically from it as a hard-assignment special case.

Indulgent answered 11/3, 2020 at 8:50 Comment(0)
