Dumping clustering results with vector names

Asked 23/1, 2013 at 9:51 Answered 1/4, 2015 at 23:50

I have created my Vectors as described in this question and have run mahout kmeans on the data.

Since I'm using Mahout 0.7, the clusterdump command didn't work as described in Mahout in Action, but I got it to work like this:

export HADOOP_CLASSPATH=/path/to/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar:/path/to/mahout-distribution-0.7/integration/target/mahout-integration-0.7.jar
hadoop jar core/target/mahout-core-0.7-job.jar org.apache.mahout.utils.clustering.ClusterDumper -i /clustering/out/clusters-20-final -o textout -of TEXT

and I am getting lines like this one:

VL-1383471{n=192 c=[0.180, -0.087, 0.281, 0.512, 0.678, 1.833, 2.613, 0.313, 0.226, 1.023, 0.229, -0.104, -0.461, -0.553, -0.318, 0.315, 0.658, 0.245, 0.635, 0.220, 0.660, 0.193, 0.277, -0.182, 0.497, 0.346, 0.658, 0.660, 0.191, 0.660, 0.636, 0.018, 0.519, 0.335, 0.535, 0.008, -0.028, 0.461, 0.229, 0.287, 0.619, 0.509, 0.566, 0.389, -0.075, -0.180, -0.461, 0.381, -0.108, 0.126, -0.728] r=[0.983, 0.890, 0.384, 0.823, 0.702, 0.000, 0.000, 1.132, 0.605, 0.979, 0.897, 0.862, 0.438, 0.546, 0.390, 0.171, 0.257, 0.234, 0.251, 0.106, 0.257, 0.093, 0.929, 0.077, 0.204, 0.218, 0.257, 0.257, 0.258, 0.257, 0.249, 0.112, 0.217, 0.157, 0.284, 0.197, 0.228, 0.229, 0.323, 0.401, 0.248, 0.217, 0.269, 1.002, 0.819, 0.706, 0.412, 0.964, 0.787, 0.872, 0.172]}

which is not yet useful to me, since I need the names of my vectors in each cluster. I saw that for text documents a dictionary file is created. How would I create a dictionary for my data?

Also, using -of CSV gives me an empty file. Am I doing something wrong?

Another attempt was to read the clusters-20-final/part-m-00000 file directly, as is done in listing 7.2 of Mahout in Action. It turns out the file contains ClusterWritable rather than WeightedVectorWritable, from which I can get the Cluster instance but not any of the Vectors it contains.

Greyson answered 23/1, 2013 at 9:51 Comment(0)

A bit late, but this might help someone somewhere, sometime.

When running

KMeansDriver.run(input, clustersIn, outputPath, measure, convergenceDelta, maxIterations, true, 0.0, false);

one of the outputs was a directory called clusteredPoints (it is written when the seventh argument, runClustering, is true). It contains a part file with all the clustered vectors, keyed by cluster id. This means that something like this

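    // Imports (package names as in Mahout 0.8; they may differ in other versions)
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.clustering.Cluster;
    import org.apache.mahout.clustering.classify.WeightedVectorWritable;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.Vector;

    IntWritable key = new IntWritable();                          // cluster id
    WeightedVectorWritable value = new WeightedVectorWritable();  // clustered point

    Path clusteredPoints = new Path(output + "/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000");
    FileSystem fs = FileSystem.get(clusteredPoints.toUri(), new Configuration());

    // try-with-resources closes the reader; any exception simply propagates
    try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, clusteredPoints, fs.getConf())) {
        while (reader.next(key, value)) {
            Vector v = value.getVector();
            // Names are available only if the input vectors were written as NamedVectors
            String name = (v instanceof NamedVector) ? ((NamedVector) v).getName() : "(unnamed)";
            System.out.println(key.get() + "\t" + name);
        }
    }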

seems to do the trick. Using something like this, I was able to get a good sense of what was clustered where when running my tests with k-means clustering and Mahout.

I was using Mahout 0.8 when I did this.
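
To get from that loop to "the names of my vectors in each cluster", the (cluster id, name) pairs still have to be grouped. Below is a minimal sketch of that step using only the JDK. The class name ClusterMembers and the sample ids and names are hypothetical; in practice you would call add(key.get(), name) once per record inside the reader loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ClusterMembers {
    // Vector names per cluster id; TreeMap keeps the cluster ids sorted.
    private final Map<Integer, List<String>> members = new TreeMap<>();

    // Record that the vector with this name was assigned to this cluster.
    void add(int clusterId, String vectorName) {
        members.computeIfAbsent(clusterId, id -> new ArrayList<>()).add(vectorName);
    }

    Map<Integer, List<String>> asMap() {
        return members;
    }

    public static void main(String[] args) {
        ClusterMembers cm = new ClusterMembers();
        // Hypothetical pairs standing in for the records read from clusteredPoints
        cm.add(7, "doc-a");
        cm.add(3, "doc-b");
        cm.add(7, "doc-c");
        // Print one line per cluster: id, then the names assigned to it
        cm.asMap().forEach((id, names) -> System.out.println(id + "\t" + names));
    }
}
```

Holding every name in memory is fine for test-sized data; for a large clustering run it would probably be better to write the pairs out and group them in a separate job.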

Marcus answered 18/11, 2013 at 3:0 Comment(1)

Sorry for bumping like this, but do you have any idea why the file in clusteredPoints is empty? I get cluster centroids in clusters-x and clusters-x-final, but the file in clusteredPoints is empty. I used your code, and it just exits the while loop. Any help would be great. Thanks – Serle 25/8, 2014 at 14:12

(A really late answer, but since I just spent a day figuring this out, I thought I would share it.)

What you are missing is a dictionary that maps each vector dimension's name to its index. clusterdump uses this dictionary to label the dimensions of each vector in its output.

When you run clusterdump, you can specify two additional flags:

-d: dictionary file
-dt: type of the dictionary file (text|sequencefile)

Here is a sample invocation:

mahout clusterdump -i clusteringExperiment/exp1/initialCentroids/clusters-0-final -d clusteringExperiment/dictionary/vectorDimensions -dt sequencefile

and your output will look something like:

VL-0{n=185 c=[A:0.006, G:0.550, M:0.011, O:0.026, S:0.000, T:0.072, U:0.096, V:0.010] r=[A:0.029, G:0.176, M:0.043, O:0.054, S:0.001, T:0.098, U:0.113, V:0.035]}

Note that the dictionary is a simple key-value file, where the key is the category name (a string) and the value is its numerical index in the vector.
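
To make the role of the dictionary concrete, here is a small JDK-only sketch of the index-to-name lookup that produces labels like the ones in the output above. It is not Mahout's implementation; the class CentroidLabeler, its label method, and the three-entry dictionary are made up for illustration, and the dictionary is assumed to be already loaded into an array indexed by dimension.

```java
import java.util.Locale;
import java.util.StringJoiner;

class CentroidLabeler {
    // dictionary[i] is the name of dimension i; values[i] is the weight in that dimension.
    static String label(String[] dictionary, double[] values) {
        StringJoiner out = new StringJoiner(", ", "[", "]");
        for (int i = 0; i < values.length; i++) {
            // Fall back to the bare index when the dictionary has no entry for i
            String name = (i < dictionary.length && dictionary[i] != null)
                    ? dictionary[i]
                    : String.valueOf(i);
            out.add(name + ":" + String.format(Locale.ROOT, "%.3f", values[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] dictionary = {"A", "G", "M"};      // hypothetical dimension names
        double[] centroid = {0.006, 0.550, 0.011};  // first three values of the sample centroid
        System.out.println(label(dictionary, centroid));  // prints [A:0.006, G:0.550, M:0.011]
    }
}
```

Without a dictionary, every dimension falls back to its index, which corresponds to the unlabeled output shown in the question.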

Kazoo answered 1/4, 2015 at 23:50 Comment(0)
