How to cluster by trend instead of by distance in R?

The k-medoids algorithm in the clara() function (from the cluster package) forms clusters by distance, so I get this pattern:

library(cluster)  # provides clara()
a <- matrix(c(0,1,3,2,0,.32,1,.5,0,.35,1.2,.4,.5,.3,.2,.1,.5,.2,0,-.1), byrow=TRUE, nrow=5)
cl <- clara(a, 2)
matplot(t(a), type="b", pch=20, col=cl$clustering)

[Figure: clustering by clara()]

But I want to find a clustering method that assigns a cluster to each line according to its trend, so lines 1, 2 and 3 belong to one cluster and lines 4 and 5 to another.

Crease answered 11/5, 2012 at 17:13 Comment(0)

This question might be better suited to stats.stackexchange.com, but here's a solution anyway.

Your question is actually "How do I pick the right distance metric?". Instead of Euclidean distance between these vectors, you want a distance that measures similarity in trend.

Here's one option:

a1 <- t(apply(a, 1, scale))  # standardize each row to mean 0, sd 1
a2 <- t(apply(a1, 1, diff))  # successive differences within each row

cl <- clara(a2, 2)
matplot(t(a), type="b", pch=20, col=cl$clustering)


Instead of defining a new distance metric, I've accomplished essentially the same thing by transforming the data. First, I scale each row so that we can compare relative trends without differences in scale throwing us off. Then I convert the data to successive differences.

Warning: This is not necessarily going to work for all "trend" data. In particular, looking at successive differences only captures a single, limited facet of "trend". You may have to put some thought into more sophisticated metrics.
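
For comparison, here is a minimal sketch of the other route mentioned above, defining a distance that reflects trend directly. It assumes that 1 minus the Pearson correlation between two lines is an acceptable trend dissimilarity, and it swaps clara() for pam() from the same cluster package, because pam() accepts a dissimilarity object while clara() works on raw data only. The names trend_diss and cl_cor are made up for this illustration.

```r
library(cluster)  # pam() ships with the cluster package

a <- matrix(c(0,1,3,2, 0,.32,1,.5, 0,.35,1.2,.4, .5,.3,.2,.1, .5,.2,0,-.1),
            byrow = TRUE, nrow = 5)

# cor() correlates columns, so transpose to correlate the rows (the lines).
# 1 - r is 0 for identically shaped lines and 2 for perfectly opposite ones.
# Assumption: no row is constant, otherwise cor() returns NA for it.
trend_diss <- as.dist(1 - cor(t(a)))

# pam() is k-medoids, like clara(), but it can take a dissimilarity object.
cl_cor <- pam(trend_diss, k = 2)$clustering
```

Passing cl_cor as the col argument of the matplot() call above colours the lines by these clusters. Correlation ignores both level and scale, so this measures shape only; whether that is the right notion of "trend" depends on the data.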

Chemisorption answered 11/5, 2012 at 17:25 Comment(0)

Do more preprocessing. In any data mining task, preprocessing is 90% of the effort.

For example, if you want to cluster by trends, you should probably apply the clustering to the trends rather than to the raw values. For instance, standardize each curve to a mean of 0 and a standard deviation of 1, compute the differences from one value to the next, and then apply the clustering to this preprocessed data.
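
The preprocessing steps described here could be sketched as follows. This is only an illustration on the example matrix from the question: the helper name trend_features is hypothetical, and hierarchical clustering with average linkage is an arbitrary choice of clustering step, not something the answer prescribes.

```r
a <- matrix(c(0,1,3,2, 0,.32,1,.5, 0,.35,1.2,.4, .5,.3,.2,.1, .5,.2,0,-.1),
            byrow = TRUE, nrow = 5)

# Hypothetical helper: standardize one curve to mean 0 and sd 1, then take
# the differences from one value to the next.
# Assumption: the curve is not constant (sd > 0), otherwise this yields NaN.
trend_features <- function(x) diff(as.vector(scale(x)))

# apply() returns one column per curve, so transpose back to one row per curve.
feat <- t(apply(a, 1, trend_features))

# Cluster the preprocessed data rather than the raw values.
hc <- hclust(dist(feat), method = "average")
groups <- cutree(hc, k = 2)
```

Keeping the preprocessing in its own function makes it easy to swap in a different transformation without touching the clustering step.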

Seasonal answered 11/5, 2012 at 18:37 Comment(2)
Is this different from what @Chemisorption has proposed? I might not be seeing the difference. - Crease
Having just read through his answer: no, it's not substantially different. I'm suggesting a different scaling. However, the key point I wanted to make is that this belongs to the important preprocessing step, which you must not neglect. That's why there is always so much talk about the KDD process: en.wikipedia.org/wiki/Data_mining#Process. It's 90% of the effort in real mining, but at most 5% of the scientific results, which focus on new algorithms. - Seasonal

You can use the k-means clustering algorithm, but before going there I suggest you create an N×N matrix in which each element is the correlation score of one trend against another.

Then use any clustering algorithm, such as k-means or hierarchical clustering, to cluster similar trends.

R Code

a <- matrix(c(0,1,3,2,0,.32,1,.5,0,.35,1.2,.4,.5,.3,.2,.1,.5,.2,0,-.1), byrow=TRUE, nrow=5)

library(TSclust)  # provides diss(); each row of a is treated as one series

Tech1 <- diss(a, "COR")      # Correlation-based dissimilarity
Tech2 <- diss(a, "EUCL")     # Euclidean distance
Tech3 <- diss(a, "DTWARP")   # Dynamic Time Warping

# One k-means result per dissimilarity, each stored under its own name
clust1 <- kmeans(Tech1, 3)
clust2 <- kmeans(Tech2, 3)
clust3 <- kmeans(Tech3, 3)

clust1$cluster
>> 1 2 3 4 5 
>> 1 2 2 3 3 

clust2$cluster
>> 1 2 3 4 5 
>> 1 2 2 3 3

clust3$cluster
>> 1 2 3 4 5 
>> 3 2 2 1 1 
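
The N×N correlation matrix can also be built in base R without extra packages. The sketch below is a simplified variant, not the TSclust code above: it assumes 1 minus the Pearson correlation as the dissimilarity and feeds it to hclust(), which takes a dissimilarity object directly, whereas kmeans() treats each row of the matrix it is given as a point in feature space. The names trend_cor and groups are illustrative.

```r
a <- matrix(c(0,1,3,2, 0,.32,1,.5, 0,.35,1.2,.4, .5,.3,.2,.1, .5,.2,0,-.1),
            byrow = TRUE, nrow = 5)

# N x N matrix: element [i, j] is the correlation of trend i with trend j.
# cor() works column-wise, hence the transpose.
trend_cor <- cor(t(a))

# Turn similarity into dissimilarity and cluster hierarchically.
hc <- hclust(as.dist(1 - trend_cor), method = "average")

# Cut the tree into two groups; plot(hc) shows the dendrogram if needed.
groups <- cutree(hc, k = 2)
```

On the example data, lines 1 to 3 correlate positively with each other and negatively with lines 4 and 5, so cutting the tree at two groups separates the rising-then-falling lines from the falling ones.
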
Pants answered 4/2, 2019 at 7:54 Comment(0)
