How can I use KNN /K-means to clustering time series in a dataframe

Asked 6/7, 2017 at 9:33 Answered 27/9, 2018 at 17:38

python time-series cluster-analysis

Suppose a dataframe which contains 1000 rows. Each row represents a time series.

Then I built a DTW algorithm to calculate the distance between 2 rows.

I don't know what to do next to complish an unsupervised classification task for the dataframe.

How to label all rows of the dataframe?

Chlordane answered 6/7, 2017 at 9:33 Comment(1)

There is no knn clustering. – Spidery 7/7, 2017 at 6:41

Definitions

KNN algorithm = K-nearest-neighbour classification algorithm

K-means = centroid-based clustering algorithm

DTW = Dynamic Time Warping a similarity-measurement algorithm for time-series

I show below step by step about how the two time-series can be built and how the Dynamic Time Warping (DTW) algorithm can be computed. You can build a unsupervised k-means clustering with scikit-learn without specifying the number of centroids, then the scikit-learn knows to use the algorithm called auto.

Building the time-series and computing the DTW

You have have two time-series and you compute the DTW such that

import pandas as pd
import numpy as np
import random
from dtw import dtw
from matplotlib.pyplot import plot
from matplotlib.pyplot import imshow
from matplotlib.pyplot import cm

from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer 
#About classification, read the tutorial
#http://scikit-learn.org/stable/tutorial/basic/tutorial.html


def createTs(myStart, myLength):
    index = pd.date_range(myStart, periods=myLength, freq='H'); 
    values= [random.random() for _ in range(myLength)];
    series = pd.Series(values, index=index);  
    return(series)


#Time series of length 30, start from 1/1/2000 & 1/2/2000 so overlap
myStart='1/1/2000'
myLength=30
timeS1=createTs(myStart, myLength)
myStart='1/2/2000'
timeS2=createTs(myStart, myLength) 

#This could be your dataframe but unnecessary here
#myDF = pd.DataFrame([x for x in timeS1.data], [x for x in timeS2.data])#, columns=['data1', 'data2'])

x=[xxx*100 for xxx in sorted(timeS1.data)]
y=[xx for xx in timeS2.data]

choice="dtw"

if (choice="timeseries"):
    print(timeS1)
    print(timeS2)
if (choice=="drawingPlots"):
    plot(x)
    plot(y)
if (choice=="dtw"):
    #DTW with the 1st order norm
    myDiff=[xx-yy for xx,yy in zip(x,y)]
    dist, cost, acc, path = dtw(x, y, dist=lambda x, y: np.linalg.norm(myDiff, ord=1))
    imshow(acc.T, origin='lower', cmap=cm.gray, interpolation='nearest')
    plot(path[0], path[1], 'w')

Classification of the time-series with KNN

It is not evident in the question about what should be labelled and with which labels? So please provide the following details

What should we label in the data-frame? The path computed by DTW algorithm?
Which type of labeling? Binary? Multiclass?

after which we can decide our classification algorithm that may be the so-called KNN algorithm. It works such that you have two separate data sets: training set and test set. By training set, you teach the algorithm to label the time series while the test set is a tool by which we can measure about how well the model works with model selection tools such as AUC.

Small puzzle left open until details provided about the questions

#PUZZLE
#from tutorial (#http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
newX = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
newY = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
newY = MultiLabelBinarizer().fit_transform(newY)
#Continue to the article.

Scikit-learn comparison article about classifiers is provided in the second enumerate item below.

Clustering with K-means (not the same as KNN)

K-means is the clustering algorithm and its unsupervised version you can use such that

#Unsupervised version "auto" of the KMeans as no assignment for the n_clusters
myClusters=KMeans(path)
#myClusters.fit(YourDataHere)

which is very different algorithm than the KNN algorithm: here we do not need any labels. I provide you further material on the topic below in the first enumerate item.

Thank you!Bro! Yes, I want to sort and label different sequences. Now I have 50k rows of time series data about virtual machines workload. I want to use some methods to do a unsupervised classification job for them. As I know , I try to use dtw (compute the distance between each 2 time series)or maybe use rnn/lstm to complish this target(But now I have not a further solution. So I need your help,thinks guy!). – Chlordane 10/7, 2017 at 2:39

Then ,After the time series data classified and labeled(based on the similarity between range of fluctuation and amplitude of fluctuation). I want to build some optimization portfolio for them. Because these time series are the virtual machines' cpu workload. I want to find out a way to build [a Cloud Server Optimization with Load Balancing]. – Chlordane 10/7, 2017 at 2:44

Please forgive my poor english expression.Thank you! This is a really urgent assignment. – Chlordane 10/7, 2017 at 2:47

We should label a classification like 1,2,3,4.. etc, just labels for each time series. But the numbers are not confirmed. Firstly,I want to use dtw or rnn methods the do an unsupervised label task. And then I try to do an optimization portfolio for these time series to make some optimized combination.So these combination of virtual machines' workload would make a load balance. Am I expressing clearly? – Chlordane 10/7, 2017 at 2:53

You can utilize DTW. In fact, I had the same problem for one of my projects and I wrote my own class for that in Python.

Here is the logic;

Create your all cluster combinations. k is for cluster count and n is for number of series. The number of items returned should be n! / k! / (n-k)!. These would be something like potential centers.
For each series, calculate distances for each center in each cluster groups and assign it to the minimum one.
For each cluster groups, calculate total distance within individual clusters.
Choose the minimum.

And the code;

import numpy as np
import pandas as pd
from itertools import combinations
import time

def dtw_distance(x, y, d=lambda x,y: abs(x-y), scaled=False, fill=True):
    """Finds the distance of two arrays by dynamic time warping method
    source: https://en.wikipedia.org/wiki/Dynamic_time_warping

    Dependencies:
        import numpy as np
    Args:
        x, y: arrays
        d: distance function, default is absolute difference
        scaled: boolean, should arrays be scaled before calculation
        fill: boolean, should NA values be filled with 0
    returns:
        distance as float, 0.0 means series are exactly same, upper limit is infinite
    """
    if fill:
        x = np.nan_to_num(x)
        y = np.nan_to_num(y)
    if scaled:
        x = array_scaler(x)
        y = array_scaler(y)
    n = len(x) + 1
    m = len(y) + 1
    DTW = np.zeros((n, m))
    DTW[:, 0] = float('Inf')
    DTW[0, :] = float('Inf')
    DTW[0, 0] = 0

    for i in range(1, n):
        for j in range(1, m):
            cost = d(x[i-1], y[j-1])
            DTW[i, j] = cost + min(DTW[i-1, j], DTW[i, j-1], DTW[i-1, j-1])

    return DTW[n-1, m-1]

def array_scaler(x):
    """Scales array to 0-1

    Dependencies:
        import numpy as np
    Args:
        x: mutable iterable array of float
    returns:
        scaled x
    """
    arr_min = min(x)
    x = np.array(x) - float(arr_min)
    arr_max = max(x)
    x = x/float(arr_max)
    return x


class TrendCluster():
    def __init__(self):
        self.clusters = None
        self.centers = None
        self.scale = None

    def fit(self, series, n=2, scale=True):
        '''
        Work-flow
        1 - make series combination with size n, initial clusters
        2 - assign closest series to each cluster
        3 - calculate total distance for each combinations
        4 - choose the minimum

        Args:
            series: dict, keys can be anything, values are time series as list, assumes no nulls
            n: int, cluster size
            scale: bool, if scale needed
        '''
        assert isinstance(series, dict) and isinstance(n, int) and isinstance(scale, bool), 'wrong argument type'
        assert n < len(series.keys()), 'n is too big'
        assert len(set([len(s) for s in series.values()])) == 1, 'series length not same'

        self.scale = scale

        combs = combinations(series.keys(), n)
        combs = [[c, -1] for c in combs]

        series_keys = pd.Series(series.keys())
        dtw_matrix = pd.DataFrame(series_keys.apply(lambda x: series_keys.apply(lambda y: dtw_distance(series[x], series[y], scaled=scale))))
        dtw_matrix.columns, dtw_matrix.index = series_keys, series_keys
        for c in combs:
            c[1] = dtw_matrix.loc[c[0], :].min(axis=0).sum()

        combs.sort(key=lambda x: x[1])
        self.centers = {c:series[c] for c in combs[0][0]}
        self.clusters = {c:[] for c in self.centers.keys()}

        for k, _ in series.items():
            tmp = [[c, dtw_matrix.loc[k, c]] for c in self.centers.keys()]
            tmp.sort(key=lambda x: x[1])
            cluster = tmp[0][0]
            self.clusters[cluster].append(k)

        return None

    def assign(self, serie, save=False):
        '''
        Assigns the serie to appropriate cluster

        Args:
            serie, dict: 1 element dict
            save, bool: if new serie is stored to clusters

        Return:
            str, assigned cluster key
        '''
        assert isinstance(serie, dict) and isinstance(save, bool), 'wrong argument type'
        assert len(serie) == 1, 'serie\'s length is not exactly 1'

        tmp = [[c, dtw_distance(serie.values()[0], self.centers[c], scaled=self.scale)] for c in self.centers.keys()]
        tmp.sort(key=lambda x: x[1])
        cluster = tmp[0][0]
        if save:
            self.clusters[cluster].append(serie.keys()[0])

        return cluster

If you want to see it on action, you can refer my repository about Time Series Clustering.

Penelope answered 27/9, 2018 at 17:38 Comment(2)

I too will be writing my own DTW clustering algorithm - your implementation of the DTW distance is the most straightforward I've seen. However, two questions (these are questions based on the code in your repository): 1) Why do you calculate the warp path for each DTW in the fit function of your clustering class, and not in the distance function? 2) I'm having a little trouble understanding your clustering logic - is this a form of hierarchical clustering? I'm new to this subject and I'm trying to learn more, and I'd like to read more about your specific implementation. Thanks! – Mercurial 19/12, 2018 at 16:4

I calculate distance by using dtw_distance in fit method for each sample pairs, nxn times. That's why there is a square matrix. The distance function is just a tool, fit method is where that tool is used. For the second questions, I think it's closer to partitioning. I choose the best centers among the samples (instead of K-means' random creation and iteration). Then, I just assign the rest by choosing the closest. The definition of best is the smallest total distance of combinations. That is also the reason why it's a bit slow. – Penelope 20/12, 2018 at 17:23

Building the time-series and computing the DTW

Classification of the time-series with KNN

Clustering with K-means (not the same as KNN)

Further reading

Recommended topics

Hot tags