Hello Community,
I'm new (as a member) to the site, so if you think it might be better to post it on http://datascience.stackexchange.com, let me know.
I am tackling a machine learning problem that requires calculating the distance between NxM-dimensional elements in order to implement certain classification algorithms.
Each element's attribute is a 2D matrix (Matr), so I'm searching for the best algorithm to calculate the distance between two 2D matrices. As you will see below, the "easy" solution is to convert the 2D matrix into a 1D vector and then apply any standard distance algorithm, but I'm searching for something more convenient (if one exists).
So far I have used the following approaches:
Euclidean distance over all corresponding entries of the two matrices.
    import numpy as np

    def dist_euclidean(elem1, elem2):
        # Sum squared differences over every cell, then take the square root
        t_sum = 0
        for i in range(len(elem1.Matr)):
            for j in range(len(elem1.Matr[0])):
                t_sum += np.square(elem1.Matr[i][j] - elem2.Matr[i][j])
        return np.sqrt(t_sum)
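For reference, the same distance can be computed in one vectorized step, assuming Matr is (or can be converted to) a NumPy array; it is simply the Frobenius norm of the difference:

    import numpy as np

    def dist_euclidean_vec(elem1, elem2):
        # Frobenius norm of the element-wise difference;
        # equivalent to the loop version above
        return np.linalg.norm(np.asarray(elem1.Matr) - np.asarray(elem2.Matr))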
Cosine similarity, for which I had to convert the (NxM) 2D matrix into a (1xNM) vector.
    from scipy.spatial import distance

    def dist_cosine(elem1, elem2):
        # Flatten both matrices row by row into 1D lists
        temp1 = []
        temp2 = []
        for i in range(len(elem1.Matr)):
            temp1.extend(elem1.Matr[i])
            temp2.extend(elem2.Matr[i])
        # distance.cosine returns the cosine *distance* (1 - similarity)
        return distance.cosine(temp1, temp2)
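A minimal equivalent sketch using NumPy's ravel instead of the manual extend loops (again assuming Matr is array-like):

    import numpy as np
    from scipy.spatial import distance

    def dist_cosine_vec(elem1, elem2):
        # ravel flattens row by row (C order), matching the extend-based loop
        return distance.cosine(np.ravel(elem1.Matr), np.ravel(elem2.Matr))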
KL divergence (wiki), for which I also only found implementations for 1D arrays (vectors), so I did the following conversions:
Computed the KL divergence between each pair of corresponding rows and then averaged the results.
    import numpy as np
    from scipy.stats import entropy

    def dist_KL_row_avg(elem1, elem2):
        # entropy(pk, qk) computes the KL divergence D(pk || qk) per row
        Y = []
        for i in range(len(elem1.Matr)):
            Y.append(entropy(elem1.Matr[i], elem2.Matr[i]))
        return np.average(Y)
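If your SciPy is recent enough that entropy accepts an axis argument, the row-wise computation collapses to a sketch like this:

    import numpy as np
    from scipy.stats import entropy

    def dist_KL_row_avg_vec(elem1, elem2):
        # KL divergence per row (axis=1), then the mean across rows
        return np.mean(entropy(np.asarray(elem1.Matr),
                               np.asarray(elem2.Matr), axis=1))

One caveat either way: KL divergence is infinite whenever elem2 has a zero probability in a cell where elem1 does not, which can easily happen with sparse transition rows like those in the edit below.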
Converted the (NxM) 2D matrix into a (1xNM) vector by appending the rows and then computed the KL divergence over the full vectors.
    import numpy as np
    from scipy.stats import entropy

    def dist_KL_1d_total(elem1, elem2):
        # Flatten both matrices row by row into 1D lists
        temp1 = []
        temp2 = []
        for i in range(len(elem1.Matr)):
            temp1.extend(elem1.Matr[i])
            temp2.extend(elem2.Matr[i])
        return entropy(temp1, temp2)
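The flattened variant admits the same ravel shortcut (a sketch, assuming array-like Matr). Note that entropy normalizes its inputs to sum to 1, so this effectively treats the whole flattened matrix as one probability distribution:

    import numpy as np
    from scipy.stats import entropy

    def dist_KL_1d_total_vec(elem1, elem2):
        # One KL divergence over the flattened (row-major) matrices
        return entropy(np.ravel(elem1.Matr), np.ravel(elem2.Matr))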
KS test (wiki), for which I again only found implementations for 1D arrays (vectors), so I did the same conversions as for KL:
Ran the KS test between each pair of corresponding rows and then averaged the results.
    import numpy as np
    from scipy.stats import ks_2samp

    def dist_KS_row_avg(elem1, elem2):
        # ks_2samp returns (statistic, pvalue) for each pair of rows
        Y = []
        for i in range(len(elem1.Matr)):
            Y.append(ks_2samp(elem1.Matr[i], elem2.Matr[i]))
        # Use statistic / pvalue as the per-row distance, then average
        Z = [x[0] / x[1] for x in Y]
        return np.average(Z)
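For readability, the same computation can use the result's named fields (ks_2samp returns an object with .statistic and .pvalue), as in this sketch:

    import numpy as np
    from scipy.stats import ks_2samp

    def dist_KS_row_avg_compact(elem1, elem2):
        # One KS test per pair of corresponding rows
        results = [ks_2samp(r1, r2) for r1, r2 in zip(elem1.Matr, elem2.Matr)]
        return np.mean([r.statistic / r.pvalue for r in results])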
Converted the (NxM) 2D matrix into a (1xNM) vector by appending the rows and then ran a single KS test over the full vectors.
    from scipy.stats import ks_2samp

    def dist_KS_1d_total(elem1, elem2):
        # Flatten both matrices row by row into 1D lists
        temp1 = []
        temp2 = []
        for i in range(len(elem1.Matr)):
            temp1.extend(elem1.Matr[i])
            temp2.extend(elem2.Matr[i])
        Y = ks_2samp(temp1, temp2)
        return Y[0] / Y[1]  # statistic / pvalue
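Equivalently, as a sketch with ravel and named result fields:

    import numpy as np
    from scipy.stats import ks_2samp

    def dist_KS_1d_total_compact(elem1, elem2):
        res = ks_2samp(np.ravel(elem1.Matr), np.ravel(elem2.Matr))
        return res.statistic / res.pvalue

One thing to be aware of with this design: dividing the statistic by the p-value makes the "distance" blow up when the p-value gets tiny, whereas the KS statistic alone is already bounded in [0, 1], so averaging just res.statistic may be a more stable choice.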
All of the above work for my problem, but I got curious because I couldn't find anything more specific that satisfied me.
Edit 1. As pltrdy suggested, here is some more information about the problem.
The initial data of each element is a series of codes, e.g. (C->B->D->B->A), which is then converted into a transition matrix that is normalized per row. Thus each cell [i][j] of the matrix holds the probability of a transition from code i to code j. For example (a runnable sketch of this conversion follows the example below):
IN: A->C->B->B->A->C->C->A
OUT:
         A     B     C
    A    0     0     1
    B    0.5   0.5   0
    C    0.33  0.33  0.33
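For completeness, here is a hypothetical helper (transition_matrix is my own name, not part of the original code) that reproduces the conversion above:

    import numpy as np

    def transition_matrix(seq, codes):
        # Map each code to a row/column index
        idx = {c: k for k, c in enumerate(codes)}
        M = np.zeros((len(codes), len(codes)))
        # Count transitions between consecutive codes in the series
        for a, b in zip(seq, seq[1:]):
            M[idx[a], idx[b]] += 1
        # Normalize each row to probabilities (rows with no outgoing
        # transitions stay all-zero)
        row_sums = M.sum(axis=1, keepdims=True)
        return np.divide(M, row_sums, out=np.zeros_like(M),
                         where=row_sums > 0)

    # transition_matrix("ACBBACCA", "ABC") reproduces the matrix above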
With that in mind, the final goal is to classify the different code series. The series do not all have the same length, but they are built from the same set of codes, so the transition probability matrix has the same dimensions in every case. My initial question was aimed at finding the most suitable distance algorithm, the one that will produce the best classification results.