Similarity distance measures

I have vectors like this:

v1 = {0 0 0 1 1 0 0 1 0 1 1}
v2 = {0 1 1 1 1 1 0 1 0 1 0}
v3 = {0 0 0 0 0 0 0 0 0 0 1}

I need to calculate the similarity between them. The Hamming distance between v1 and v2 is 4, and between v1 and v3 it is also 4. But because I am interested in the groups of '1's that sit together, to me v2 is far more similar to v1 than v3 is.

Is there any distance metric that can capture this in the data?

The data represent the occupancy of a house over time, which is why this matters: '1' means occupied, '0' means unoccupied.
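For reference, the Hamming figures above can be checked with a quick sketch:

def hamming(a, b):
    # number of positions where the two vectors differ
    return sum(x != y for x, y in zip(a, b))

v1 = [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1]
v2 = [0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0]
v3 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(hamming(v1, v2), hamming(v1, v3))  # 4 4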

Guinna answered 11/5, 2013 at 11:29 Comment(4)
so what is the question?Crinum
sorry, already edited: is there any distance metric that can capture this?Guinna
"I am interested in the groups of '1' which are together." Could you explain what you mean by that? Are v1 and v2 more similar because they have the same number of groups?Sesqui
Well, basically yes: v1 and v2 are more similar because they have the same number of groups. v2 is basically vector v1 with only the first group of '1's made "wider", while v3 is an almost empty vector.Guinna

It sounds like you need the cosine similarity measure:

similarity = cos(v1, v2) = v1 * v2 / (|v1| |v2|)

where v1 * v2 is the dot product of v1 and v2:

v1 * v2 = v1[1]*v2[1] + v1[2]*v2[2] + ... + v1[n]*v2[n]

Essentially, the dot product shows how many elements in the two vectors have a 1 at the same position: if v1[k] == 1 and v2[k] == 1, the final sum (and thus the similarity) is increased; otherwise it is unchanged.
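For the vectors in the question, v1 and v2 both hold a 1 at positions 4, 5, 8 and 10, so v1 * v2 = 4, while v1 and v3 share only the last position, so v1 * v3 = 1.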

You can use the dot product itself, but sometimes you will want the final similarity to be normalized, e.g. to lie between 0 and 1. In that case you can divide the dot product of v1 and v2 by their lengths, |v1| and |v2|. Essentially, a vector's length is the square root of the dot product of the vector with itself:

|v| = sqrt(v[1]*v[1] + v[2]*v[2] + ... + v[n]*v[n])
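For a 0/1 vector this is just the square root of the number of ones, so |v1| = sqrt(5), |v2| = sqrt(7) and |v3| = 1.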

Having all of these, it's easy to implement cosine similarity as follows (example in Python):

from math import sqrt

def dot(v1, v2):
    # sum of element-wise products; for 0/1 vectors this counts
    # the positions where both vectors hold a 1
    return sum(x * y for x, y in zip(v1, v2))

def length(v):
    # Euclidean norm: square root of the vector's dot product with itself
    return sqrt(dot(v, v))

def sim(v1, v2):
    # cosine of the angle between v1 and v2
    return dot(v1, v2) / (length(v1) * length(v2))
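
Applied to the vectors from the question, this gives:

v1 = [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1]
v2 = [0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0]
v3 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(sim(v1, v2))  # 4 / sqrt(35) ~= 0.676
print(sim(v1, v3))  # 1 / sqrt(5)  ~= 0.447

so v2 comes out considerably closer to v1 than v3 does, matching the intuition in the question.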

Note that I described similarity (how close two vectors are to each other), not distance (how far apart they are). If you need a distance instead, you can calculate it as dist = 1 / sim.

Lutenist answered 11/5, 2013 at 12:34 Comment(7)
I believe it should be dist = 1 - sim and not 1 / simMagdaleno
@ThalisK.: both will work. The idea is that distance is in some sense inverse of similarity, so any inverse (and strictly monotone) function should work, and you can select concrete function based on your concrete interpretation of "distance".Lutenist
Thanks. That makes sense. I would appreciate it if you would then take a look at this question: #25181604Magdaleno
If I have more than three vectors, e.g. if I want to check whether v4 is different from v1, v2 and v3, can I apply your answer?Jeddy
Cosine similarity is a pairwise distance measure, so you can apply it to any number of vectors as long as you consider them in pairs (e.g. v4 vs v1, v4 vs v2, etc.). If you want a measure that works with 3 or more vectors at the same time, you should be more specific about the desirable properties of this measure. E.g. you may want the average distance of v4 from v1, v2 and v3, which is as easy as (dist(v4, v1) + dist(v4, v2) + dist(v4, v3)) / 3. So it all depends on what exactly you want to achieve.Lutenist
Any reason why your length function does not account for the sqrt? |v| = sqrt(v_1^2 + v_2^2 + ...) =/= v_1^2 + v_2^2 + ...Regimen
You've just got a badge for watchfulness :) I've edited the code, thanks.Lutenist

There are literally hundreds of distance functions, including distance measures for sets, such as Dice and Jaccard.
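
For the binary vectors in the question, both are easy to sketch by treating each vector as the set of positions holding a 1 (a minimal illustration, not from any particular library):

def jaccard(a, b):
    # |intersection| / |union| of the positions holding a 1
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union

def dice(a, b):
    # 2 * |intersection| / (|A| + |B|)
    inter = sum(x & y for x, y in zip(a, b))
    return 2 * inter / (sum(a) + sum(b))

# jaccard(v1, v2) = 0.5 vs jaccard(v1, v3) = 0.2,
# so Jaccard also ranks v2 closer to v1 than v3, like cosine does.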

You may want to get the book "Dictionary of Distance Functions"; it's pretty good.

Immunotherapy answered 12/5, 2013 at 13:25 Comment(3)
Looked up the title and didn't find it. Could you provide the author, ISBN, and exact title, please?Magdaleno
First result on Google Books. ISBN: 9780080465548Immunotherapy
The book is a great reference. It has since been updated and renamed Encyclopedia of Distances: link.springer.com/book/10.1007/978-3-662-52844-0Dispirited

Case 1: If the position of the ones in the series is relevant, then:

I recommend the Dynamic Time Warping (DTW) distance. It has proven incredibly useful in applications on time-series data.

To check whether it can be applied to your problem, I used the code presented here: https://jeremykun.com/2012/07/25/dynamic-time-warping/

d13 = dynamicTimeWarp(v1, v3)
d12 = dynamicTimeWarp(v1, v2)
d23 = dynamicTimeWarp(v2, v3)

d23, d12, d13
# (3, 1, 3)

As you can see, d12 is the lowest, so v1 and v2 are the most similar. Further information on DTW can be found elsewhere on this forum; for research papers, I recommend anything by Eamonn Keogh.
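
For completeness, here is a minimal self-contained sketch of the classic DTW dynamic program with absolute-difference cost (my own illustration, not necessarily identical to the linked implementation); it reproduces the values above:

def dtw_distance(a, b):
    # d[i][j] = minimal cost of warping a[:i] onto b[:j]
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three possible predecessor paths
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

# dtw_distance(v1, v2) == 1; v1 vs v3 and v2 vs v3 both give 3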

Case 2: If the position of the ones is not relevant:

I simply agree with Deepu's suggestion of taking the average as a feature.

Discalced answered 25/7, 2016 at 13:3 Comment(0)

I think you can simply take the average of the values in each vector. For example, v1 here has an average of 0.4545, the average of v2 is 0.6364, and the average of v3 is 0.0909. If the only possible values in the vectors are 0 and 1, then vectors with equal or nearly equal averages will serve your purpose.
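
As a quick sketch (the numbers above, recomputed):

def occupancy_rate(v):
    # fraction of time slots marked occupied
    return sum(v) / len(v)

# occupancy_rate(v1) = 5/11 ~= 0.4545
# occupancy_rate(v2) = 7/11 ~= 0.6364
# occupancy_rate(v3) = 1/11 ~= 0.0909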

Newel answered 11/5, 2013 at 11:41 Comment(3)
That's actually a good idea; the problem I have, though, is that I would have to mix the two metrics together somehow. The vectors 0 0 1 1 and 1 1 0 0 would both return an average of 0.5, while my metric gives 4 because all of the ones are displaced. Is it possible to combine these two metrics so that each yields half of the final value? Or is that too unpredictable?Guinna
What about the standard deviation? Would it help?Newel
In a way, I guess, if the underlying distribution were Gaussian. But if I again take the 0 0 1 1 and 1 1 0 0 example, the standard deviation gives the same results. I see what you mean, but then I would have to first cluster the data, compute the mean of each cluster, and then compare the mean and standard deviation of each cluster. I am not sure such a complicated solution would make a significant difference.Guinna

There is a web site introducing the various types of vector similarity measures:

http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/

I think it will help you decide which similarity measure you should use.

Briefly summarizing the above link, there are five popular similarity measures between vectors:

  1. Euclidean distance: simply the straight-line distance between the vectors.

  2. Cosine: the cosine of the angle (theta) between the vectors.

  3. Manhattan: the sum of the absolute differences of their Cartesian coordinates. For example, in a plane with p1 at (x1, y1) and p2 at (x2, y2), Manhattan distance = |x1 - x2| + |y1 - y2|.

  4. Minkowski: a generalized metric form of the Euclidean and Manhattan distances (see the sketch below).

  5. Jaccard: similarity between sets, where each feature in one set is compared with the other set to find the differences.
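
A quick sketch of the distance measures above in plain Python (the function names are mine; Minkowski with parameter p gives Manhattan at p = 1 and Euclidean at p = 2):

def minkowski(a, b, p):
    # generalized distance of order p
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    return minkowski(a, b, 1)  # p = 1

def euclidean(a, b):
    return minkowski(a, b, 2)  # p = 2

Cosine similarity is already implemented in the accepted answer above, and Jaccard in the answer mentioning Dice.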

With the keywords above you can Google for further explanation. Hope it helps.

Appendage answered 17/4, 2018 at 6:34 Comment(0)
