How do I calculate the cosine similarity of two vectors?
C

7

38

How do I find the cosine similarity between vectors?

I need to find the similarity to measure the relatedness between two lines of text.

For example, I have two sentences like:

system for user interface

user interface machine

… and their respective vectors after tf-idf, followed by normalisation using LSI, for example [1, 0.5] and [0.5, 1].

How do I measure the similarity between these vectors?

Crumple answered 6/2, 2009 at 13:15 Comment(0)
A
22
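public class CosineSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix sourceDoc, Matrix targetDoc) {
    // Assumes column vectors with non-negative entries (as with tf-idf weights):
    // only then does the 1-norm of the element-wise product equal the dot product.
    double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1();
    // Product of the Euclidean (Frobenius) norms of the two vectors.
    double normProduct = sourceDoc.normF() * targetDoc.normF();
    return dotProduct / normProduct;
  }
}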

I did some tf-idf work recently for my Information Retrieval unit at university. I used this cosine similarity method, which uses Jama (the Java Matrix Package).

For the full source code see IR Math with Java : Similarity Measures, a really good resource that covers quite a few different similarity measures.

Accused answered 6/2, 2009 at 13:42 Comment(0)
U
72

If you want to avoid relying on third-party libraries for such a simple task, here is a plain Java implementation:

public static double cosineSimilarity(double[] vectorA, double[] vectorB) {
    double dotProduct = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < vectorA.length; i++) {
        dotProduct += vectorA[i] * vectorB[i];
        normA += Math.pow(vectorA[i], 2);
        normB += Math.pow(vectorB[i], 2);
    }   
    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

Note that the function assumes that the two vectors have the same length. You may want to check this explicitly for safety.
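The length check mentioned above could look like the following sketch. The class name `SafeCosine` and the choice to throw `IllegalArgumentException` for mismatched lengths or zero vectors are assumptions made for illustration; they are not part of the original answer.

```java
final class SafeCosine {

    // Hypothetical helper: same computation as above, plus input validation.
    static double cosineSimilarity(double[] vectorA, double[] vectorB) {
        if (vectorA.length != vectorB.length) {
            throw new IllegalArgumentException(
                "Vectors must have the same length: "
                + vectorA.length + " vs " + vectorB.length);
        }
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < vectorA.length; i++) {
            dotProduct += vectorA[i] * vectorB[i];
            normA += vectorA[i] * vectorA[i];
            normB += vectorB[i] * vectorB[i];
        }
        // A zero vector has no direction, so the similarity is undefined;
        // without this check the division would silently return NaN.
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException(
                "Cosine similarity is undefined for a zero vector");
        }
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Vectors from the question; the exact value is 4/5.
        System.out.println(
            cosineSimilarity(new double[] {1, 0.5}, new double[] {0.5, 1}));
    }
}
```

Whether a zero vector should throw, return 0, or return NaN depends on the application; throwing is just one option that makes the problem visible early.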

Ulu answered 7/4, 2014 at 13:13 Comment(1)
Thanks, I just was too lazy to do it. :) - Nervous
A
32

Have a look at: http://en.wikipedia.org/wiki/Cosine_similarity.

Given two vectors A and B, the similarity is defined as:

cosine(theta) = (A . B) / (||A|| ||B||)

For a vector A = (a1, a2), ||A|| is defined as sqrt(a1^2 + a2^2).

For vectors A = (a1, a2) and B = (b1, b2), A . B is defined as a1 b1 + a2 b2.

So for vectors A = (a1, a2) and B = (b1, b2), the cosine similarity is given as:

  (a1 b1 + a2 b2) / (sqrt(a1^2 + a2^2) sqrt(b1^2 + b2^2))

Example:

A = (1, 0.5), B = (0.5, 1)

cosine(theta) = (0.5 + 0.5) / (sqrt(5/4) sqrt(5/4)) = 1 / (5/4) = 4/5
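
To check the worked example numerically, here is a minimal Java sketch of the two-dimensional formulas above. The class and method names (`WorkedExample`, `cosine2d`) are made up for illustration, and `Math.hypot` is used for the norm simply because it computes sqrt(x^2 + y^2) directly.

```java
final class WorkedExample {

    // Cosine similarity for 2-D vectors only, following the formulas above.
    static double cosine2d(double a1, double a2, double b1, double b2) {
        double dot = a1 * b1 + a2 * b2;     // A . B
        double normA = Math.hypot(a1, a2);  // ||A|| = sqrt(a1^2 + a2^2)
        double normB = Math.hypot(b1, b2);  // ||B|| = sqrt(b1^2 + b2^2)
        return dot / (normA * normB);
    }

    public static void main(String[] args) {
        double cosTheta = cosine2d(1, 0.5, 0.5, 1);
        // Should agree with 4/5 up to floating-point rounding.
        System.out.println("cosine(theta) = " + cosTheta);
        // The corresponding angle between the two vectors, in degrees.
        System.out.println("theta = " + Math.toDegrees(Math.acos(cosTheta)));
    }
}
```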
Asomatous answered 6/2, 2009 at 13:15 Comment(0)
H
5

For matrix code in Java I'd recommend using the Colt library. If you have it, the code looks like this (not tested or even compiled):

import cern.colt.matrix.DoubleMatrix1D;
import cern.colt.matrix.impl.DenseDoubleMatrix1D;

DoubleMatrix1D a = new DenseDoubleMatrix1D(new double[]{1, 0.5});
DoubleMatrix1D b = new DenseDoubleMatrix1D(new double[]{0.5, 1});
double cosineSimilarity = a.zDotProduct(b) / Math.sqrt(a.zDotProduct(a) * b.zDotProduct(b));

The code above could also be altered to use one of the Blas.dnrm2() methods or Algebra.DEFAULT.norm2() for the norm calculation. The result is exactly the same; which version is more readable is a matter of taste.

Hoey answered 6/2, 2009 at 13:34 Comment(0)
A
2

When I was working with text mining some time ago, I used the SimMetrics library, which provides an extensive range of different metrics in Java. If you need more, there is always R and CRAN to look at.

But coding it from the description on Wikipedia is a rather trivial task, and can be a nice exercise.

Anfractuous answered 6/2, 2009 at 13:28 Comment(1)
It looks like your SimMetrics link rotted and now points to a spam blog about shoes. github.com/Simmetrics/simmetrics looks like a better one. - Cloakanddagger
H
1

For a sparse representation of vectors as Map(dimension -> magnitude), here is a Scala version (you can do something similar in Java 8):

def cosineSim(vec1: Map[Int, Int], vec2: Map[Int, Int]): Double = {
  // Only dimensions present in both vectors contribute to the dot product.
  // toList keeps equal products from being collapsed by Set semantics.
  val dotProduct: Double = vec1.keySet.intersect(vec2.keySet).toList
    .map(dim => vec1(dim) * vec2(dim)).sum
  val norm1: Double = vec1.values.map(mag => mag * mag).sum
  val norm2: Double = vec2.values.map(mag => mag * mag).sum
  dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2))
}
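
A Java 8 counterpart of the Scala function above might look like the sketch below. The class name `SparseCosine` is hypothetical, and the maps hold `Double` weights rather than `Int` magnitudes, which is an assumption made so that tf-idf style weights fit; dimensions missing from a map are treated as zero.

```java
import java.util.HashMap;
import java.util.Map;

final class SparseCosine {

    // Vectors are maps from dimension index to weight.
    static double cosineSim(Map<Integer, Double> vec1, Map<Integer, Double> vec2) {
        // Only dimensions present in both maps contribute to the dot product.
        double dotProduct = vec1.entrySet().stream()
            .filter(e -> vec2.containsKey(e.getKey()))
            .mapToDouble(e -> e.getValue() * vec2.get(e.getKey()))
            .sum();
        // Squared Euclidean norms over the stored (non-zero) entries.
        double norm1 = vec1.values().stream().mapToDouble(v -> v * v).sum();
        double norm2 = vec2.values().stream().mapToDouble(v -> v * v).sum();
        return dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    public static void main(String[] args) {
        // Dimensions 0 and 2 are shared; 1 and 5 appear in one vector only.
        Map<Integer, Double> a = new HashMap<>();
        a.put(0, 1.0);
        a.put(1, 2.0);
        a.put(2, 3.0);
        Map<Integer, Double> b = new HashMap<>();
        b.put(0, 4.0);
        b.put(2, 6.0);
        b.put(5, 1.0);
        System.out.println(cosineSim(a, b));
    }
}
```

As in the Scala version, an empty map gives a zero norm and therefore a NaN result, so callers may want to guard against that case.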
Hoosegow answered 18/11, 2016 at 22:25 Comment(0)
B
0
def cosineSimilarity(vectorA: Vector[Double], vectorB: Vector[Double]): Double = {
    var dotProduct = 0.0
    var normA = 0.0
    var normB = 0.0

    // Accumulate the dot product and the squared norms in a single pass.
    for (i <- vectorA.indices) {
        dotProduct += vectorA(i) * vectorB(i)
        normA += Math.pow(vectorA(i), 2)
        normB += Math.pow(vectorB(i), 2)
    }

    dotProduct / (Math.sqrt(normA) * Math.sqrt(normB))
}

def main(args: Array[String]): Unit = {
    val vectorA = Vector(1.0, 2.0, 3.0)
    val vectorB = Vector(4.0, 5.0, 6.0)
    println(cosineSimilarity(vectorA, vectorA)) // a vector compared with itself
    println(cosineSimilarity(vectorA, vectorB))
}

Scala version

Bracey answered 30/9, 2018 at 15:31 Comment(1)
This looks very Java-like tbh - Maher
