How can we measure the similarity (or distance) between categorical data?

Example:
Gender: Male, Female
Numerical values: [0 - 100], [200 - 300]
Strings: Professionals, beginners, etc.

Thanks in advance.

Weis answered 21/4, 2015 at 11:46 Comment(0)

There are different ways to do this. One of the simplest would be as follows.

1) Assign a numeric value to each property so that, where possible, the order of the numbers matches the meaning of the property. If a property can be measured, it is important to order its values from lower to higher. If that is not possible and the property is categorical (like gender, profession, etc.), just assign a number to each possible value.

P1 - Gender
-------------------
0 - Male
1 - Female

P2 - Experience
-----------
0 - Beginner
5 - Average
10 - Professional

P3 - Age
-----------
[0 - 100]

P4 - Body height, cm
-----------
[50 - 250]

2) For each property, find a scale factor and an offset so that all property values fall in the same chosen range, say [0 - 100]:

Sx = 100 / (Px max - Px min)
Ox = -Px min

In the sample provided you would get:

S1 = 100
O1 = 0

S2 = 10
O2 = 0

S3 = 1
O3 = 0

S4 = 0.5
O4 = -50

3) Now you can create a vector containing all the scaled property values. The offset is added before scaling, so that the minimum of each property maps to 0 and the maximum to 100:

V = (S1 * (P1 + O1), S2 * (P2 + O2), S3 * (P3 + O3), S4 * (P4 + O4))

In the sample provided it would be:

V = (100 * P1, 10 * P2, P3, 0.5 * (P4 - 50))

4) Now you can compare two vectors V1 and V2 by subtracting one from the other. The length of the resulting vector tells you how different they are.

delta = |V1 - V2|

Vectors are subtracted dimension by dimension. The length of a vector is the square root of the sum of its squared components.

Imagine we have 3 persons:

John
P1 = 0 (male)
P2 = 0 (beginner)
P3 = 20 (20 years old)
P4 = 190 (body height is 190 cm)

Kevin
P1 = 0 (male)
P2 = 10 (professional)
P3 = 25 (25 years old)
P4 = 186 (body height is 186 cm)

Lea
P1 = 1 (female)
P2 = 10 (professional)
P3 = 40 (40 years old)
P4 = 178 (body height is 178 cm)

Vectors would be:

J = (100 * 0, 10 * 0, 20, 0.5 * (190 - 50)) = (0, 0, 20, 70)
K = (100 * 0, 10 * 10, 25, 0.5 * (186 - 50)) = (0, 100, 25, 68)
L = (100 * 1, 10 * 10, 40, 0.5 * (178 - 50)) = (100, 100, 40, 64)

To determine how similar they are, we subtract the vectors and take the length of each difference:

delta JK = |J - K| =
= |(0 - 0, 0 - 100, 20 - 25, 70 - 68)| =
= |(0, -100, -5, 2)| =
= SQRT(0 ^ 2 + (-100) ^ 2 + (-5) ^ 2 + 2 ^ 2) =
= SQRT(10000 + 25 + 4) =
= 100.14

delta KL = |K - L| =
= |(0 - 100, 100 - 100, 25 - 40, 68 - 64)| =
= |(-100, 0, -15, 4)| =
= SQRT((-100) ^ 2 + 0 ^ 2 + (-15) ^ 2 + 4 ^ 2) =
= SQRT(10000 + 225 + 16) =
= 101.20

delta LJ = |L - J| =
= |(100 - 0, 100 - 0, 40 - 20, 64 - 70)| =
= |(100, 100, 20, -6)| =
= SQRT(100 ^ 2 + 100 ^ 2 + 20 ^ 2 + (-6) ^ 2) =
= SQRT(10000 + 10000 + 400 + 36) =
= 142.95

From this you can see that John and Kevin are the most similar pair, since their delta is the smallest of the three.
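For anyone who wants to reproduce the numbers, here is a minimal Python sketch of steps 2 to 4. The dictionary keys and the function names `scale` and `delta` are made up for illustration, and the sketch subtracts the minimum before scaling, i.e. S * (P - P min). A constant shift in one coordinate cancels out when two vectors are subtracted, so the distances are the same as in the hand calculation.

```python
import math

# Property ranges (min, max), taken from the tables in step 1.
# The key names are illustrative assumptions.
RANGES = {
    "gender": (0, 1),
    "experience": (0, 10),
    "age": (0, 100),
    "height_cm": (50, 250),
}


def scale(person, target=100.0):
    """Map every property onto [0, target] via S * (P - P_min)."""
    return [
        target / (hi - lo) * (person[name] - lo)
        for name, (lo, hi) in RANGES.items()
    ]


def delta(a, b):
    """Euclidean length of the difference between two scaled vectors."""
    return math.dist(scale(a), scale(b))


john = {"gender": 0, "experience": 0, "age": 20, "height_cm": 190}
kevin = {"gender": 0, "experience": 10, "age": 25, "height_cm": 186}
lea = {"gender": 1, "experience": 10, "age": 40, "height_cm": 178}

for label, a, b in [("JK", john, kevin), ("KL", kevin, lea), ("LJ", lea, john)]:
    print(label, round(delta(a, b), 2))
```

`math.dist` needs Python 3.8 or newer; on older versions the same value can be computed with `math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))`.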

Cavil answered 21/4, 2015 at 14:29 Comment(2)

I think the scale factor should be applied like this: Sx * (Px + Ox) – Francophile 20/2, 2018 at 16:44

This part is very wrong: "If it is not possible and property is categorical (like gender, profession, etc), just assign number to each possible value." If the property is nominal, assigning numeric values imposes an order and a weight on it that do not really exist. Better to use one-hot encoding. – Diencephalon 3/6, 2020 at 9:0
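To illustrate the one-hot suggestion in the last comment, here is a small Python sketch. The profession names and the helper `one_hot` are hypothetical, chosen only for the example. With integer codes, some pairs of categories end up farther apart than others purely because of the numbering; with one-hot vectors, every pair of distinct categories is the same distance apart.

```python
import math


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical nominal property with no natural order.
PROFESSIONS = ["teacher", "engineer", "nurse"]

# Integer coding: teacher=0, engineer=1, nurse=2 (arbitrary numbering).
int_codes = {name: float(i) for i, name in enumerate(PROFESSIONS)}
d_int_te = abs(int_codes["teacher"] - int_codes["engineer"])
d_int_tn = abs(int_codes["teacher"] - int_codes["nurse"])

# One-hot coding: each category gets its own dimension.
d_hot_te = math.dist(one_hot("teacher", PROFESSIONS), one_hot("engineer", PROFESSIONS))
d_hot_tn = math.dist(one_hot("teacher", PROFESSIONS), one_hot("nurse", PROFESSIONS))

print("integer codes:", d_int_te, d_int_tn)  # distances differ
print("one-hot:", d_hot_te, d_hot_tn)        # distances are equal
```

Note that one-hot columns should still be scaled relative to the numeric properties before they are mixed into a single distance, for the same reason as in step 2 of the answer.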

There are a number of measures for finding the similarity between categorical data. The following paper briefly discusses these measures.

https://conservancy.umn.edu/bitstream/handle/11299/215736/07-022.pdf?sequence=1&isAllowed=y

If you're trying to do this in R, there's a package named 'nomclust', which has all these similarity measures readily available.

Hope this helps!

Pall answered 24/2, 2019 at 8:24 Comment(2)

Is there a working link to this paper by any chance? – Acton 20/12, 2020 at 19:29

Any package in python? – Maurits 9/2, 2021 at 17:17

If you are using Python, there is a recent library that computes a proximity matrix based on similarity measures such as Eskin, overlap, IOF, OF, Lin, Lin1, etc. Once you have the proximity matrix, you can go on to cluster the data using Hierarchical Cluster Analysis.

Check this link to the library named "Categorical_similarity_measures": https://pypi.org/project/Categorical-similarity-measures/0.4/
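To give an idea of what such a proximity matrix looks like, here is a plain-Python sketch of the simplest of these measures, overlap. It does not use the library's API; the function names `overlap_similarity` and `proximity_matrix` are made up for illustration. It assumes the usual definition of overlap: the fraction of attributes on which two records have the same value, with dissimilarity taken as 1 minus that fraction.

```python
def overlap_similarity(a, b):
    """Fraction of attributes on which records `a` and `b` agree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def proximity_matrix(rows):
    """Pairwise dissimilarities, computed as 1 - overlap similarity."""
    return [[1.0 - overlap_similarity(r, s) for s in rows] for r in rows]


# Example records: (gender, experience), kept as raw category labels.
rows = [
    ("male", "beginner"),
    ("male", "professional"),
    ("female", "professional"),
]

matrix = proximity_matrix(rows)
for line in matrix:
    print(line)
```

The other measures (Eskin, IOF, OF, Lin) differ mainly in how they weight a match or mismatch by the frequency of the values involved, so they need the whole data set and not just the two records being compared.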

Malevolent answered 16/3, 2020 at 8:33 Comment(0)

Just a thought: we can also apply the Euclidean distance between two variables to find a drift value. If it is 0, there is no drift; otherwise, the value tells you how similar they are. Note that the vectors must be sorted and of the same length before the calculation.

Kiaochow answered 20/4, 2021 at 19:17 Comment(0)
