"Absolute" string metric
Asked Answered
A

8

6

I have a huge (but finite) set of natural language strings.

I need a way to convert each string to a numeric value. For any given string the value must be the same every time.

The more "different" two given strings are, the more different two corresponding values should be. The more "similar" they are, the less different values should be.

I do not know yet what exact definition of difference between strings I need. No natural language parsing anyway. It should probably be something Levenstein-like (but Levenstein is relative and I need absolute metric). Lets start with something simple.

Update on dimensions

I'll be happy to settle for a multidimensional (3d is best) vector instead of single numeric value.

Update on expected result correctness

As it was correctly noted here and here, the distance from one string to another is a vector with MAX(firstStringLength, secondStringLength) dimensions. In general it is not possible to reduce number of dimensions without some loss of information.

However I do not need an absolute solution. I would settle for any "good enough" conversion from N-dimensional strings space to my 3D space.

Note also that I have a finite number of strings of finite length. (Number of strings is rather large though, about 80 million (10 GB), so I'd better pick some single-pass state-less algorithm.)

From scanning references, I'm under impression that Hilbert space-filling curve may help me here. Looks like Analysis of the Clustering Properties of the Hilbert Space-Filling Curve article discusses something close to my problem...

Update on Hilbert curve approach

  1. We map each string to a point in a N-dimensional space, where N is the maximum length of a string in set. BTW, can i-th character code from a string be used as the i-th coordinate value here?
  2. We plot a Hilbert curve through that N-dimensional space.
  3. For each string we take point on the curve, closest to the coordinates of the string. Hilbert value of that point (the length from the beginning of curve) is the single-dimensional value I seek.
  4. If we need 3D value, we plot Hilbert curve in 3D and pick points, matching Hilbert values, calculated above.

Does this looks right? What would be the computational expenses here?

Acrospire answered 30/1, 2009 at 22:46 Comment(0)
E
5

I don't think this is possible to do. Start with a simple string, and assign it zero (it doesn't really matter what the number is)

  • "Hello World" = 0

The following strings are at distance 2 from it:

  • "XXllo World" = a
  • "HeXXo World" = b
  • "Hello XXrld" = c
  • "Hello WorXX" = d

Yet, each of these strings is 4 from each other. There is no way to sort the numbers to make it work, for the following instance:

a = 1, b = -1, c = 2, d = -2

Consider that c to 0 is 2, yet c to a is 1, yet 0 is closer than a.

And this is just a simple case.

Extravasate answered 30/1, 2009 at 23:11 Comment(0)
B
3

I think you are going to have to specify your problem more clearly, what exactly are you trying to achieve with this metric?

I say this, because Levenstein works since it maps pairs of strings to a metric, which can preserve the dimensionality of the string space. What happens if you try and map strings to numbers is that there is a large loss of dimensional information. For example, say I have the string "cat", I'd want "bat", "hat", "rat", "can", "cot" etc. to all be reasonably close to this. With a large number of words, the result is that you end up with dissimilar words being close relatively often e.g. "bat" and "cot" may be close, because they both happen to be similar distances from "cat" on the positive side. This is similar to the problem of what happens when you try and map the plane to a line, it is difficult to meet the restriction that points far away in the plane stay far away on the line. So, the upshot of this is that the 'The more "different" two given strings are, the more different two corresponding values should be' requirement is difficult.

So, my first suggestion is, do you really need something that does this, will a simple hash-code suffice to give you unique values, or perhaps you can use Levenstein after all and ignore the values for individual strings? If none of those suffice, perhaps you can use a multidimensional function value, that is map strings into pairs, triples or another small tuple of numbers. The extra dimensionality granted that way will give you far better results.

An example might be encoding the string as a triple: length, sum of values of letters in string, alternating sum of values of letters e.g. f("cat") = (3, 3 + 1 + 20, 3 - 1 + 20) = (3, 24, 22). This would have some of the properties you desire, but is probably not optimal. Try looking for orthogonal features of the string to do this encoding, or even better, if you have a large test set of strings there are existing libraries for mapping this sort of data into low dimensions while preserving metrics (e.g. the Levenstein metric) and you can train your function on that. I remember the S language had support for this sort of thing.

Bandaid answered 30/1, 2009 at 23:8 Comment(1)
Hmm. Looks like you're right. I'm trying to map a set of strings to a set of colors. So I'm all for multidimensional function value. See #496162Acrospire
G
3

So here I hope to show the fundamental problem and a solution.

The problem: you're correct to search for a 'good enough' solution as getting a perfect solution is impossible (I can show this in information theory, but I'll go for geometry as it's more readable). You have an N-dimensional space so distance metrics can't be projected without losing information:

distance projected onto X: (x,y,z).(1,0,0) = x

however, you can use vectors that take multiple dimensions into account but you then end up with elements far apart having the same distance:

(30,0,0).(1/3,1/3,1/3) = (0,30,0).(1/3,1/3,1/3) = (0,0,30).(1/3,1/3,1/3) = 10

So now for the solution: The best you can hope to do is cluster using Principle Component Analysis to find the three dimensions in which your strings differ the most. This again depends on the components of the distance metrics you use and is non-trivial (ie. I don't want to make this post even longer).

For a quick solution I suggest is you use the Levenstein distance from the 3 strings described below quickly attempts PCA in head:

"acegikmoqsuwy" //use half your permitted symbols then repeat until you have a string of size equal to your longest string.
"bdfhjlnprtv" //use the other half then repeat as above.
"" //The empty string, this will just give you the length of the string, so a cheap one.

Also, if you want to go deep, this on metrics/distances may help: http://www.springer.com/mathematics/geometry/book/978-3-642-00233-5

and levenstein distance demo: http://www.merriampark.com/ld.htm

Gloam answered 29/9, 2011 at 18:10 Comment(0)
U
2

I'd like to expand FryGuy's answer why it isn't gonna work in any fixed number of dimensions. Let's take aaaaaaaaaa and baaaaaaaaa, abaaaaaaaa, ... , aaaaaaaaab. In this example the strings are of length 10, but they can be of arbitrary length. The distance of each of the 10 b-strings from aaaaaaaaaa is 1, and their distance from each other is 2. In general, if you take fixed strings of length N over 2-letter alphabet, their distance graph is an N-dimensional hypercube.

There is no way you can map that into a fixed number of dimensions unless your strings are of limited length.

Unbraid answered 31/1, 2009 at 14:18 Comment(1)
Well, yes. But all I need is "good enough" mapping. (Of course my strings are of limited length -- there is even limited, but huge, number of strings themselves.)Acrospire
F
1

Measure the edit distance from the empty string, but instead of treating each edit as having the value "1", give it the index of the letter being added/removed in the alphabet sorted by use frequency (etaoinshrdlu...), and the difference between letter indices if your algorithm allows you to detect replacements as replacements rather than as insert+delete pairs.

Fumed answered 30/1, 2009 at 23:23 Comment(0)
A
1

You can also try to look into Latent Semantic Analysis and vector space models, with the problem that you have to limit the maximum string length.

Your dimensions are the product of the elements of your alphabet and the positions in the string. Given the alphabet ("a", "b", "c", "t") and a maximum length of 3, the dimensions are (a:1, b:1, c:1, t:1, ..., a:3, b:3, c:3, t:3)

As an example, "cat" becomes (0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1).

This is of course a huge dataset, but you can use dimensionality reduction techniques (like SVD) to cut down the number of dimensions. This should work well, because there are many recurring patterns in words. You can tweak the number of output dimensions to suit your needs.

The similarity between two words can be computed by cosine similarity between the word vectors. You can also store the transformation vectors of the SVD to get the reduced vector for words, even previously unseen ones.

Acclaim answered 31/1, 2009 at 16:42 Comment(0)
T
0

To get over the 'relative distance' problem, all you need to do is take a fixed point to measure from.

You could still use the Levenstein distance, but take it from a fixed "Origin" string. For example, you could use an arbitrary length string of all spaces as your origin string.

Anyway, I'd test it out first with a small subset of known strings to see if the values reflect what you would expect to see.

Tello answered 30/1, 2009 at 22:51 Comment(3)
This is unsound, you will almost certainly end up with two very dissimilar strings sharing the same distance from the fixed point. e.g. if your fixed point is the empty string, any pair of strings with the same length will share the same value.Bandaid
Hmmm. That's a good point, but it's also a problem with any solution due to the reduction in dimensionality. Okay... How about having a second (and/or third) (different) origin string and using that... The problem is the dimensionality of the solution is reduced hugely.Tello
(continued) There is no easily solution to this due to the mapping of a multi-dimensional value (the range of strings) to a 1-dimensional space (the number line)Tello
A
-1

This is an "off the top of the head" answer to the question.

Basically, this calculates the distance sentence 2 differs from sentence 1, as a Cartesian distance from sentence 1 (assumed to be at the origin), where the distances are the sum of the minimum Levenshtein difference between the word in the 2 sentences. It has the property that 2 equal sentences give a 0 distance.

If this approach has been published elsewhere, I'm unaware of it.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string str1 = "The cat sat on the mat";
            string str2 = "The quick brown fox jumped over the lazy cow";
            ReportDifference(str1, str1);
            ReportDifference(str2, str2);
            ReportDifference(str1, str2);
            ReportDifference(str2, str1);
        }
        /// <summary>
        /// Quick test andisplay routine
        /// </summary>
        /// <param name="str1">First sentence to test with</param>
        /// <param name="str2">Second sentence to test with</param>
        static void ReportDifference(string str1, string str2)
        {
            Debug.WriteLine(
                String.Format("difference between \"{0}\" and \"{1}\" is {2}", 
                str1, str2, Difference(str1, str2))); 
        }
        /// <summary>
        /// This does the hard work.
        /// Basically, what it does is:
        /// 1) Split the stings into tokens/words
        /// 2) Form a cartesian product of the 2 lists of words. 
        /// 3) Calculate the Levenshtein Distance between each word.
        /// 4) Group on the words from the first sentance
        /// 5) Get the min distance between the word in first sentence and all of the words from the second
        /// 6) Square the distances for each word. 
        ///     (based on the distance betwen 2 points is the sqrt of the sum of the x,y,... axises distances
        ///     what this assumes is the first word is the origin)
        /// 7) take the square root of sum
        /// </summary>
        /// <param name="str1">sentence 1 compare</param>
        /// <param name="str2">sentence 2 compare</param>
        /// <returns>distance calculated</returns>
        static double Difference(string str1, string str2)
        {
            string[] splitters = { " " };

            var a = Math.Sqrt(
                (from x in str1.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                     from y in str2.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                     select new {x, y, ld = Distance.LD(x,y)} )
                    .GroupBy(x => x.x)
                    .Select(q => new { q.Key, min_match = q.Min(p => p.ld) })
                    .Sum(s =>  (double)(s.min_match * s.min_match )));
            return a;
        }
    }

    /// <summary>
    /// Lifted from http://www.merriampark.com/ldcsharp.htm
    /// </summary>
    public class Distance
    {

        /// <summary>
        /// Compute Levenshtein distance
        /// </summary>
        /// <param name="s">String 1</param>
        /// <param name="t">String 2</param>
        /// <returns>Distance between the two strings.
        /// The larger the number, the bigger the difference.
        /// </returns>
        public static int LD(string s, string t)
        {
            int n = s.Length; //length of s
            int m = t.Length; //length of t
            int[,] d = new int[n + 1, m + 1]; // matrix
            int cost; // cost
            // Step 1
            if (n == 0) return m;
            if (m == 0) return n;
            // Step 2
            for (int i = 0; i <= n; d[i, 0] = i++) ;
            for (int j = 0; j <= m; d[0, j] = j++) ;
            // Step 3
            for (int i = 1; i <= n; i++)
            {
                //Step 4
                for (int j = 1; j <= m; j++)
                {
                    // Step 5
                    cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
                    // Step 6
                    d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                              d[i - 1, j - 1] + cost);
                }
            }
            // Step 7
            return d[n, m];
        }
    }
}
Amato answered 31/1, 2009 at 3:48 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.