fuzzy DISTINCT Values

Asked 28/8, 2012 at 18:36 Answered 28/8, 2012 at 22:8

I have a database of real estate listings and need to return a list of neighborhoods. Right now I am using mysql DISTINCT which returns all of the distinct values. My probelm is that there is a lot of neighborhoods that have similar names: example:

Park View Sub 1
Park View
Park View Sub 2
Park View Sub 3
Great Lake Sub 1
Great Lake Sub 2
Great Lake 
Great Lake Sub 3

I am looking for an easy php or mysql solution that would recognize that "Park View" and "Great Lake" already exists and ONLY return "Park View" and "Great Lake".

My initial thought is to some how get the sort order by length so that the short values are at the top and then loop through using strstr. This sound like a large task I am wondering if there is a function either in mysql or php that would easily do this.

Cockadoodledoo answered 28/8, 2012 at 18:36 Comment(7)

could you add the output needed to your question for better understanding..? – Olympia 28/8, 2012 at 18:40

Is "Sub X" the only string that will be on the end, or is that text variable? – Waaf 28/8, 2012 at 18:44

@sshekhar: "ONLY return "Park View" and "Great Lake"." - that is the expected output. – Turn 28/8, 2012 at 18:44

thank you Travesty3. In regards to Sub x - no. That is just example. That could be anything like sub, flg, unit, bldg, etc. – Cockadoodledoo 28/8, 2012 at 18:48

@Cockadoodledoo How will you know what is and is not relevant string text, then? Restated, how should the solution determine what part of the text is important and what part is not? Is there an absolute list of "addon" text? Is there a character limit? I'm just not understanding how your code should determine that in "Park View Sub", "Sub" is not relevant, but in "Yellow Sub", "sub" should stay put. – Waaf 28/8, 2012 at 18:54

@Chris: +1 for using "Yellow Sub" as an example. Your point is also very relevant, but the +1 is particularly for the Beatles reference. – Turn 28/8, 2012 at 18:57

@user982853: this can be accomplished in MySQL. I expect that you will want to use the values that you return to be used in a predicate (WHERE clause) of a subsequent query. See my answer; I'd be happy to provide explanation of how this works. – Hartebeest 29/8, 2012 at 14:41

Here are some things you can try; presumably you're looking for both exact matches and close matches.

First look for an exact match. Then look for a LIKE match on the REVERSED name. Then look for the match with the fewest extra characters.

Here's a query that will do all that. Note that you will need to store the reversed place name in an indexed column if you want this to be efficient.

select name 
  from (
   select name, 0 ordinal
     from place 
    where name = 'Park View'
  union
  select name, 1 ordinal
    from place 
   where Reverse(Name) like concat(Reverse('Park View'),'%')
  union
  select name, 2+length(name)
    from place
   where name like concat('Park View','%')
 ) a 
order by ordinal
   limit 1

Notice how this UNION query uses ordinal to figure out the best match.

Check it out here: http://sqlfiddle.com/#!2/76a97/9/0

Spelaean answered 28/8, 2012 at 18:54 Comment(1)

It returns only park view bt it should return green lake too as that is also a distinct value. – Olympia 28/8, 2012 at 19:1

If you always have an entry without the 'Sub #' part, you could do something like this:

SELECT DISTINCT neighborhood FROM table WHERE neighborhood NOT LIKE '% Sub %';

To sort by string length:

SELECT DISTINCT neighborhood FROM table ORDER BY LENGTH(neighborhood);

Turn answered 28/8, 2012 at 18:41 Comment(1)

The only thing wrong with excluding Sub is that in the event "Park View Sub 1" is the only neighborhood, i want it to return that one. The only time i want them to exclude is if there is already a neighborhood containing it. – Cockadoodledoo 28/8, 2012 at 18:50

You can use PHP's similar_text to get a simple solution implemented. If you pre-sort your data so that the shorter, desired, addresses are first it should work well. Also, if "different" addresses aren't too similar, it will work better (but you can always up the threshold):

// if an address is 70% (or more) similar to another, it is not unique
$threshold = 70;

// list of addresses (and sorting them); this is done through the DB in your code
$addresses = array('Park View Sub 1', 'Park View', 'Park View Sub 2', 'Park View Sub 3', 'Great Lake Sub 1', 'Great Lake Sub 2', 'Great Lake', 'Great Lake Sub 3');
sort($addresses);

$unique = array();
foreach ($addresses as $address) {
    $isUnique = true;
    foreach ($unique as $u) {
        // get the similarity between the current address and each unique address
        similar_text($address, $u, $percent);
        if ($percent > $threshold) {
            // not unique; drop it
            $isUnique = false;
            break;
        }
    }
    if ($isUnique) $unique[] = $address;
}

For other alternatives, you can also look into PHP's levenshtein and soundex, as well as MySQL's SOUNDEX().

Another, pseudo-fuzzy method is to have the addresses sorted alphabetically (either through MySQL or PHP) and loop through them one-by-one; if the current address begins-with the text of a unique-address already found, drop it. This works quite similarly to using an actual fuzzy-method, but it's more straight-to-the-point:

// list of addresses (and sorting them); this is done through the DB in your code
$addresses = array('Park View Sub 1', 'Park View', 'Park View Sub 2', 'Park View Sub 3', 'Great Lake Sub 1', 'Great Lake Sub 2', 'Great Lake', 'Great Lake Sub 3');
sort($addresses);

$unique = array();
foreach ($addresses as $address) {
    $isUnique = true;
    foreach ($unique as $u) {
        if (substr($address, 0, strlen($u)) == $u) {
            $isUnique = false;
            break;
        }
    }
    if ($isUnique) $unique[] = $address;
}

This method will only work if they are sorted, as the shorter-address Park View would need to be found before Park View Sub 1. If your addresses are too similar to one another and the above similar_text method drops one too many, you can try this latter function as it's more strict.

Patton answered 28/8, 2012 at 18:51 Comment(0)

The example query below will get you the specified result set using MySQL, but it doesn't really do "fuzzy matching", at least, that's not how I would describe the algorithm. (This implements the algorithm you describe -- sorting by values, and then checking each value to see if the leading portion "matches" a previously retrieved value.)

This finds an "exact match" of the leading portion of the neighborhood value against the value from previously retrieved rows, there's not really any "fuzziness" about the match.

When the query encounters a value that is "unmatched", it marks that value is "unmatched". For the next value retrieved, it checks whether that value starts with the previously "unmatched" value; if the leading portion of the string is an exact match, the value is discarded. Otherwise, the value is marked as an "unmatched" value, and is kept.

This approach uses inline views (or "derived tables" as MySQL refers to them). The innermost inline view (aliased as s) gets us a sorted list of distinct values for neighborhood. The "trick" (if you want to call it that) is in the next inline view (aliased as "t") where we make use of MySQL user variables to reference a previously retrieved value.

To avoid any issues with "special characters", we do an equality comparison on the leading characters.

Here's the whole query:

SELECT t.neighborhood
  FROM (
         SELECT IF(IFNULL(LEFT(s.neighborhood,CHAR_LENGTH(@match)) <> @match,1),@match := s.neighborhood,NULL) AS neighborhood
           FROM (SELECT RTRIM(neighborhood) AS neighborhood
                   FROM mytable
                   JOIN (SELECT @match := NULL) r
                  GROUP BY neighborhood
                  ORDER BY neighborhood
                ) s
       ) t
 WHERE t.neighborhood IS NOT NULL

It's all really pretty straightforward, except for the initialization of the @match variable, and the expression that performs the comparison of the current value to a previous value.

If we aren't concerned with the corner cases introduced by special characters in the values, we can use a simpler LIKE or REGEXP to do the comparison:

s.neighborhood NOT LIKE CONCAT(@match,'%')

s.neighborhood NOT REGEXP CONCAT('^',@match)

The LIKE operator is subject to the underscore and percent characters, the REGEXP is subject to special characters used in regular expressions. To avoid those issue, the query above uses a comparison that is a bit more unwieldy looking:

LEFT(s.neighborhood,CHAR_LENGTH(@match)) <> @match

What that's doing is taking the previous value (e.g. @match := 'Park View') and comparing that to the leading portion (up to the length of 'Park View') of the next value, do determine whether it's a match.

One benefit of the approach with this query is that the values returned are guaranteed tp "match" in a predicate in a subsequent query. Say you're using this query to get a list of neighborhoods, and the user has selected one. This is going to return a set of values that will "match" to each and every row.

A subsequent query can use any of the returned values in a simple predicate (WHERE clause) to return rows that match. For example, if the user has selected the value 'Great Lake':

SELECT t.*
  FROM mytable t
 WHERE LEFT(t.neighborhood,CHAR_LENGTH('Great Lake') = 'Great Lake'

In the case where we used a LIKE or REGEXP predicate to match, we'd want to use the corresponding match in the predicate of the subsequent query:

SELECT t.*
  FROM mytable t
 WHERE t.neighborhood LIKE CONCAT('Great Lake','%')

SELECT t.*
  FROM mytable t
 WHERE t.neighborhood REGEXP CONCAT('^','Great Lake')

Hartebeest answered 28/8, 2012 at 22:8 Comment(0)

Recommended topics

Hot tags