How to group by column having spelling mistakes
Asked Answered
P

2

5

While working with some legacy data, I want to group the data on a column ignoring spelling mistakes. I think SOUNDEX() could do the job to achieve the desired result. Here is what I tried:

SELECT soundex(AREA)
FROM MASTER
GROUP BY soundex(AREA)
ORDER BY soundex(AREA)

But (obviously) the SOUNDEX returned 4-character code in result rows like this, loosing actual strings:

A131
A200
A236

How could I include at least one occurrence from the group into the query result instead of 4-character code.

Phenomenal answered 8/5, 2013 at 14:38 Comment(2)
using MIN (or MAX) AREA would do the trick as answered by flaschenpost.Lucania
+1 For a clear question and teaching me about a function I didn't know existed.Settera
G
5
SELECT soundex(AREA) as snd_AREA, min(AREA) as AREA_EXAMPLE_1, max(AREA) as AREA_EXAMPLE_2
from MASTER
group by soundex(AREA)
order by AREA_EXAMPLE_1
;

In MySQL you could select group_concat(distinct AREA) as list_area to get all the versions, and I don't know about that in SQL-Server, but min and max give two examples of the areas, and you wanted to discard the diffs anyway.

Gordie answered 8/5, 2013 at 14:43 Comment(3)
It worked but with astonishing results. For e.g., the soundex() is grouping the NORTH NAZ and NORTH KAR, P.I.B. and P.E.C.H, SHAH FAISAL and S.I.T.E as same. ??? !!!!Phenomenal
Maybe you could control the soundex- function in SQL-Server. I only have access to MySQL and there I get soundex("NORTH NAZ") = N6352, soundex("NORTH KAR" = N6326. Or maybe there are even different soundex-functions availiable.Gordie
I tried removing spaces from the string (NORTHNAZ and NORTHKAR), but actually the SOUNDEX() is matching too loosely. For example, following strings produce same SOUNDEX codes; <<< select SOUNDEX(col1), col1 from ( select 'GULBAHAR' as col1 union ALL select 'GULBERG' ) t >>> ** <<< select SOUNDEX(col1), col1 from ( select 'GULSHANEIQBAL' as col1 union ALL select 'GULSHANEMAYMAR' ) t >>> ** Any idea, or I have to create new thread for this another issue?Phenomenal
W
5

You could also use row_number() to get one row for each soundex(area) value:

select AREA, snd
from
(
  select AREA, soundex(AREA) snd,
    row_number() over(partition by soundex(AREA)
                      order by soundex(AREA)) rn
  from master
) x
where rn = 1

See SQL Fiddle with Demo

Wight answered 8/5, 2013 at 14:46 Comment(2)
Your solution wouldn't include any occurrence from the group. I didn't want row number.Phenomenal
@Ali.NET You requested to receive one value from each group which this does. Did you see the demo?Wight

© 2022 - 2024 — McMap. All rights reserved.