Merging duplicated records together with "Merge" syntax
Asked Answered
K

9

14

I am using SQL Server 2014. I am currently trying to combine millions of personnel application records in to a single personnel record.

The records contain the following columns:

ID, First_Name, Last_Name, DOB, Post_Code, Mobile, Email

A person can enter their details numerous times but due to fat fingers or fraud they can sometimes put in, incorrect details.

In my example Christopher has filled his details in 5 times, First_Name, Last_Name, DOB are always correct, Post_Code, Mobile and Email contain various connotations.

What I want to do is take the min(id) associated with this group in this case 84015283 and put it in to a new table, this will be the primary key and then you will see the other id's that are associated with it.

Examples

NID       CID
------------------
84015283  84015283
84015283  84069198
84015283  84070263
84015283  84369603
84015283  85061159

Where it gets a little complicated is, where 2 different people can have the same First_Name, Last_Name and DOB, at least one of the other fields must match "post_code, mobile or email" as per my example to another record within the group.

Though first_name, last_name, DoB match between ID's 84015283, 84069198, 84070263. 84015283, 84069198 are identical so they would match without an issue, 84070263 matches on the postcode, 84369603 matches on the mobile to a previous record and 85061159 matches on a previous mobile/email but not post_code.

If putting the NID within the original dataset is easier I can go with this rather than putting it all in a separate table.

After some googling and trying to get my head around this, I believe that using "Merge" might be a good way to achieve what I am after but I am concerned it will take a very long time due to the number of records involved.

Also going forward any routine would have to be run on subsequent new records.

I have listed the code for the example if anyone can help

DROP TABLE customer_dist

CREATE TABLE [dbo].customer_dist
(
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL,
)

INSERT INTO customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
       ('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
       ('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', '[email protected]'),
       ('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', '[email protected]'),
       ('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', '[email protected]'),
       ('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', '[email protected]')

SELECT * FROM customer_dist

Below is the expected results, sorry I should of made it clearer what I wanted at the end.

Output Table Results

    NID         id          First_Name  Last_Name   DoB         post_code   mobile          Email
    84015283    84015283    Christopher Higg            1/13/1956   CH2 3AZ         7089559829  [email protected]
    84015283    84069198    Christopher Higg            1/13/1956   CH2 3AZ         7089559829  [email protected]
    84015283    84070263    Christopher Higg            1/13/1956   CH2 3AZ         7089559822  [email protected]
    84015283    84369603    Christopher Higg            1/13/1956   CH2 3ZA         7089559829  [email protected]
    84015283    85061159    CHRISTOPHER Higg            1/13/1956   CH2 3RA         7089559829  [email protected]
    78065122    87065122    Matthew Davis               05/10/1978  CH5 1TS

7077084692 [email protected]

OR                          

NID         id
84015283    84015283
84015283    84069198
84015283    84070263
84015283    84369603
84015283    85061159
87065122    87065122

Apologies for the slow response.

I have updated my required output, I was asked to include an extra record that was not a match to the other records but did not include this in my required output.

HABO's response was the closest to what was needed unfortunately on further testing with other sample data, duplicates were created and the logic broke down. Other Sample data would be :-

declare @customer_dist as Table (
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL );


INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('32006455', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07706212920',  '[email protected]'),
       ('35963960', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863324',  '[email protected]'),
       ('38627975', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863478',  '[email protected]'),
       ('46653041', 'Mary', 'WILSON',   '1983-09-20',   'BT62JA',   '07483888179',  '[email protected]'),
       ('48023677', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07483888179',  '[email protected]'),
       ('49560434', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('49861032', 'Mary', 'WILSON',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('53130969', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('33843283', 'Mary', 'Wilson',   '1983-09-20',   'BT148HU',  '07484863478',  '[email protected]'),
       ('38627975', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863478',  '[email protected]')

SELECT * FROM @customer_dist;
Kerwinn answered 22/8, 2018 at 14:20 Comment(14)
You did a great job posting data and some details but I have no idea what you expect as output. What should be the output from your sample data?Passim
Hi Sean, thanks for the reply and apologies that I didn't make clear what I wanted at the end, hope the further details helpKerwinn
Definitely makes it more clear. It would also help to have at least one example that is not the same person but I think I understand the requirements well enough.Passim
Hi, Is this Ongoing process or one time process.Rescind
Hi Sean, I have added in another new record as requested, hope that helps.Kerwinn
This will be an ongoing process once the initial data has been corrected.Kerwinn
How do you define multiple ID is actually the same person? if any of first_name, last_name, OR dob is the same?Footstep
Yes and as long as 1 other piece of information matches either postcode, phone or email. If your second record matches on e.g. mobile number to the first record then its a match, then your 3 record wont match on the phone but will match on the email of your original record. Your 4th record might match on details in your second record, but as your second record matched your first you still know that this is all the same person.Kerwinn
It might be clearer if you can define it in form of decision tree so we and especially you, can make this query using self join and sub query join.Footstep
If you could add to the sample data some more rows to demonstrate different people (if I understand the question, there can be another Christopher Higg born on the same date with with all other details different the the current Christopher Higg [where all rows belong to the same person]) that would helpNonsense
don't concentrate on "Merge".Merge may or may not be good idea,the way your data is.I will suggest to throw few more sample data and update your output accordingly.Dysfunction
Have you looked at SQL Server Master Data Services, allows you to define these sorts of matching algorithms and create an action plan.Acidify
You will run into problems with Merge if there exists more than one record with the Key(s) that you are merging on.Unholy
78065122 87065122 Matthew Davis?Guanine
L
1

I used to work in a very old school insurance company where they had similar issue with data.

My main attempt here is to narrow the result set with duplicates in it, kind of finding what ties dups together. Once you get this, the rest of the solutions comes really fast.

Logic is: join table to itself base on columns that share the same values (Fname, Lname, DOB) and occasionally same values (post_code, mobile, email) and more importantly ids should NOT match, this makes sure to exclude non-dup records and leave only dups.

After you have dups only, find MIN(id) and put it inside cte, join to the original table and there you are. Non-dup records don't need min-id because it's id is min-id.

;WITH DUPS AS
(
SELECT DISTINCT
    MIN(C1.ID) OVER(PARTITION BY C1.First_Name, 
C1.Last_Name, C1.DoB) AS minid,
    C1.id, C1.First_Name, C1.Last_Name, C1.DoB
FROM customer_dist c1 
INNER join customer_dist c2 
ON
c1.First_Name = c2.First_Name
AND c1.Last_Name = c2.Last_Name
AND c1.DoB = c2.DoB
AND (c1.post_code = c2.post_code OR c1.mobile = c2.mobile 
OR 
c1.Email = c2.Email)
AND C1.ID <> C2.ID
)

SELECT ISNULL(D.minid, C.ID) AS NID,
        C.*
FROM customer_dist C
LEFT JOIN DUPS D ON C.id = D.id
Linnette answered 23/12, 2018 at 4:23 Comment(0)
B
0

This is not an answer but a comment that is too long to fit in the comments section.

Since the "equality" condition is complex, I think I would do it in phases:

  1. Create the "buckets" of similar customers. A bucket identifies all customer with identical id, first_name, last_name, and dob. Add an index on the new "key" column for faster grouping. A bucket may contain one or more real customers.

    select
        cast(id as varchar(10)) +
        lower(first_name) + 
        lower(last_name) + 
        convert(varchar, dob, 23) as k,
        id, post_code, mobile, email
        into bucket
      from customer_dist;
    
    create index ix1 on bucket(k);
    
  2. Work on each bucket and separate the customers on each one. Most likely there's a single one, but can be multiple in it.

Here you'll need to run some iterative algorithm to compare rows, mark them as equal groups or different one, and eventually consolidate groups into single ones. All of this is possible, but I'm afraid I don't see how to do it simply in SQL.

You'll need to do some coding here.

Baluster answered 22/8, 2018 at 15:5 Comment(0)
G
0

The following example uses a CTE to pair rows (by joining the table with itself) that have matching column values as per the requirements. In each pair the "left" row precedes the "right" in Id order, hence avoiding duplicate results which differ only in having swapped Id values.

The results of the CTE are then combined with an extra row for each group of matching rows to provide the curious extra row that matches itself, i.e. NId = Id.

-- Sample data.
declare @customer_dist as Table (
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL );

INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
       ('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
       ('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', '[email protected]'),
       ('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', '[email protected]'),
       ('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', '[email protected]'),
       ('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', '[email protected]');

SELECT * FROM @customer_dist;

-- Process the data.
with PairedRows as (
  -- Pairs of rows where the "left" row precedes the "right" in   Id   order and the rows match per the stated requirements.
  select CDL.id as NId, CDR.id as Id
    from @customer_dist as CDL inner join
      @customer_dist as CDR on
        -- Pair rows where the "left" row precedes the "right" in   Id   order.
        CDR.Id > CDL.Id and
        -- "Must match" columns.
        CDR.First_Name = CDL.First_Name and CDR.Last_Name = CDL.Last_Name and CDR.DoB = CDL.DoB and
        -- Plus at least one optional match.
        ( CDR.post_code = CDL.post_code or CDR.mobile = CDL.mobile or CDR.Email = CDL.Email )
    -- Where there is not a prior row (in   Id   order) that matches the "left" row.
    where not exists (
      select 42 from @customer_dist as NE where NE.ID < CDL.Id and 
        NE.First_Name = CDL.First_Name and NE.Last_Name = CDL.Last_Name and NE.DoB = CDL.DoB and
        ( NE.post_code = CDL.post_code or NE.mobile = CDL.mobile or NE.Email = CDL.Email ) ) )
  select NId, Id -- The paired rows.
    from PairedRows
  union all
  -- Add the   NId   row as a match to itself for every group of paired rows.
  select Min( NId ) as NID, Min( NId ) as Id
    from PairedRows
    group by NId
  order by NID, Id;

Chasing the dancing question section.

The following adds anyone not in a pair to the output through another union all:

-- Process the data.
with PairedRows as ( -- Pairs of rows where the "left" row precedes the "right" in   Id   order and the rows match per the stated requirements.
  select CDL.id as NId, CDR.id as Id
    from @customer_dist as CDL inner join
      @customer_dist as CDR on CDR.Id > CDL.Id and -- Pair rows where the "left" row precedes the "right" in   Id   order.
        CDR.First_Name = CDL.First_Name and CDR.Last_Name = CDL.Last_Name and CDR.DoB = CDL.DoB and -- "Must match" columns.
        ( CDR.post_code = CDL.post_code or CDR.mobile = CDL.mobile or CDR.Email = CDL.Email ) -- Plus at least one optional match.
    where not exists ( -- Where there is not a ...
      select 42 from @customer_dist as NE where NE.ID < CDL.Id and -- ... prior row (in   Id   order) that matches the "left" row.
        NE.First_Name = CDL.First_Name and NE.Last_Name = CDL.Last_Name and NE.DoB = CDL.DoB and
        ( NE.post_code = CDL.post_code or NE.mobile = CDL.mobile or NE.Email = CDL.Email ) ) )
  select NId, Id -- The paired rows.
    from PairedRows
  union all
  select Min( NId ) as NID, Min( NId ) as Id -- Add the   NId   row as a match to itself for every group of paired rows.
    from PairedRows
    group by NId
  union all
  select id, id -- Toss in anyone we haven't heard of.
    from @customer_dist as CD
    where not exists ( select 42 from PairedRows as PR where PR.NId = CD.id or PR.Id = CD.id )
  order by NID, Id;

And yet one more mashup to display the reason for each output row:

-- Sample data.
declare @customer_dist as Table (
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL );

INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('32006455', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07706212920',  '[email protected]'),
       ('35963960', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863324',  '[email protected]'),
       ('38627975', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863478',  '[email protected]'),
       ('46653041', 'Mary', 'WILSON',   '1983-09-20',   'BT62JA',   '07483888179',  '[email protected]'),
       ('48023677', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07483888179',  '[email protected]'),
       ('49560434', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('49861032', 'Mary', 'WILSON',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('53130969', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07849727199',  '[email protected]'),
       ('33843283', 'Mary', 'Wilson',   '1983-09-20',   'BT148HU',  '07484863478',  '[email protected]'),
       -- NB: Unique   Id   in the following row.
       ('386279750', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863478',  '[email protected]');

INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
       ('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]'),
       ('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', '[email protected]'),
       ('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', '[email protected]'),
       ('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', '[email protected]'),
       ('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', '[email protected]');

SELECT * FROM @customer_dist;
select ( select Count(*) from @customer_dist ) as TotalRows, ( select Count( distinct id ) from @customer_dist ) as DistinctIds;

-- Process the data.
with PairedRows as ( -- Pairs of rows where the "left" row precedes the "right" in   Id   order and the rows match per the stated requirements.
  select CDL.id as NId, CDR.id as Id
    from @customer_dist as CDL inner join
      @customer_dist as CDR on CDR.Id > CDL.Id and -- Pair rows where the "left" row precedes the "right" in   Id   order.
        CDR.First_Name = CDL.First_Name and CDR.Last_Name = CDL.Last_Name and CDR.DoB = CDL.DoB and -- "Must match" columns.
        ( CDR.post_code = CDL.post_code or CDR.mobile = CDL.mobile or CDR.Email = CDL.Email ) -- Plus at least one optional match.
    where not exists ( -- Where there is not a ...
      select 42 from @customer_dist as NE where NE.ID < CDL.Id and -- ... prior row (in   Id   order) that matches the "left" row.
        NE.First_Name = CDL.First_Name and NE.Last_Name = CDL.Last_Name and NE.DoB = CDL.DoB and
        ( NE.post_code = CDL.post_code or NE.mobile = CDL.mobile or NE.Email = CDL.Email ) ) ),
  Results as (
    select NId, Id, 'Paired' as Reason -- The paired rows.
      from PairedRows
    union all
    select Min( NId ) as NID, Min( NId ) as Id, 'Self' -- Add the   NId   row as a match to itself for every group of paired rows.
      from PairedRows
      group by NId
    union all
    select id, id, 'Other' -- Toss in anyone we haven't heard of.
      from @customer_dist as CD
      where not exists ( select 42 from PairedRows as PR where PR.NId = CD.id or PR.Id = CD.id ) )
  select R.NId, R.Id, R.Reason,
    CDL.First_Name, CDL.Last_Name,
    case when CDL.DoB = CDR.DoB then '=' else '' end as MatchDoB, -- Must match.
    case when CDL.post_code = CDR.post_code then '=' else '' end as MatchPostCode,
    case when CDL.mobile = CDR.mobile then '=' else '' end as MatchMobile,
    case when CDL.Email = CDR.Email then '=' else '' end as MatchEmail,
    case when CDL.id = CDR.id then '==' else '' end as MatchSelf,
    case when ( select Count(*) from Results as IR where IR.NId = R.NId and IR.Id = R.Id ) > 1 then '#' else '' end as 'Duplicate'
    from Results as R inner join
      @customer_dist as CDL on CDL.id = R.NId inner join
      @customer_dist as CDR on CDR.id = R.Id
    order by NID, Id;
Guanine answered 2/9, 2018 at 1:5 Comment(4)
Interesting process to achieve results, as per the edited question, though I added and extra person "Matthew Davis" in to the table, I did not show that this record also needs to be outputted. Also, is it possible to compare the incoming data to a second table, either in line or later, please see edited question.Kerwinn
@Kerwinn I've added another query to the answer that picks up any stragglers who aren't paired with anyone else. As for the new question you've posted in your comment, post a new question. That's the way SO works. Please include your work in the new question with a clear explanation of what isn't working.Guanine
Sorry again for the slow reply, with the initial test data the outcome works correctly but upon further testing the output is producing duplicate records. I have amended my initial question with new test data. This code duplicates with an output of NId 33843283 Id 38627975, inserted are 10 rows but output is 12. Also ID 33843283 should have a NID of 32006455 and ID 38627975 should have a NID of 32006455 as they match on mobile to line 3 from the initial load.Kerwinn
The new test data has two rows with the same Id (38627975). Granted, you didn't say that Id was unique, it was just my assumption. Since you want to reduce the data to pairs of Id values it will be interesting to do anything meaningful with the results if a single Id value corresponds to multiple unrelated rows. I've added a Wall of Code to my answer to allow you to continue your investigations.Guanine
A
0

Try this (necessary comments are in code):

;with cte as (
    SELECT 1 n, 84015283 CID, * FROM @tbl
    where id = 84015283
    union all 
    select c.n + 1, 84015283, t.* from cte c
    join @tbl t on
        c.First_Name = t.first_name and
        c.Last_Name = t.Last_name and
        c.DoB = t.DoB and (
        c.post_code = t.post_code or
        c.mobile = t.mobile or
        c.Email = t.Email 
        ) and
        --there is no way of writing stop condition here,
        --as joining will return in some rows every time,
        --so you have to enter here number big enough for
        --query to join all records, here 1 suffices
        --(if you enter bigger number, result will stay the same
        --due to distinct in select)
        c.n <= 1
)

select distinct CID, 
                id NID, 
                First_Name, 
                Last_Name, 
                DoB, 
                post_code, 
                mobile, 
                Email 
from cte

Alternate approach is using while loop:

declare @tempTable table
(
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL
);
insert into @tempTable
select *
from @customer_dist

declare @inserted int = -1;
while @inserted <> (select count(*) from @tempTable)
begin
    select @inserted = count(*) from @tempTable
    insert into @tempTable
    select c.* from @customer_dist c
    where exists(select 1 from @tempTable t
                 where c.First_Name = t.first_name and
                       c.Last_Name = t.Last_name and
                       c.DoB = t.DoB and (
                       c.post_code = t.post_code or
                       c.mobile = t.mobile or
                       c.Email = t.Email 
                       )
                 )
    except
    select * from @tempTable
end

select MAX(NID) over (partition by first_name,last_name) NID,
       id, First_Name, Last_Name, DoB, post_code, mobile, Email
from (
    select (case when ROW_NUMBER() over (partition by first_name,last_name order by (select null)) = 1 then 1 else 0 end) * id NID,
           *
    from @tempTable
) a

select * from @tempTable

It loops as long as there are new records added to @tempTable. With your sample data it loops just once.

The difference to previous query is that at every step of a loop it will take only new records thanks to except, which cannot be used in CTE.

Also it performs better, because it uses exists to determine which rows still to add. In CTE it is not allowed, since CTE cannot appear in subqueries.

And, most of all, it will guarantee you that you won't miss any records! In cte you had to constrain it with c.n < 1 and that could be a risk of loosing records.

Archdeaconry answered 2/9, 2018 at 19:13 Comment(4)
Works as expected but it should also be outputting the Matthew Davis record, unfortunately, I put this in to the table but did not show this as a required output.Kerwinn
@Kerwinn Why? It doesn't align with your logic.Silk
As there could be millions of records, having 84015283 hard coded wouldn't work when I come to add more people to the process, e.g. Matthew Davis does not show in the final output. The logic appears correct but would not work for more than 1 group of people at a time.Kerwinn
@Kerwinn I updated second solution to match your needs as I think it is more reliable method and it's more efficient (should be :) ).Silk
L
0

[dbo].[LEVENSHTEIN]

CREATE FUNCTION [dbo].[LEVENSHTEIN](@left  VARCHAR(100),
                                @right VARCHAR(100))
RETURNS INT
AS
  BEGIN
      DECLARE @difference    INT,
              @lenRight      INT,
              @lenLeft       INT,
              @leftIndex     INT,
              @rightIndex    INT,
              @left_char     CHAR(1),
              @right_char    CHAR(1),
              @compareLength INT

      SET @lenLeft = LEN(@left)
      SET @lenRight = LEN(@right)
      SET @difference = 0

      IF @lenLeft = 0
        BEGIN
            SET @difference = @lenRight

            GOTO done
        END

      IF @lenRight = 0
        BEGIN
            SET @difference = @lenLeft

            GOTO done
        END

      GOTO comparison

      COMPARISON:

      IF ( @lenLeft >= @lenRight )
        SET @compareLength = @lenLeft
      ELSE
        SET @compareLength = @lenRight

      SET @rightIndex = 1
      SET @leftIndex = 1

      WHILE @leftIndex <= @compareLength
        BEGIN
            SET @left_char = SUBSTRING(@left, @leftIndex, 1)
            SET @right_char = SUBSTRING(@right, @rightIndex, 1)

            IF @left_char <> @right_char
              BEGIN -- Would an insertion make them re-align?
                  IF( @left_char = SUBSTRING(@right, @rightIndex + 1, 1) )
                    SET @rightIndex = @rightIndex + 1
                  -- Would an deletion make them re-align?
                  ELSE IF( SUBSTRING(@left, @leftIndex + 1, 1) = @right_char )
                    SET @leftIndex = @leftIndex + 1

                  SET @difference = @difference + 1
              END

            SET @leftIndex = @leftIndex + 1
            SET @rightIndex = @rightIndex + 1
        END

      GOTO done

      DONE:

          RETURN @difference
      END

    GO

[dbo].[GetPercentageOfTwoStringMatching]

CREATE FUNCTION [dbo].[GetPercentageOfTwoStringMatching]
(
    @string1 NVARCHAR(100)
    ,@string2 NVARCHAR(100)
)
RETURNS INT
AS
BEGIN

    DECLARE @levenShteinNumber INT

    DECLARE @string1Length INT = LEN(@string1)
    , @string2Length INT = LEN(@string2)
    DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END

    SELECT @levenShteinNumber = [dbo].[LEVENSHTEIN] (   @string1  ,@string2)

    DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber

    DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters

    -- Return the result of the function
    RETURN @percentageOfGoodCharacters
END
GO

Query

    DECLARE @customer_dist TABLE
    (
        [id] [INT] NOT NULL ,
        [First_Name] [VARCHAR](50) NULL ,
        [Last_Name] [VARCHAR](50) NULL ,
        [DoB] [DATE] NULL ,
        [post_code] [VARCHAR](50) NULL ,
        [mobile] [VARCHAR](50) NULL ,
        [Email] [VARCHAR](100) NULL
    );

INSERT INTO @customer_dist ( id ,
                             First_Name ,
                             Last_Name ,
                             DoB ,
                             post_code ,
                             mobile ,
                             Email )
VALUES ( '84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559829' , '[email protected]' ) ,
       ( '84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559829' , '[email protected]' ) ,
       ( '84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559822' , '[email protected]' ) ,
       ( '84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA' ,
         '07089559829' , '[email protected]' ) ,
       ( '85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA' ,
         '07089559829' , '[email protected]' ) ,
       ( '87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS' ,
         '07077084692' , '[email protected]' ) ,
       ( '94015281', 'Christopher', 'Higg', '1956-01-13', 'NN2 1XH' ,
         '08009777337' , '[email protected]' );



SELECT result.* ,
       [dbo].GetPercentageOfTwoStringMatching(result.DoB, d.DoB) [DOB%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.post_code, d.post_code) [post_code%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.mobile, d.mobile) [mobile%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.Email, d.Email) [email%match]
FROM   (   SELECT (   SELECT MIN(id)
                      FROM   @customer_dist AS sq
                      WHERE  sq.First_Name = cd.First_Name
                             AND sq.Last_Name = cd.Last_Name
                             AND (   sq.mobile = cd.mobile
                                     OR sq.Email = cd.Email
                                     OR sq.post_code = cd.post_code )) nid ,
                  *
           FROM   @customer_dist AS cd ) AS result
       INNER JOIN @customer_dist d ON result.nid = d.id;

Result Result

Second query

    DECLARE @customer_dist TABLE
    (
        [id] [INT] NOT NULL ,
        [First_Name] [VARCHAR](50) NULL ,
        [Last_Name] [VARCHAR](50) NULL ,
        [DoB] [DATE] NULL ,
        [post_code] [VARCHAR](50) NULL ,
        [mobile] [VARCHAR](50) NULL ,
        [Email] [VARCHAR](100) NULL
    );

INSERT INTO @customer_dist ( id ,
                             First_Name ,
                             Last_Name ,
                             DoB ,
                             post_code ,
                             mobile ,
                             Email )
VALUES ( '84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559829' , '[email protected]' ) ,
       ( '84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559829' , '[email protected]' ) ,
       ( '84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559822' , '[email protected]' ) ,
       ( '84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA' ,
         '07089559829' , '[email protected]' ) ,
       ( '85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA' ,
         '07089559829' , '[email protected]' ) ,
       ( '87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS' ,
         '07077084692' , '[email protected]' ) ,
       ( '94015281', 'Christopher', 'Higg', '1956-01-13', 'NN2 1XH' ,
         '08009777337' , '[email protected]' );



SELECT result.* ,
       [dbo].GetPercentageOfTwoStringMatching(result.DoB, d.DoB) [DOB%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.post_code, d.post_code) [post_code%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.mobile, d.mobile) [mobile%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.Email, d.Email) [email%match]
FROM   (   SELECT (   SELECT MIN(id)
                      FROM   @customer_dist AS sq
                      WHERE  sq.First_Name = cd.First_Name
                             AND sq.Last_Name = cd.Last_Name
                             AND (  sq.DoB = cd.DoB   
                                     OR sq.mobile = cd.mobile
                                     OR sq.Email = cd.Email
                                     OR sq.post_code = cd.post_code )) nid ,
                  *
           FROM   @customer_dist AS cd ) AS result
       INNER JOIN @customer_dist d ON result.nid = d.id;

Result: enter image description here

Luffa answered 3/9, 2018 at 16:17 Comment(1)
Can see 2 issues but almost there nid 94015281/id 94015281, this should have an nid of 84015283 as first/last and dob match as well as Email. On this line the %Match seems off? Also as mentioned in the original question the check needs to look at previous matched records for example change line 5 so that the mobile reads 07089559822 and email to [email protected]. This changes the nid to 84070263, this nid matches line 3 id, but this nid is 84015283, this would need to be the nid associated with line 5.Kerwinn
U
0

Since you had mentioned that your "group" is primarily based on three columns: FirstName, LastName, and DOB, you can create a View to keep track of the minimum ID for all records, and use that view whenever you would like to perform additional processing.

You can also create a CTE. It all depends on how you intend to use your result set.

I will not try to update existing records in customer_dist table since it will serve as a raw table in case you would like to go back and look at the exact data that the users have entered at different points in time (if you care about statistics/data trends)

Query in either approach:

SELECT 
  MIN(id) AS Min_Id,
  LOWER(First_Name) AS firstName, LOWER(Last_Name) As lastName, DoB
FROM
customer_dist
GROUP BY 
LOWER(First_Name), LOWER(Last_Name), DoB;

View example

CTE example

Unholy answered 5/9, 2018 at 17:52 Comment(0)
S
0

If you use a UNION, it will be a heavy operation but can dedup duplicate rows.

Also, I'd strongly suggest looking at using "fuzzy logic" using SSIS. It is a much more proven method of identifying near duplicates. Here's just one example I was able to find on youtube to point you in the right direction. I hope this helps.

https://www.youtube.com/watch?v=eVOmXssmB7I

Sonics answered 6/10, 2018 at 23:10 Comment(0)
U
0

Probably the most the most elegant solution OVER PARTITION BY to match them. Normally that is straight forward if all of your criteria can be AND'd together. Since your requires some OR logic on the post_code, mobile, and email columns you'll need to add a few extra steps.

First find the MIN() match for each of the three ways they can match

    SELECT
        *,
        NID_post_code   = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, post_code),
        NID_mobile      = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, mobile),
        NID_email       = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, Email)
    FROM    @customer_dist
) AS cd

Now you have a result set that shows you each ID and the lowest matching ID based on three different sets of criteria: Lowest ID match on three different criteria

We know that the minimum matching ID across each of the three criteria is the one we want so...

Wrap your query with some cross apply style magic

SELECT
    NID = (
        SELECT
            MIN(NID)
        FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)
    ),
    cd.*
FROM    (
    SELECT
        *,
        NID_post_code   = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, post_code),
        NID_mobile      = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, mobile),
        NID_email       = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, Email)
    FROM    @customer_dist
) AS cd
order BY (
        SELECT
            MIN(NID)
        FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)

And you have your results: enter image description here

You can use these results to either create a lookup/xref table, or you can add a NID column to your original table and merge these results into it.

Complete query with some extra data for testing

DECLARE @customer_dist AS table (
    id          int             NOT NULL,
    First_Name  varchar(50)     NULL,
    Last_Name   varchar(50)     NULL,
    DoB         date            NULL,
    post_code   varchar(50)     NULL,
    mobile      varchar(50)     NULL,
    Email       varchar(100)    NULL
);


INSERT INTO @customer_dist ( id, First_Name , Last_Name, DoB, post_code, mobile, Email )
VALUES
    ( '32006455', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07706212920', '[email protected]' ),
    ( '35963960', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863324', '[email protected]' ), 
    ( '38627975', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863478', '[email protected]' ), 
    ( '46653041', 'Mary', 'WILSON', '1983-09-20', 'BT62JA', '07483888179', '[email protected]' ), 
    ( '48023677', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07483888179', '[email protected]' ), 
    ( '49560434', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07849727199', '[email protected]' ), 
    ( '49861032', 'Mary', 'WILSON', '1983-09-20', 'BT62JA', '07849727199', '[email protected]' ), 
    ( '53130969', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07849727199', '[email protected]' ), 
    ( '33843283', 'Mary', 'Wilson', '1983-09-20', 'BT148HU', '07484863478', '[email protected]' ), 
    ( '38627975', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863478', '[email protected]' ), 
    ( '84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]' ), 
    ( '84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', '[email protected]' ), 
    ( '84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', '[email protected]' ), 
    ( '84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', '[email protected]' ), 
    ( '85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', '[email protected]' ), 
    ( '84369605', 'Christopher', 'Higg', '1956-01-13', 'CH2 ZZZ', '07089559999', '[email protected]' ), 
    ( '84369677', 'Christopher', 'Higg', '1956-01-13', 'AH2 ZZZ', '09089559999', '[email protected]' ), 
    ( '87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', '[email protected]' ),
    ( '87065123', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', '[email protected]' )

SELECT
    NID = (
        SELECT
            MIN(NID)
        FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)
    ),
    cd.*
FROM    (
    SELECT
        *,
        NID_post_code   = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, post_code),
        NID_mobile      = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, mobile),
        NID_email       = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, Email)
    FROM    @customer_dist
) AS cd
order BY (
        SELECT
            MIN(NID)
        FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)
    )
Urey answered 29/9, 2020 at 17:51 Comment(0)
M
0

This ultimately seems like it is a data ranking problem to me. With that in mind we can use the DENSE_RANK windowing function to determine how to group our accounts together. The following example show how it can be potentially done.

;WITH CustomerDetailsRanked
AS
(
    SELECT 
        * 
        ,NoMatch = (
            CASE
                -- Use DENSE_RANK to calculate the number of times the post code repeat for a customers with the same First Name, Last Name and Date of birth. 
                -- If the count is greater than 1 then we know that this is a valid match and we set the NoMatch field value to 0.
                WHEN (DENSE_RANK() OVER (PARTITION BY First_Name, Last_Name, DoB, post_code ORDER BY id ASC) + DENSE_RANK() OVER (PARTITION BY First_Name, Last_Name, DoB, post_code ORDER BY id DESC) - 1) > 1 THEN 0
                
                -- Use DENSE_RANK to calculate the number of times mobile number repeat for a customers with the same First Name, Last Name and Date of birth. 
                -- If the count is greater than 1 then we know that this is a valid match and we set the NoMatch field value to 0.
                WHEN (DENSE_RANK() OVER (PARTITION BY First_Name, Last_Name, DoB, mobile ORDER BY id ASC) + DENSE_RANK() OVER (PARTITION BY First_Name, Last_Name, DoB, mobile ORDER BY id DESC) - 1) > 1 THEN 0
                
                -- Use DENSE_RANK to calculate the number of times the email repeat for a customers with the same First Name, Last Name and Date of birth. 
                -- If the count is greater than 1 then we know that this is a valid match and we set the NoMatch field value to 0.
                WHEN (DENSE_RANK() OVER (PARTITION BY First_Name, Last_Name, DoB, Email ORDER BY id ASC) + DENSE_RANK() OVER (PARTITION BY First_Name, Last_Name, DoB, Email ORDER BY id DESC) - 1) > 1 THEN 0
                
                -- Otherwize we set the NoMatch field to 1 since this record does not match another record in the supper set.
                ELSE 1
            END
        )
    FROM 
        @customer_dist

), CustomerDetailsNoMatchCount AS
(
    SELECT
        *
        -- Create a running total of all non-matching records partitioned by First_Name, Last_Name, DoB
        -- This will be used to help determine the final grouping IDs to group user accounts that match together
        ,NoMatchCount = SUM(NoMatch) OVER (PARTITION BY First_Name, Last_Name, DoB ORDER BY id)
    FROM    
        CustomerDetailsRanked
),CustomerDetailsWithGroupingID AS
(
SELECT
    *
    -- Use DENSE_RANK to Rank accounts to is similiar. The rank will also be used to group the account
    -- so that the new ID can be selected for the group.
    ,GroupById = DENSE_RANK() OVER (ORDER BY First_Name, Last_Name, DoB, NoMatchCount)
FROM
    CustomerDetailsNoMatchCount
)
SELECT
    -- USE the MIN OVER clause to find the minimum ID of the accounts that has been grouped together.
    NID = MIN(id) OVER (PARTITION BY GroupById)
    ,*
FROM
    CustomerDetailsWithGroupingID
ORDER BY NID, First_Name, Last_Name, DoB
Mikes answered 23/11, 2020 at 22:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.