Delete "duplicate" rows in SQL Server 2010
Asked Answered
S

7

5

I made a mistake in a bulk insert script, so now i have "duplicate" rows with different colX. I need to delete this duplicate rows, but I cant figure out how. To be more precise, I have this:

 col1 | col2 | col3 | colX      
----+----------------------
  0   |  1   |  2   |  a
  0   |  1   |  2   |  b
  0   |  1   |  2   |  c
  0   |  1   |  2   |  a
  3   |  4   |  5   |  x
  3   |  4   |  5   |  y
  3   |  4   |  5   |  x
  3   |  4   |  5   |  z

and I want to keep the first occurrence of each (row, colX):

 col1 | col2 | col3 | colX      
----+----------------------
  0   |  1   |  2   |  a
  3   |  4   |  5   |  x

Thank you for your replies :)

Sinistrorse answered 21/6, 2012 at 2:59 Comment(5)
Database tables have no concept of row order. Do you want to sort by min(colX) and keep those rows? Is there a timestamp column on the row?Workbook
What version of SQL Server are you using? As far as I know, there is no SQL Server 2010.Step
If you have ` 0 | 1 | 3 | a` on your data, should that be kept? or should it be removed?Catty
SqlServer CTE would be good to go with! @Elik, you were right to mention it.Rayshell
I'm always use a primary unique key for all the tables. This is good practice for issues like this.Vane
C
10

Try the simplest approach with Sql Server's CTE: http://www.sqlfiddle.com/#!3/2d386/2

Data:

CREATE TABLE tbl
    ([col1] int, [col2] int, [col3] int, [colX] varchar(1));

INSERT INTO tbl
    ([col1], [col2], [col3], [colX])
VALUES
    (0, 1, 2, 'a'),
    (0, 1, 2, 'b'),
    (0, 1, 2, 'c'),
    (0, 1, 2, 'a'),
    (3, 4, 5, 'x'),
    (3, 4, 5, 'y'),
    (3, 4, 5, 'x'),
    (3, 4, 5, 'z');

Solution:

select * from tbl;

with a as
(
  select row_number() over(partition by col1 order by col2, col3, colX) as rn 
  from tbl   
)
delete from a where rn > 1;

select * from tbl;

Output:

| COL1 | COL2 | COL3 | COLX |
-----------------------------
|    0 |    1 |    2 |    a |
|    0 |    1 |    2 |    b |
|    0 |    1 |    2 |    c |
|    0 |    1 |    2 |    a |
|    3 |    4 |    5 |    x |
|    3 |    4 |    5 |    y |
|    3 |    4 |    5 |    x |
|    3 |    4 |    5 |    z |


| COL1 | COL2 | COL3 | COLX |
-----------------------------
|    0 |    1 |    2 |    a |
|    3 |    4 |    5 |    x |

Or perhaps this: http://www.sqlfiddle.com/#!3/af826/1

Data:

CREATE TABLE tbl
    ([col1] int, [col2] int, [col3] int, [colX] varchar(1));

INSERT INTO tbl
    ([col1], [col2], [col3], [colX])
VALUES
    (0, 1, 2, 'a'),
    (0, 1, 2, 'b'),
    (0, 1, 2, 'c'),
    (0, 1, 2, 'a'),
    (0, 1, 3, 'a'),
    (3, 4, 5, 'x'),
    (3, 4, 5, 'y'),
    (3, 4, 5, 'x'),
    (3, 4, 5, 'z');

Solution:

select * from tbl;


with a as
(
    select row_number() over(partition by col1, col2, col3 order by colX) as rn 
    from tbl   
)
delete from a where rn > 1;

select * from tbl;

Output:

| COL1 | COL2 | COL3 | COLX |
-----------------------------
|    0 |    1 |    2 |    a |
|    0 |    1 |    2 |    b |
|    0 |    1 |    2 |    c |
|    0 |    1 |    2 |    a |
|    0 |    1 |    3 |    a |
|    3 |    4 |    5 |    x |
|    3 |    4 |    5 |    y |
|    3 |    4 |    5 |    x |
|    3 |    4 |    5 |    z |

| COL1 | COL2 | COL3 | COLX |
-----------------------------
|    0 |    1 |    2 |    a |
|    0 |    1 |    3 |    a |
|    3 |    4 |    5 |    x |
Catty answered 21/6, 2012 at 4:8 Comment(0)
M
2

I would suggest to use CTE and read all non-dup records in a separate table if you have many duplicates. However, there is a recommended post to follow: MSDN

Majorette answered 21/6, 2012 at 3:10 Comment(1)
It looks like you were first to mention sql "CTE" approach, which is the simplest and in most scenarios it works.Rayshell
W
2

If you are OK with just keeping the minimum value of colX, you can do this:

delete t from t inner join 
    (select  min(colx) mincolx, col1, col2, col3
     from t
     group by col1, col2, col3
     having count(1) > 1) as duplicates
   on (duplicates.col1 = t.col1
   and duplicates.col2 = t.col2
   and duplicates.col3 = t.col3
   and duplicates.mincolx <> t.colx)

The problem is that you still have rows where all four columns are the same. To get rid of these, after running the first query, you then have to use a temp table.

SELECT distinct col1, col2, col3, colx 
INTO temp
  FROM (SELECT col1, col2, col3
         from t 
         group by col1, col2, col3
         having count(1) > 1) subq;

DELETE from t where exists 
   (select 1 from temp 
     where temp.col1 = t.col1 
       and temp.col2 = t.col2 
       and temp.col3 = t.col3);

Here's an example SQLFiddle.

Workbook answered 21/6, 2012 at 3:25 Comment(0)
P
1

Assuming colX is unique (which is not the case in your example, even though you said "different colX") you could use the following to delete the duplicates:

;with cteDuplicates as
(
    select 
        *,
        row_number() over (partition by col1, col2, col3 order by colX) as ID
    from Duplicates
)
delete D from Duplicates D
    inner join cteDuplicates C on C.colX = D.Colx
where ID > 1

(Let's say your table is named "Duplicates")

If colX is not unique, add a new uniqueidentifier column, insert distinct values into it and then use the code above by joining on that column instead of colX.

Paragrapher answered 21/6, 2012 at 3:16 Comment(0)
V
0

I assume you're using SQL Server 2005/2008.

SELECT col1,
       col2,
       col3,
       colx
FROM
  (SELECT *,
          row_number() OVER (PARTITION BY col1,col2,col3
                             ORDER BY colx) AS r
   FROM table_name) a
WHERE r = 1;
Varicella answered 21/6, 2012 at 4:29 Comment(0)
T
0

Simplest solution could be as follows suppose we have table emp_dept(empid, deptid) which has duplicate rows, On Oracle database

  delete from emp_dept where exists ( select * from emp_dept i where i.empid = emp_dept.empid and i.deptid = emp_dept.deptid and i.rowid < emp_dept.rowid ) 

On sql server or anydatabase which does not support row id kinda feature , we need to add identity column just to identify each row. say we have added nid as identity to the table

alter table emp_dept add nid int identity(1,1) -- to add identity column

now query to delete duplicate could be written as

  delete from emp_dept where exists ( select * from emp_dept i where i.empid = emp_dept.empid and i.deptid = emp_dept.deptid and i.nid< emp_dept.nid ) 

Here the concept is delete all rows for which there exists other rows which have similar core values but smaller rowid or identity. Hence if there exists duplicate rows then one which has higher row id or identity will get deleted. and for row there isn't duplicate it fail in finding lower row id hence will not get deleted.

Towne answered 11/6, 2013 at 6:19 Comment(0)
Y
0

Try this code bt on your own risk

Delete from Table_name
WHERE Table_name.%%physloc%%
      NOT IN (SELECT MAX(b.%%physloc%%)
              FROM   Table_name b
              group by Col_1,Col_2)

Second method using row_number() this is safe method

WITH CTE_Dup AS
(

 SELECT * ROW_NUMBER()OVER (PARTITIONBY SalesOrderno, ItemNo ORDER BY SalesOrderno, ItemNo)
 AS ROW_NO
 from dbo.SalesOrderDetails
)
Delete FROM CTE_Dup;
Yoshi answered 8/12, 2014 at 17:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.