Delete rows following a duplicate

Asked 25/5, 2011 at 11:14 Answered 25/5, 2011 at 12:2

Solved sql sql-server sql-server-2005 sql-delete gaps-and-islands

I have a list of user login and logout stamps. Unfortunately a LOGIN entry might not always be followed by a LOGOUT entry.
I wish to delete any row which has the same [event] and [user_id] as previous row when ordered by [event_date] Any suggestions on how to do this?

Example table

CREATE TABLE #LOG (
    [id] int IDENTITY(1,1),
    [user_id] int,
    [event] varchar(50),
    [event_date] datetime
);
INSERT INTO #LOG ([user_id], [event], [event_date])
SELECT 1,'LOGIN',{ts '2010-12-15 15:31:59'}
UNION ALL SELECT 1,'LOGOUT',{ts '2010-12-15 15:32:55'}
UNION ALL SELECT 1,'LOGIN',{ts '2010-12-15 15:38:04'}
UNION ALL SELECT 1,'LOGOUT',{ts '2010-12-15 15:38:17'}
UNION ALL SELECT 1,'LOGOUT',{ts '2010-12-15 15:38:45'} -- Delete
UNION ALL SELECT 2,'LOGIN',{ts '2010-12-15 16:59:39'}
UNION ALL SELECT 2,'LOGOUT',{ts '2010-12-15 17:00:08'}
UNION ALL SELECT 2,'LOGOUT',{ts '2010-12-15 17:00:39'} -- Delete
UNION ALL SELECT 2,'LOGOUT',{ts '2010-12-15 17:01:16'} -- Delete
UNION ALL SELECT 2,'LOGIN',{ts '2010-12-15 17:01:38'}
UNION ALL SELECT 2,'LOGIN',{ts '2010-12-15 17:02:26'} -- Delete
UNION ALL SELECT 2,'LOGOUT',{ts '2010-12-15 17:02:39'}

Linseed answered 25/5, 2011 at 11:14 Comment(7)

+1 for teaching me some new syntax. I have never seen the {ts 'xxx'}before. – Intercalate 25/5, 2011 at 11:18

What is this {ts 'xxx'}? – Scratchboard 25/5, 2011 at 11:26

By the looks of it, {ts 'xxx'} seems to relate to a timestamp, very interesting! – Hoebart 25/5, 2011 at 11:38

It allows the parser to determine the datatype, so as not to confuse it as a string, which may then need implicit conversion to a date. – Fernandez 25/5, 2011 at 11:41

Thanks @Dems, where can I find details about this? I would like to read more about it. Any link? – Scratchboard 25/5, 2011 at 11:44

Start with this? snippets.dzone.com/posts/show/754 – Fernandez 25/5, 2011 at 11:59

Aah, the {ts 'xxx'} is ODBC Datetime Format . You can find it here: msdn.microsoft.com/en-us/library/ms190234%28v=sql.90%29.aspx – Linseed 25/5, 2011 at 12:7

;WITH T1 AS
(
SELECT * , 
        ROW_NUMBER() OVER (ORDER BY event_date)-
        ROW_NUMBER() OVER (PARTITION BY [user_id], [event] 
                               ORDER BY event_date) AS Grp
FROM #LOG
),T2 AS
(
SELECT 
   ROW_NUMBER() OVER (PARTITION BY [user_id], [event], Grp 
                          ORDER BY event_date) RN
FROM T1
)
DELETE FROM T2 
WHERE RN > 1

Dupondius answered 25/5, 2011 at 11:43 Comment(8)

+1 : Over a small dataset, without an index, this does appear more effecient than other options. Which impresses me, I'd expect it wasn't, need to play with how SQL Server optimises ROW_NUMBER() a bit more :) With a covering index on user_id, event_date, however, the version without ROW_NUMBER() seems more efficient. – Fernandez 25/5, 2011 at 11:57

I know a little about CTE. I have also used this in my recursive query. But how deleting from T2 effecting #LOG? What am I missing? – Scratchboard 25/5, 2011 at 12:1

@Dems - Thanks, I haven't tested performance I'll leave that to the OP against their data. I saw on twitter that SQL Server 2011 will support lag and increase support for windowed functions generally which I think should make this type of query easier. @Muhammad deleting from a CTE acts the same as deleting from a View. It deletes the corresponding row in the base table. – Dupondius 25/5, 2011 at 12:5

@Muhammad : Imagine the CTE as a View with Schema Binding. By deleteing from the view, you are deleting from the underlying tables. – Fernandez 25/5, 2011 at 12:11

@Martin : Expanding the windowed function support will be good. I can start eating into my colleagues claims that SAS can do statistical work that SQL Sever sucks at :) – Fernandez 25/5, 2011 at 12:12

@Dems - Adam Machanic tweeted this slide from an Itzik presentation on what (apparently) is to be expected in the next Denali CTP – Dupondius 25/5, 2011 at 12:15

@Martin: Most all solutions worked perfectly. This solution did however also teach me to use multiple CTE's. Thanks for the help all! – Linseed 25/5, 2011 at 12:17

@Martin - Itzik has been holding a torch for those features for years and he is right, they are needed. – Cid 25/5, 2011 at 16:3

My understanding is that you want to delete entries such that the pattern is always In,Out,In,Out,etc.

This means a record is deleted if the preceding record (when order by user_id, then event_date), is of the same event.

There are two options I'd use to go about this...

DELETE
  #log
WHERE
  event = (
           SELECT
             TOP 1
             event
           FROM
             #log AS [preceding]
           WHERE
             [preceding].user_id = #log.user_id
             AND [preceding].event_date < #log.event_date
           ORDER BY
             [preceding].event_date DESC
          )

Or...

WITH ordered_log AS (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_date) AS user_event_id,
    *
  FROM
    #log
)

DELETE
  ordered_log
FROM
  ordered_log
INNER JOIN
  ordered_log   AS [preceding]
    ON  [preceding].login_id      = [ordered_log].login_id
    AND [preceding].user_event_id = [ordered_log].user_event_id - 1
WHERE
  [preceding].event = [ordered_log].event

Either way, I highly recommend an Index covering user_id then event_date.

Note: The first version does not cope with the possibility of two events having the same timestamp. The latter, however, does.

Fernandez answered 25/5, 2011 at 11:19 Comment(0)

Using the ROW_NUMBER functionality of SQL Server would be an option

SQL Statement

;WITH q AS (
    SELECT  Rownumber = ROW_NUMBER() OVER (ORDER BY user_id, event_date)
    , user_id
    , event
    , event_date
    FROM    #LOG
)
DELETE FROM #LOG
FROM    #LOG l
        INNER JOIN (
            SELECT  q2.*
            FROM    q q1
                    INNER JOIN q q2 ON  q2.Rownumber = q1.Rownumber + 1
                                        AND q2.user_id = q1.user_id
                                        AND q2.event = q1.event
        ) q ON  q.user_id = l.user_id
                AND q.event_date = l.event_date

SELECT  *
FROM    #LOG

Intercalate answered 25/5, 2011 at 11:23 Comment(0)

If you have to delete duplicate row. Then no need to set the order by clause.

Try below

Delete l from #LOG l
Inner Join 
(
    Select id from #LOG l
    Inner Join(
        Select user_id, event from #LOG 
        group by user_id, event
        having COUNT(user_id) > 1 and COUNT(event) > 1
    )T
    on (l.user_id = t.user_id) and (l.event = t.event)
)T
on T.id = l.id

Bazaar answered 25/5, 2011 at 12:2 Comment(0)

Recommended topics

Hot tags