SQL to select consecutive records with the same value
Asked Answered
I

3

8

I have a table with measurements. Measurement is done every minute. I need to select only rows having the same sample_value more than once consecutively for the same device_id.

Here are initial data:

    sample_date sample_time device_id   sample_value
    20180701    1010        111         11
    20180701    1011        111         12
    20180701    1012        111         13
    20180701    1013        222         11
    20180701    1014        222         11
    20180701    1015        222         12
    20180701    1016        111         12
    20180701    1017        111         11
    20180701    1018        222         13
    20180701    1019        222         12
    20180701    1020        222         13
    20180701    1021        222         12
    20180701    1022        222         12
    20180701    1023        111         12
    20180701    1024        111         13
    20180701    1025        111         13
    20180701    1026        111         12
    20180701    1027        111         13
    20180701    1028        222         14
    20180701    1029        222         13
    20180701    1030        222         14
    20180701    1031        222         14
    20180701    1032        222         14
    20180701    1033        222         14
    20180701    1034        222         14
    20180701    1035        222         14
    20180701    1036        111         13
    20180701    1037        111         13
    20180701    1038        111         14
    20180701    1039        111         13

This is result I'm looking for:

sample_date sample_time device_id   sample_value
20180701    1013        222         11
20180701    1014        222         11
20180701    1021        222         12
20180701    1022        222         12
20180701    1024        111         13
20180701    1025        111         13
20180701    1030        222         14
20180701    1031        222         14
20180701    1032        222         14
20180701    1033        222         14
20180701    1034        222         14
20180701    1035        222         14
20180701    1036        111         13
20180701    1037        111         13

Here are test data:

IF OBJECT_ID('samples', 'U') IS NOT NULL 
DROP TABLE samples; 

create table samples (
sample_date int,
sample_time int,
device_id int,
sample_value int
)

insert samples
values
(20180701, 1010, 111, 11)
,(20180701, 1011, 111, 12)
,(20180701, 1012, 111, 13)
,(20180701, 1013, 222, 11)
,(20180701, 1014, 222, 11)
,(20180701, 1015, 222, 12)
,(20180701, 1016, 111, 12)
,(20180701, 1017, 111, 11)
,(20180701, 1018, 222, 13)
,(20180701, 1019, 222, 12)
,(20180701, 1020, 222, 13)
,(20180701, 1021, 222, 12)
,(20180701, 1022, 222, 12)
,(20180701, 1023, 111, 12)
,(20180701, 1024, 111, 13)
,(20180701, 1025, 111, 13)
,(20180701, 1026, 111, 12)
,(20180701, 1027, 111, 13)
,(20180701, 1028, 222, 14)
,(20180701, 1029, 222, 13)
,(20180701, 1030, 222, 14)
,(20180701, 1031, 222, 14)
,(20180701, 1032, 222, 14)
,(20180701, 1033, 222, 14)
,(20180701, 1034, 222, 14)
,(20180701, 1035, 222, 14)
,(20180701, 1036, 111, 13)
,(20180701, 1037, 111, 13)
,(20180701, 1038, 111, 14)
,(20180701, 1039, 111, 13)

select * from samples

Here is SQL I'm trying to use, but I don't know how to set correct partitioning.

    select *
    from (select    sample_date,
                    sample_time,
                    device_id,
                    sample_value,
                    row_number() over (partition by sample_date,
                                                    device_id,
                                                    sample_value
                                            order by sample_date,
                                                    sample_time,
                                                    device_id) as occurrence
    from samples) t
    where     occurrence > 1

Similar topics:

Select statement to find duplicates on certain fields

How to find consecutive rows based on the value of a column?

Ioneionesco answered 22/8, 2018 at 10:14 Comment(2)
Why there is no row in your desired output for data where device_id = 111 and sample_value = 11? In that case that sample_value also occurs more than once for that same device_id.Cudgel
First record with device_id = 111 and sample_value = 11 is recorded 20180701 at 1010. Second record 20180701 at 1017. They are not consecutive. There is another value in between.Ioneionesco
B
1

If you wanted to do this without using LEAD or LAG then you could do something like this instead:

WITH Ordered AS (
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY sample_date, sample_time) AS order_id
    FROM
        samples)
SELECT
    s1.sample_date,
    s1.sample_time,
    s1.device_id,
    s1.sample_value
FROM
    Ordered s1
    INNER JOIN Ordered s2 ON s2.device_id = s1.device_id AND s2.sample_value = s1.sample_value AND s2.order_id = s1.order_id + 1
UNION
SELECT
    s2.sample_date,
    s2.sample_time,
    s2.device_id,
    s2.sample_value
FROM
    Ordered s1
    INNER JOIN Ordered s2 ON s2.device_id = s1.device_id AND s2.sample_value = s1.sample_value AND s2.order_id = s1.order_id + 1
ORDER BY
    1, 2;

Results are:

sample_date sample_time device_id   sample_value
20180701    1013        222         11
20180701    1014        222         11
20180701    1021        222         12
20180701    1022        222         12
20180701    1024        111         13
20180701    1025        111         13
20180701    1030        222         14
20180701    1031        222         14
20180701    1032        222         14
20180701    1033        222         14
20180701    1034        222         14
20180701    1035        222         14
20180701    1036        111         13
20180701    1037        111         13
Billbug answered 22/8, 2018 at 10:47 Comment(1)
Thanks. It works! I will test it on several millions records to see how performance is affected by "union".Ioneionesco
V
0

I think you want to use lag()/lead():

select s.*
from (select s.*,
             lag(device_id) over (order by sample_date, sample_time) as prev_di,
             lead(device_id) over (order by sample_date, sample_time) as next_di,
             lag(sample_value) over (order by sample_date, sample_time) as prev_sv,
             lead(sample_value) over (order by sample_date, sample_time) as next_sv
      from samples s
     ) s
where (prev_sv = sample_value and prev_di = device_id) or
      (next_sv = sample_value and prev_di = device_id);

Here is a SQL Fiddle.

If you specifically want the adjacent row to be the next time unit, you can use exists:

select s.*
from samples s
where exists (select 1
              from samples s2 
              where s2.sample_date = s.sample_date and
                    s2.sample_time in (s.sample_time - 1, s.sample_time + 1
             );
Vaseline answered 22/8, 2018 at 10:32 Comment(2)
Thanks! I tried lag()/lead() version. Almost worked. Record with sample_time = 1027 should be excluded.Ioneionesco
@TaxMax . . . I misunderstood what you meant by the same device_id.Vaseline
K
0

You could try this query:

select date_time,
       device_id,
       sample_value
from ( 
    select date_time,
           device_id,
           sample_value,
           COUNT(*) over (partition by rnDiff) cnt
    from (
        select date_time,
               device_id,
               sample_value,
               ROW_NUMBER() over (order by date_time) -
               ROW_NUMBER() over (partition by device_id, sample_value order by date_time) rnDiff
        from (
            select DATETIMEFROMPARTS(sample_date/10000,(sample_date/100)%100,sample_date%100,sample_time/100,sample_time%100,0,0) date_time,
                   device_id,
                   sample_value
            from samples
        ) a 
    ) a
) a where cnt > 1
order by date_time

In the most inner query I convert your date and time columns to datetime format, so I can order by it easily. Then I use row_number() function to make distinction between group with same sample_value, at last in most outer query I use COUNT(*) over (partition by rnDiff) to count distinct values.

Kinchinjunga answered 22/8, 2018 at 11:5 Comment(1)
Interesting solution! :) But this returns too many rows. E.g. I got row where sample_time = 1015. This row is not needed because sample_value = 12 is not consecutive.Ioneionesco

© 2022 - 2024 — McMap. All rights reserved.