SQL query for run-length, or consecutive identical value encoding
Asked Answered
C

2

10

My goal is to take a set of data that is ordered by id and return a resultset that indicates the number of consecutive rows where the val column is identical. E.g. given this data:

| id | val |
|  1 |  33 |
|  2 |  33 |
|  3 |  44 |
|  4 |  28 |
|  5 |  44 |
|  6 |  44 |

I would like to see this result:

| id | val | run_length |
| 1  | 33  | 2          |
| 3  | 44  | 1          |
| 4  | 28  | 1          |
| 5  | 44  | 2          |

The id column in the resultset is optional. In fact, if it makes it significantly harder, then just leave that column out of the result. I sort of like having it because it "pins" the resultset to a particular location in the table.

I am primarily interested in the result in free database engines. My order of preference for a solution is:

  1. SQLite
  2. Postgres
  3. MySQL
  4. Oracle
  5. SQL Server
  6. Sybase
Canica answered 14/6, 2015 at 13:45 Comment(0)
U
15

I'll choose #2 on your list, because this is incredibly painful to do in SQLite with a single query. The following is standard SQL:

select min(id), val, count(*) as runlength
from (select t.*,
             (row_number() over (order by id) -
              row_number() over (partition by val order by id)
             ) as grp
      from data t
     ) t
group by grp, val;

This uses the difference of two row number calculations to identify the seuqnces of identical values. It should work in the recent versions of databases 2, 4, 5, and 6.

Urus answered 14/6, 2015 at 13:59 Comment(1)
A note from the future - this does work now in SQLite as of 3.25.0 (2018-09-15). cf sqlite.org/windowfunctions.htmlStressful
D
-1

I've been wandering around in RLE space in SQLITE and ran across this post. I believe this code works for #1. The first answer is correct this is a bit painful in SQLite as a single query.

create table example (id integer primary key autoincrement, val integer);

insert into example (val) values (33);
insert into example (val) values (33);
insert into example (val) values (44);
insert into example (val) values (28);
insert into example (val) values (44);
insert into example (val) values (44);


select ren.low_id, e2.val, (ren.high_id - ren.low_id)+1
from example e2
inner join (
select min(hb.low_id) as low_id, hb.high_id as high_id
from 
(
    with nexample(low_id, high_id, val) 
    as 
    (
    select e.id, e.id, e.val from example e
    union all
    select ne.low_id, eu.id, ne.val 
    from nexample ne
    inner join example eu on eu.id = ne.high_id+1 AND eu.val=ne.val
    )
    select ne.low_id, max(ne.high_id) as high_id from nexample ne
    group by ne.low_id
) hb
group by hb.high_id
) ren on ren.low_id = e2.id;

Output:

1|33|2
3|44|1
4|28|1
5|44|2

Note this solution doesn't perform well on very sparse sets... I'm looking for an alternative approach to dealing with sparse sets.

For example on a set of 10000 rows with a val set of [0,1] but all values are 0. This code takes ~2 minutes 30sec to run on my hardware. Not great.

Decuple answered 28/2, 2017 at 0:6 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.