SQL issue - calculate max days sequence
Asked Answered
R

10

6

There is a table with visits data:

uid (INT) | created_at (DATETIME)

I want to find how many days in a row a user has visited our app. So for instance:

SELECT DISTINCT DATE(created_at) AS d FROM visits WHERE uid = 123

will return:

     d      
------------
 2012-04-28
 2012-04-29
 2012-04-30
 2012-05-03
 2012-05-04

There are 5 records and two intervals - 3 days (28 - 30 Apr) and 2 days (3 - 4 May).

My question is how to find the maximum number of days that a user has visited the app in a row (3 days in the example). Tried to find a suitable function in the SQL docs, but with no success. Am I missing something?


UPD: Thank you guys for your answers! Actually, I'm working with vertica analytics database (http://vertica.com/), however this is a very rare solution and only a few people have experience with it. Although it supports SQL-99 standard.

Well, most of solutions work with slight modifications. Finally I created my own version of query:

-- returns starts of the vitit series 
SELECT t1.d as s FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
WHERE t2.d is null GROUP BY t1.d

          s          
---------------------
 2012-04-28 01:00:00
 2012-05-03 01:00:00

-- returns end of the vitit series 
SELECT t1.d as f FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
WHERE t2.d is null GROUP BY t1.d

          f          
---------------------
 2012-04-30 01:00:00
 2012-05-04 01:00:00

So now only what we need to do is to join them somehow, for instance by row index.

SELECT s, f, DATEDIFF(day, s, f) + 1 as seq FROM (
    SELECT t1.d as s, ROW_NUMBER() OVER () as o1 FROM testing t1
    LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
    WHERE t2.d is null GROUP BY t1.d
) tbl1 LEFT JOIN (
    SELECT t1.d as f, ROW_NUMBER() OVER () as o2 FROM testing t1
    LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
    WHERE t2.d is null GROUP BY t1.d
) tbl2 ON o1 = o2 

Sample output:

          s          |          f          | seq 
---------------------+---------------------+-----
 2012-04-28 01:00:00 | 2012-04-30 01:00:00 |   3
 2012-05-03 01:00:00 | 2012-05-04 01:00:00 |   2
Retract answered 4/5, 2012 at 11:33 Comment(2)
Tagged with two different sql implentations? Both MySQL and PostGreSQL have different abilities...Kenwood
+1 Good approach. On WHERE t2.d is null GROUP BY t1.d, you can remove the GROUP BY t1.d however :-) Your WHERE t2.d is null is already returning unique t1.d anywayBanausic
B
7

Another approach, the shortest, do a self-join:

with grouped_result as
(
    select 
       sr.d,
       sum((fr.d is null)::int) over(order by sr.d) as group_number
    from tbl sr
    left join tbl fr on sr.d = fr.d + interval '1 day'
)
select d, group_number, count(d) over m as consecutive_days
from grouped_result
window m as (partition by group_number)

Output:

          d          | group_number | consecutive_days 
---------------------+--------------+------------------
 2012-04-28 08:00:00 |            1 |                3
 2012-04-29 08:00:00 |            1 |                3
 2012-04-30 08:00:00 |            1 |                3
 2012-05-03 08:00:00 |            2 |                2
 2012-05-04 08:00:00 |            2 |                2
(5 rows)

Live test: http://www.sqlfiddle.com/#!1/93789/1

sr = second row, fr = first row ( or perhaps previous row? ). Basically we are doing a back tracking, it's a simulated lag on database that doesn't support LAG (Postgres supports LAG, but the solution is very long, as windowing doesn't support nested windowing). So in this query, we uses a hybrid approach, simulate LAG via join, then use SUM windowing against it, this produces group number

UPDATE

Forgot to put the final query, the query above illustrate the underpinnings of group numbering, need to morph that into this:

with grouped_result as
(
    select 
       sr.d,
       sum((fr.d is null)::int) over(order by sr.d) as group_number
    from tbl sr
    left join tbl fr on sr.d = fr.d + interval '1 day'
)
select min(d) as starting_date, max(d) as end_date, count(d) as consecutive_days
from grouped_result
group by group_number
-- order by consecutive_days desc limit 1


STARTING_DATE                END_DATE                     CONSECUTIVE_DAYS
April, 28 2012 08:00:00-0700 April, 30 2012 08:00:00-0700 3
May, 03 2012 08:00:00-0700   May, 04 2012 08:00:00-0700   2

UPDATE

I know why my other solution that uses window function became long, it became long on my attempt to illustrate the logic of group numbering and counting over the group. If I'd cut to the chase like in my MySql approach, that windowing function could be shorter. Having said that, here's my old windowing function approach, albeit better now:

with headers as
(
    select 
      d,lag(d) over m is null or d - lag(d) over m  <> interval '1 day' as header
    from tbl
    window m as (order by d)
)      
,sequence_group as
(
    select d, sum(header::int) over (order by d) as group_number
    from headers  
)
select min(d) as starting_date,max(d) as ending_date,count(d) as consecutive_days
from sequence_group
group by group_number
-- order by consecutive_days desc limit 1

Live test: http://www.sqlfiddle.com/#!1/93789/21

Banausic answered 4/5, 2012 at 13:24 Comment(4)
+1 : Very nice. Not sure I'd have though about a running total type of answer (to generate the group id's). Oh how nice it would be to have a client with such an up to date sql implementation. Would be interested to compare that to a ROW_NUMBER() based grouping mechanism (which avoids the join).Kenwood
I think there is a shorter (and faster) way with window functions and no JOIN than the one you link to. I posted an answer.Roehm
@Dems True, it's nice to have an up to date sql implementation. What's makes it more odd is the opensource/free database trumps the paid onesBanausic
@ErwinBrandstetter I had shorten my query now (and could be faster, the metrics lies in query profiling(checking the execution plan), or testing it in a production database). I just the do shortening of the code in the answer here. See my edit hereBanausic
D
2

In MySQL you could do this:

SET @nextDate = CURRENT_DATE;
SET @RowNum = 1;

SELECT MAX(RowNumber) AS ConecutiveVisits
FROM    (   SELECT  @RowNum := IF(@NextDate = Created_At, @RowNum + 1, 1) AS RowNumber,
                    Created_At,
                    @NextDate := DATE_ADD(Created_At, INTERVAL 1 DAY) AS NextDate
            FROM    Visits
            ORDER BY Created_At
        ) Visits

Example here:

http://sqlfiddle.com/#!2/6e035/8

However I am not 100% certain this is the best way to do it.

In Postgresql:

 ;WITH RECURSIVE VisitsCTE AS
 (  SELECT  Created_At, 1 AS ConsecutiveDays
    FROM    Visits
    UNION ALL
    SELECT  v.Created_At, ConsecutiveDays + 1
    FROM    Visits v
            INNER JOIN VisitsCTE cte
                ON 1 + cte.Created_At = v.Created_At
)
SELECT  MAX(ConsecutiveDays) AS ConsecutiveDays
FROM    VisitsCTE

Example here:

http://sqlfiddle.com/#!1/16c90/9

Datcha answered 4/5, 2012 at 11:58 Comment(1)
Instead of DATE_ADD() you could use just a simple +Bremsstrahlung
S
2

I know Postgresql has something similar to common table expressions as available in MSSQL. I'm not that familiar with Postgresql, but the code below works for MSSQL and does what you want.

create table #tempdates (
    mydate date
)

insert into #tempdates(mydate) values('2012-04-28')
insert into #tempdates(mydate) values('2012-04-29')
insert into #tempdates(mydate) values('2012-04-30')
insert into #tempdates(mydate) values('2012-05-03')
insert into #tempdates(mydate) values('2012-05-04');

with maxdays (s, e, c)
as
(
    select mydate, mydate, 1
    from #tempdates
    union all
    select m.s, mydate, m.c + 1
    from #tempdates t
    inner join maxdays m on DATEADD(day, -1, t.mydate)=m.e
)
select MIN(o.s),o.e,max(o.c)
from (
  select m1.s,max(m1.e) e,max(m1.c) c
  from maxdays m1
  group by m1.s
) o
group by o.e

drop table #tempdates

And here's the SQL fiddle: http://sqlfiddle.com/#!3/42b38/2

Songwriter answered 4/5, 2012 at 11:59 Comment(0)
A
2

All are very good answers, but I think I should contribute by showing another approach utilizing an analytical capability specific to Vertica (after all it is part of what you paid for). And I promise the final query is short.

First, query using conditional_true_event(). From Vertica's documentation:

Assigns an event window number to each row, starting from 0, and increments the number by 1 when the result of the boolean argument expression evaluates true.

The example query looks like this:

select uid, created_at, 
       conditional_true_event( created_at - lag(created_at) > '1 day' ) 
       over (partition by uid order by created_at) as seq_id
from visits;

And output:

uid  created_at           seq_id  
---  -------------------  ------  
123  2012-04-28 00:00:00  0       
123  2012-04-29 00:00:00  0       
123  2012-04-30 00:00:00  0       
123  2012-05-03 00:00:00  1       
123  2012-05-04 00:00:00  1       
123  2012-06-04 00:00:00  2       
123  2012-06-04 00:00:00  2     

Now the final query becomes easy:

select uid, seq_id, count(1) num_days, min(created_at) s, max(created_at) f
from
(
    select uid, created_at, 
       conditional_true_event( created_at - lag(created_at) > '1 day' ) 
       over (partition by uid order by created_at) as seq_id
    from visits
) as seq
group by uid, seq_id;

Final Output:

uid  seq_id  num_days  s                    f                    
---  ------  --------  -------------------  -------------------  
123  0       3         2012-04-28 00:00:00  2012-04-30 00:00:00  
123  1       2         2012-05-03 00:00:00  2012-05-04 00:00:00  
123  2       2         2012-06-04 00:00:00  2012-06-04 00:00:00  

One final note: num_days is actually number of rows of the inner query. If there are two '2012-04-28' visits in the original table (i.e. duplicates), you might want to work around that.

Ariellearies answered 20/12, 2012 at 20:48 Comment(0)
K
1

The following should be Oracle friendly, and not require recursive logic.

;WITH
  visit_dates (
    visit_id,
    date_id,
    group_id
  )
AS
(
  SELECT
    ROW_NUMBER() OVER (ORDER BY TRUNC(created_at)),
    TRUNC(SYSDATE) - TRUNC(created_at),
    TRUNC(SYSDATE) - TRUNC(created_at) - ROW_NUMBER() OVER (ORDER BY TRUNC(created_at))
  FROM
    visits
  GROUP BY
    TRUNC(created_at)
)
,
  group_duration (
    group_id,
    duration
  )
AS
(
  SELECT
    group_id,
    MAX(date_id) - MIN(date_id) + 1  AS duration
  FROM
    visit_dates
  GROUP BY
    group_id
)
SELECT
  MAX(duration)  AS max_duration
FROM
  group_duration
Kenwood answered 4/5, 2012 at 13:1 Comment(0)
B
1

Postgresql:

with headers as
(
    select 
        d,
        lag(d) over m is null or d - lag(d) over m  <> interval '1 day' as header

    from tbl
    window m as (order by d)
)      
,sequence_group as
(
    select d, sum(header::int) over m as group_number 
    from headers
    window m as (order by d)
)
,consecutive_list as
(
    select d, group_number, count(d) over m as consecutive_count
    from sequence_group 
    window m as (partition by group_number)
)
select * from consecutive_list

Divide-and-conquer approach: 3 steps

1st step, find headers:

with headers as
(
    select 
        d,
        lag(d) over m is null or d - lag(d) over m  <> interval '1 day' as header

    from tbl
    window m as (order by d)
)
select * from headers

Output:

          d          | header 
---------------------+--------
 2012-04-28 08:00:00 | t
 2012-04-29 08:00:00 | f
 2012-04-30 08:00:00 | f
 2012-05-03 08:00:00 | t
 2012-05-04 08:00:00 | f
(5 rows)

2nd step, designate grouping:

with headers as
(
    select 
        d,
        lag(d) over m is null or d - lag(d) over m  <> interval '1 day' as header

    from tbl
    window m as (order by d)
)      
,sequence_group as
(
    select d, sum(header::int) over m as group_number 
    from headers
    window m as (order by d)
)
select * from sequence_group

Output:

          d          | group_number 
---------------------+--------------
 2012-04-28 08:00:00 |            1
 2012-04-29 08:00:00 |            1
 2012-04-30 08:00:00 |            1
 2012-05-03 08:00:00 |            2
 2012-05-04 08:00:00 |            2
(5 rows)

3rd step, count max days:

with headers as
(
    select 
        d,
        lag(d) over m is null or d - lag(d) over m  <> interval '1 day' as header

    from tbl
    window m as (order by d)
)      
,sequence_group as
(
    select d, sum(header::int) over m as group_number 
    from headers
    window m as (order by d)
)
,consecutive_list as
(
select d, group_number, count(d) over m as consecutive_count
from sequence_group 
window m as (partition by group_number)
)
select * from consecutive_list

Output:

          d          | group_number | consecutive_count 
---------------------+--------------+-----------------
 2012-04-28 08:00:00 |            1 |               3
 2012-04-29 08:00:00 |            1 |               3
 2012-04-30 08:00:00 |            1 |               3
 2012-05-03 08:00:00 |            2 |               2
 2012-05-04 08:00:00 |            2 |               2
(5 rows)
Banausic answered 4/5, 2012 at 13:7 Comment(0)
B
1

This is for MySQL, the shortest, and uses minimal variable (one variable only):

select 
   min(d) as starting_date, max(d) as ending_date, 
   count(d) as consecutive_days
from
(
  select 
     sr.d,
     IF(fr.d is null,@group_number := @group_number + 1,@group_number) 
        as group_number
  from tbl sr
  left join tbl fr on sr.d = adddate(fr.d,interval 1 day)
  cross join (select @group_number := 0) as grp
) as x
group by group_number

Output:

STARTING_DATE                  ENDING_DATE                  CONSECUTIVE_DAYS
April, 28 2012 08:00:00-0700   April, 30 2012 08:00:00-0700 3
May, 03 2012 08:00:00-0700     May, 04 2012 08:00:00-0700   2

Live test: http://www.sqlfiddle.com/#!2/65169/1

Banausic answered 4/5, 2012 at 14:30 Comment(0)
R
1

For PostgreSQL 8.4 or later, there is a short and clean way with window functions and no JOIN.
I'd expect this to be the fastest solution posted so far:

WITH x AS (
    SELECT created_at AS d
         , lag(created_at) OVER (ORDER BY created_at) = (created_at - 1) AS nu
    FROM   visits
    WHERE  uid = 1
    )
   , y AS (
    SELECT d, count(NULLIF(nu, TRUE)) OVER (ORDER BY d) AS seq
    FROM   x
    )
SELECT count(*) AS max_days, min(d) AS seq_from,  max(d) AS seq_to
FROM   y
GROUP  BY seq
ORDER  BY 1 DESC
LIMIT  1;

Returns:

max_days | seq_from   | seq_to
---------+------------+-----------
3        | 2012-04-28 | 2012-04-30

Assuming that created_at is a date and unique.

  1. In CTE x: for every day our user visits, check if he was here yesterday, too. To calculate "yesterday" just use created_at - 1 The first row is a special case and will produce NULL here.

  2. In CTE y: calculate a running count of "days without yesterday so far" (seq) for every day. NULL values don't count, so count(NULLIF(nu, TRUE)) is the fastes and shortest way, also covering the special case.

  3. Finally, group days per seq and count the days. While being at it I added first and last day of the sequence. ORDER BY length of the sequence, and pick the longest one.

Roehm answered 4/5, 2012 at 16:8 Comment(0)
K
1

There have already been several answers to this question. However the SQL statements all seem too complex. This can be accomplished with basic SQL, a way to enumerate rows, and some date arithmetic.

The key observation is that if you have a bunch of days and have a parallel sequence of integers, then the difference is a constant date when the days are in a sequence.

The following query uses this observation to answer the original question:

select uid, min(d) as startdate, count(*) as numdaysinseq
from 
(
   select uid, d, adddate(d, interval -offset day) as groupstart
   from 
   (
     select uid, d, row_number() over (partition by uid order by date) as offset
     from 
     (
       SELECT DISTINCT uid, DATE(created_at) AS d
       FROM visits
     ) t
   ) t
) t

Alas, mysql does not have the row_number() function. However, there is a work-around with variables (and most other databases do have this function).

Kudu answered 5/5, 2012 at 3:44 Comment(0)
B
1

Upon seeing OP's query approach for their Vertica database, I tried making the two joins run at the same time:

These Postgresql and Sql Server query versions shall both work in Vertica

Postgresql version:

select 
  min(gr.d) as start_date,
  max(gr.d) as end_date,
  date_part('day', max(gr.d) - min(gr.d))+1 as consecutive_days
from 
(
  select 
  cr.d, (row_number() over() - 1) / 2 as pair_number
  from tbl cr   
  left join tbl pr on pr.d = cr.d - interval '1 day'
  left join tbl nr on nr.d = cr.d + interval '1 day'
  where pr.d is null <> nr.d is null
) as gr
group by pair_number
order by start_date

Regarding pr.d is null <> nr.d is null. It means, it's either the previous row is null or next row is null, but they can never both be null, so this basically removes the non-consecutive dates, as non-consecutive dates' previous & next row are nulls (and this basically gives us all dates that are just headers and footers only). This is also called an XOR operation

If we are left with consecutive dates only, we can now pair them via row_number:

(row_number() over() - 1) / 2 as pair_number

row_number() starts with 1, we need to subtract it with 1 (we can also add with 1 instead), then we divide it by two; this makes the paired date adjacent to each other

Live test: http://www.sqlfiddle.com/#!1/fc440/7


This is the Sql Server version:

select 
  min(gr.d) as start_date,
  max(gr.d) as end_date,
  datediff(day, min(gr.d),max(gr.d)) +1 as consecutive_days
from 
(
  select 
     cr.d, (row_number() over(order by cr.d) - 1) / 2 as pair_number
  from tbl cr   
  left join tbl pr on pr.d = dateadd(day,-1,cr.d)
  left join tbl nr on nr.d = dateadd(day,+1,cr.d)
  where         
       case when pr.d is null then 1 else 0 end
    <> case when nr.d is null then 1 else 0 end
) as gr
group by pair_number
order by start_date

Same logic as above, except for artificial differences on date functions. And sql Server requires an ORDER BY clause on its OVER, while Postgresql's OVER can be left empty.

Sql Server has no first class boolean, that's why we cannot compare booleans directly:

pr.d is null <> nr.d is null

We must do this in Sql Server:

   case when pr.d is null then 1 else 0 end
<> case when nr.d is null then 1 else 0 end

Live test: http://www.sqlfiddle.com/#!3/65df2/17

Banausic answered 5/5, 2012 at 14:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.