Longest Consecutive Days Count for BigQuery
Asked Answered
K

2

5

Right now I just have an aggregate of how many days a user has worked. I'm trying to change this query to most continuous days worked.

Where u12345 would be 4 and u1 would be 2.

Is this possible to do with a BigQuery statement?

EDIT I am Kind of close with the following query but my u1 is getting 3 instead of 2.

SELECT MIN(e.timestamp) as date_created, e.uid, COUNT(e.uid) + 1 AS streak
FROM OnSite e
LEFT JOIN OnSite ee 
  ON e.uid = ee.uid 
AND DATE(e.timestamp) = DATE(DATE_ADD(ee.timestamp, INTERVAL -1 DAY))
WHERE ee.uid IS NOT NULL
GROUP BY e.uid;

Schema (MySQL v5.7)

CREATE TABLE OnSite
    (`uid` varchar(55), `worksite_id`  varchar(55), `timestamp` datetime)
;

INSERT INTO OnSite
    (`uid`, `worksite_id`, `timestamp`)
VALUES
  ("u12345", "worksite_1", '2019-01-01'),
  ("u12345", "worksite_1", '2019-01-02'),
  ("u12345", "worksite_1", '2019-01-03'),
  ("u12345", "worksite_1", '2019-01-04'),
  ("u12345", "worksite_1", '2019-01-06'),
  ("u1", "worksite_1", '2019-01-01'),
  ("u1", "worksite_1", '2019-01-02'),
  ("u1", "worksite_1", '2019-01-05'),
  ("u1", "worksite_1", '2019-01-06')

;

Query #1

SELECT    uid, COUNT(DISTINCT timestamp) Total
FROM      OnSite
GROUP BY  uid;

| uid    | Total |
| ------ | ----- |
| u1     | 4     |
| u12345 | 5     |

View on DB Fiddle

Kwang answered 16/1, 2019 at 21:44 Comment(11)
Maybe you need to process data after the query. I don't know query to do that.Heavensent
@Cochu I added an edit that I'm getting close but I'm one number offKwang
u12345 would be 3, not 4 because "2019-01-04" doesn't have next dayHeavensent
The thought is that 2019-01-01 to 2019-01-04 is 4 consecutive days. So it would be 4 @cochuKwang
I think you will either need a stored procedure or some application code as to do this requires looping through all the data. The reason COUNT doesn't work is because your u1 has more than one streak, and days from both streaks are being counted in your result.Myramyrah
@Myramyrah hmm, well my current result is just counting total days all together. But I see what you mean. Would it be possible to just list both streaks? It would suck to get 5 years worth of data for every single user and code a consecutive day streakKwang
please check my answer. It gives you expected result with no use of stored proceduresBallast
It does seem @Simonare has proved me wrong as long as you are using MySQL 8Myramyrah
It appears so, unfortunately this was for BigQuery and it doesn't like it. But I never mentioned it @MyramyrahKwang
@Kwang If the answer doesn't actually solve your problem you should leave the question open so you can possibly get a solution which does work for you (you can just upvote the answer instead). I've added the bigquery tag to your question so that answers should be limited to code that works there.Myramyrah
@Myramyrah I appreciate that nick, I've left the question openKwang
D
9

Below is for BigQuery Standard SQL

In case if you are interested in max consecutive days of the users on the same worksite:

#standardSQL
SELECT uid, MAX(consecuitive_days) max_consecuitive_days
FROM (
  SELECT uid, grp, COUNT(1) consecuitive_days
  FROM (
    SELECT uid, 
      COUNTIF(step > 1) OVER(PARTITION BY uid, worksite_id ORDER BY ts) grp
    FROM (
      SELECT uid, worksite_id, ts, 
        DATE_DIFF(ts, LAG(ts) OVER(PARTITION BY uid, worksite_id ORDER BY ts), DAY) step 
      FROM `project.dataset.table`
    )
  ) GROUP BY uid, grp
) GROUP BY uid  

In case if worksite does not matter and you are looking just for max consecutive days:

#standardSQL
SELECT uid, MAX(consecuitive_days) max_consecuitive_days
FROM (
  SELECT uid, grp, COUNT(1) consecuitive_days
  FROM (
    SELECT uid, 
      COUNTIF(step > 1) OVER(PARTITION BY uid ORDER BY ts) grp
    FROM (
      SELECT uid, ts, 
        DATE_DIFF(ts, LAG(ts) OVER(PARTITION BY uid ORDER BY ts), DAY) step 
      FROM `project.dataset.table`
    )
  ) GROUP BY uid, grp
) GROUP BY uid  

You can test, play any of above with he sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'u12345' uid, 'worksite_1' worksite_id, DATE '2019-01-01' ts UNION ALL
  SELECT 'u12345', 'worksite_1', '2019-01-02' UNION ALL
  SELECT 'u12345', 'worksite_1', '2019-01-03' UNION ALL
  SELECT 'u12345', 'worksite_1', '2019-01-04' UNION ALL
  SELECT 'u12345', 'worksite_1', '2019-01-06' UNION ALL
  SELECT 'u1', 'worksite_1', '2019-01-01' UNION ALL
  SELECT 'u1', 'worksite_1', '2019-01-02' UNION ALL
  SELECT 'u1', 'worksite_1', '2019-01-05' UNION ALL
  SELECT 'u1', 'worksite_1', '2019-01-06' 
)
SELECT uid, MAX(consecuitive_days) max_consecuitive_days
FROM (
  SELECT uid, grp, COUNT(1) consecuitive_days
  FROM (
    SELECT uid, 
      COUNTIF(step > 1) OVER(PARTITION BY uid ORDER BY ts) grp
    FROM (
      SELECT uid, ts, 
        DATE_DIFF(ts, LAG(ts) OVER(PARTITION BY uid ORDER BY ts), DAY) step 
      FROM `project.dataset.table`
    )
  ) GROUP BY uid, grp
) GROUP BY uid   

with result:

Row uid     max_consecuitive_days    
1   u12345  4    
2   u1      2    
Drape answered 17/1, 2019 at 5:11 Comment(3)
Very elegant solution. This improved my answer as well.Ballast
Thank you Mikhail! This works exactly as expected. I did have a question about what you meant by for the users on the same worksite bit. I'm having trouble understanding what is going on here. It works a bit differently if I were to add FROM project.dataset.table WHERE worksite_id = 'x' (which might be the totally wrong way of focusing on one worksite)Kwang
what I meant is - you can look for Longest Consecutive Days Count for each user no matter which worksite that user was working on. But another option would be to look for Longest Consecutive Days Count for each user with extra condition of being on the same worksite during those days. I do not know nature of your data and meaning of worksite - so provided you with two versions of query for respective options. hope this answers your follow up question :o)Drape
L
4

does this fit for you?

set @gr=1;
select uid, max(cnt) max_cnt from (
  select uid, grp, count(*) cnt from (
    select uid, 
      case when ifnull(DATE_ADD(oldDate, INTERVAL 1 DAY), timestamp)= timestamp then 
        @gr 
      else  
        @gr := @gr +1 
      end grp
    from
    (
        SELECT    
          uid, 
          timestamp, 
          lag(timestamp) over (partition by uid order by timestamp asc) as oldDate
        FROM      OnSite
    ) t
  )t2
  group by uid, grp
  )t3
group by uid

Result

| uid    | max_cnt |
| ------ | ------- |
| u1     | 2       |
| u12345 | 4       |

DB Fiddle

Liverish answered 16/1, 2019 at 22:4 Comment(9)
I think there is some errors in the syntax, it's not running for me in the DB FiddleKwang
I added my Db FiddleBallast
Can you check the link? I'm just seeing my first query thereKwang
Yeah, your DB Fiddle only make a group byHeavensent
I am sorry, I forgot to fork :) I updated the link nowBallast
I dont think this is giving me the right data back. u12345 would be 4 and u1 would be 2.Kwang
Please check my answer. It now gives you expected resultBallast
Awesome work Simonare, it works as expected! Thanks for looking into this. Unfortunately doesn't work for my BigQuery purposes but I never said anything about that to you so I've marked it as correctKwang
Still is a good answer even if it turns out not to be of use to OP.Myramyrah

© 2022 - 2024 — McMap. All rights reserved.