MySQL: group by consecutive days and count groups
Asked Answered
B

5

9

I have a database table which holds each user's checkins in cities. I need to know how many days a user has been in a city, and then, how many visits a user has made to a city (a visit consists of consecutive days spent in a city).

So, consider I have the following table (simplified, containing only the DATETIMEs - same user and city):

      datetime
-------------------
2011-06-30 12:11:46
2011-07-01 13:16:34
2011-07-01 15:22:45
2011-07-01 22:35:00
2011-07-02 13:45:12
2011-08-01 00:11:45
2011-08-05 17:14:34
2011-08-05 18:11:46
2011-08-06 20:22:12

The number of days this user has been to this city would be 6 (30.06, 01.07, 02.07, 01.08, 05.08, 06.08).

I thought of doing this using SELECT COUNT(id) FROM table GROUP BY DATE(datetime)

Then, for the number of visits this user has made to this city, the query should return 3 (30.06-02.07, 01.08, 05.08-06.08).

The problem is that I have no idea how shall I build this query.

Any help would be highly appreciated!

Boater answered 17/8, 2011 at 13:32 Comment(0)
R
12

You can find the first day of each visit by finding checkins where there was no checkin the day before.

select count(distinct date(start_of_visit.datetime))
from checkin start_of_visit
left join checkin previous_day
    on start_of_visit.user = previous_day.user
    and start_of_visit.city = previous_day.city
    and date(start_of_visit.datetime) - interval 1 day = date(previous_day.datetime)
where previous_day.id is null

There are several important parts to this query.

First, each checkin is joined to any checkin from the previous day. But since it's an outer join, if there was no checkin the previous day the right side of the join will have NULL results. The WHERE filtering happens after the join, so it keeps only those checkins from the left side where there are none from the right side. LEFT OUTER JOIN/WHERE IS NULL is really handy for finding where things aren't.

Then it counts distinct checkin dates to make sure it doesn't double-count if the user checked in multiple times on the first day of the visit. (I actually added that part on edit, when I spotted the possible error.)

Edit: I just re-read your proposed query for the first question. Your query would get you the number of checkins on a given date, instead of a count of dates. I think you want something like this instead:

select count(distinct date(datetime))
from checkin
where user='some user' and city='some city'
Recommit answered 17/8, 2011 at 13:56 Comment(3)
Regarding the first aspect...I cannot seem to completely understand your suggestion...Is it possible to give some more details? Thank you! Regarding the second one, my query is right, provided that you don't count the user and the city, as mentioned in my question.Boater
Sorry, I assumed that the result for "how many days a user has been in a city" should look like (user_id, count_of_days).Recommit
Thank you for the details. With several adjustments to fit my actual database table, your query works like a charm. Thank you again!Boater
F
3

Try to apply this code to your task -

CREATE TABLE visits(
  user_id INT(11) NOT NULL,
  dt DATETIME DEFAULT NULL
);

INSERT INTO visits VALUES 
  (1, '2011-06-30 12:11:46'),
  (1, '2011-07-01 13:16:34'),
  (1, '2011-07-01 15:22:45'),
  (1, '2011-07-01 22:35:00'),
  (1, '2011-07-02 13:45:12'),
  (1, '2011-08-01 00:11:45'),
  (1, '2011-08-05 17:14:34'),
  (1, '2011-08-05 18:11:46'),
  (1, '2011-08-06 20:22:12'),
  (2, '2011-08-30 16:13:34'),
  (2, '2011-08-31 16:13:41');


SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;

SELECT v.user_id,
  COUNT(DISTINCT(DATE(dt))) number_of_days,
  MAX(days) number_of_visits
FROM
  (SELECT user_id, dt
        @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS days,
        @last_dt := DATE(dt),
        @last_user := user_id
   FROM
     visits
   ORDER BY
     user_id, dt
  ) v
GROUP BY
  v.user_id;

----------------
Output:

+---------+----------------+------------------+
| user_id | number_of_days | number_of_visits |
+---------+----------------+------------------+
|       1 |              6 |                3 |
|       2 |              2 |                1 |
+---------+----------------+------------------+

Explanation:

To understand how it works let's check the subquery, here it is.

SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;


SELECT user_id, dt,
        @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS 

days,
        @last_dt := DATE(dt) lt,
        @last_user := user_id lu
FROM
  visits
ORDER BY
  user_id, dt;

As you see the query returns all rows and performs ranking for the number of visits. This is known ranking method based on variables, note that rows are ordered by user and date fields. This query calculates user visits, and outputs next data set where days column provides rank for the number of visits -

+---------+---------------------+------+------------+----+
| user_id | dt                  | days | lt         | lu |
+---------+---------------------+------+------------+----+
|       1 | 2011-06-30 12:11:46 |    1 | 2011-06-30 |  1 |
|       1 | 2011-07-01 13:16:34 |    1 | 2011-07-01 |  1 |
|       1 | 2011-07-01 15:22:45 |    1 | 2011-07-01 |  1 |
|       1 | 2011-07-01 22:35:00 |    1 | 2011-07-01 |  1 |
|       1 | 2011-07-02 13:45:12 |    1 | 2011-07-02 |  1 |
|       1 | 2011-08-01 00:11:45 |    2 | 2011-08-01 |  1 |
|       1 | 2011-08-05 17:14:34 |    3 | 2011-08-05 |  1 |
|       1 | 2011-08-05 18:11:46 |    3 | 2011-08-05 |  1 |
|       1 | 2011-08-06 20:22:12 |    3 | 2011-08-06 |  1 |
|       2 | 2011-08-30 16:13:34 |    1 | 2011-08-30 |  2 |
|       2 | 2011-08-31 16:13:41 |    1 | 2011-08-31 |  2 |
+---------+---------------------+------+------------+----+

Then we group this data set by user and use aggregate functions: 'COUNT(DISTINCT(DATE(dt)))' - counts the number of days 'MAX(days)' - the number of visits, it is a maximum value for the days field from our subquery.

That is all;)

Fr answered 31/8, 2011 at 13:38 Comment(2)
It seems pretty complicated...can you please give some more details on your code? Would appreciate!Boater
Thank you for the details. It is quite sad that I can't give the bounty to two answers. However, I chose the other answer as the query is a little bit simpler. I am really sorry and I want to thank you again for your answer!Boater
S
1

As data sample provided by Devart, the inner "PreQuery" works with sql variables. By defaulting the @LUser to a -1 (probable non-existent user ID), the IF() test checks for any difference between last user and current. As soon as a new user, it gets a value of 1... Additionally, if the last date is more than 1 day from the new date of check-in, it gets a value of 1. Then, the subsequent columns reset the @LUser and @LDate to the value of the incoming record just tested against for the next cycle. Then, the outer query just sums them up and counts them for the final correct results per the Devart data set of

User ID    Distinct Visits   Total Days
1           3                 9
2           1                 2

select PreQuery.User_ID,
       sum( PreQuery.NextVisit ) as DistinctVisits,
       count(*) as TotalDays
   from
      (  select v.user_id,
               if( @LUser <> v.User_ID OR @LDate < ( date( v.dt ) - Interval 1 day ), 1, 0 ) as NextVisit,
               @LUser := v.user_id,
               @LDate := date( v.dt )
            from 
               Visits v,
               ( select @LUser := -1, @LDate := date(now()) ) AtVars 
            order by
               v.user_id,
               v.dt  ) PreQuery
    group by 
       PreQuery.User_ID
Solitta answered 2/9, 2011 at 13:46 Comment(2)
Thank you for your answer and for clarifying it!Boater
Glad to help... did it get the exact solution you needed (thus the inclusion of the user ID info too, to help).Solitta
S
0

for a first sub-task:

select count(*) 
from (
select TO_DAYS(p.d)
from p
group by TO_DAYS(p.d)
) t
Straley answered 17/8, 2011 at 13:53 Comment(0)
A
0

I think you should consider changing database structure. You could add table visits and visit_id into your checkins table. Each time you want to register new checkin you check if there is any checkin a day back. If yes then you add a new checkin with visit_id from yesterday's checkin. If not then you add new visit to visits and new checkin with new visit_id.

Then you could get you data in one query with something like that: SELECT COUNT(id) AS number_of_days, COUNT(DISTINCT visit_id) number_of_visits FROM checkin GROUP BY user, city

It's not very optimal but still better than doing anything with current structure and it will work. Also if results can be separate queries it will work very fast.

But of course drawbacks are you will need to change database structure, do some more scripting and convert current data to new structure (i.e. you will need to add visit_id to current data).

Alf answered 30/8, 2011 at 19:5 Comment(1)
Thank you for your answer, but I would like to stick to my current database structure, at least for now. Also I will need to do some further operations when inserting, as a day may have multiple check-ins, so it isn't so simple with the "check if there is any checkin a day back". This kind of data manipulation can also be made in PHP with the provided database structure, but I was looking for a query to do this job, as it is more clean and convenient.Boater

© 2022 - 2024 — McMap. All rights reserved.