PostgreSQL - GROUP BY 10 minutes based in each row
Asked Answered
S

3

6

I have a problem difficult to solve, I think you can help. I have a table with millions of records in which precise group every 10 minutes, based on the registry value, for example:

Record "01 | 2011/01/03 19:18:00.300" the time it needs to count the records is 19:18:00.300 to 19:28:00.299. With this process it will group records 01,02,03.

Record "04 | 2011/01/03 19:29:54.289" the time it needs to count the records is 19:29:54.289 to 19:39:54.288. With this process it will group records only the record 04.

Record "05 | 2011/01/04 14:43:43.067", the time he needs to count the records is 14:43:43.067 to 14:43:53.066. With this process it will group records 05,06,07.

Record "08 | 2011/01/04 14:57:55.608;" the time it needs to count the records is 14:57:55.608 to 15:07:55.607. With this process it will group records 08,09,10,11,12,13,14,15.

Input data:

ID   TS
01   2011/01/03 19:18:00.300
02   2011/01/03 19:18:00.503
03   2011/01/03 19:20:26.335
04   2011/01/03 19:29:54.289
05   2011/01/04 14:43:43.067
06   2011/01/04 14:50:10.727
07   2011/01/04 14:52:26.827
08   2011/01/04 14:57:55.608
09   2011/01/04 14:57:55.718
10   2011/01/04 14:59:13.603
11   2011/01/04 15:00:34.260
12   2011/01/04 15:02:55.687
13   2011/01/04 15:04:51.917
14   2011/01/04 15:06:24.760
15   2011/01/04 15:07:15.378

Output data:

ID  TS   Count
01   2011/01/03 19:18:00.300    3
02   2011/01/03 19:29:54.289    1
03   2011/01/04 14:43:43.067    3
04   2011/01/04 14:57:55.608    8

Does anyone have a solution to this problem? Already, grateful for the attention.

Sclerometer answered 1/7, 2015 at 11:46 Comment(1)
you want a sub select. if you give a table name and column names, I guess people would help you with queryCarbon
H
1

This may be a bit sub-optimal, but it works. The recursive query detects the start- and stop- times of the intervals; the count(*) scalar subquery counts the number of original records within each interval.

WITH RECURSIVE rr AS (
        SELECT 1::integer AS num
                , MIN(tscol) AS starter
                , MIN(tscol) + '10 min'::INTERVAL AS stopper
        FROM your_table
        UNION ALL
        SELECT
                1+rr.num AS num
                , tscol AS starter
                , tscol + '10 min'::INTERVAL AS stopper
        FROM your_table yt
        JOIN rr ON yt.tscol > rr.stopper
                AND NOT EXISTS ( SELECT *
                  FROM your_table nx
                  WHERE nx.tscol > rr.stopper
                  AND nx.tscol < yt.tscol
                )
        )
SELECT num,starter,stopper
        , (SELECT COUNT(*) FROM your_table yt
                WHERE yt.tscol BETWEEN rr.starter AND rr.stopper
        ) AS cnt
FROM rr
        ;
Honey answered 1/7, 2015 at 14:13 Comment(1)
Perfect. You are very good! Perfectly met my need. includes only GROUP BY num,starter,stopper ORDER BY num. Again, thank you for helpSclerometer
I
13

I have a table with millions of records in which precise group every 10 minutes

tl;dr: for the impatient ones, see the last query in the answer, which is the real solution, the others are step by step on how to get there. Also, all queries + schemas are available at SQLFiddle, for those who want to play with.

The best solution for such problem, in my opinion, is to truncate each timestamp into the start of its 10 minutes, for instance, let's try to make the following conversion (original -> 10 minutes truncated):

13:10 -> 13:10
13:15 -> 13:10
13:18 -> 13:10
13:20 -> 13:20
...

If any one want to try the following queries, you can create the schema as:

CREATE TABLE your_table(tscol timestamptz);
INSERT INTO your_table VALUES
('2011/01/03 19:18:00.300'),
('2011/01/03 19:18:00.503'),
('2011/01/03 19:20:26.335'),
('2011/01/03 19:29:54.289'),
('2011/01/04 14:43:43.067'),
('2011/01/04 14:50:10.727'),
('2011/01/04 14:52:26.827'),
('2011/01/04 14:57:55.608'),
('2011/01/04 14:57:55.718'),
('2011/01/04 14:59:13.603'),
('2011/01/04 15:00:34.260'),
('2011/01/04 15:02:55.687'),
('2011/01/04 15:07:15.378');

So, in order to do that, we need to understand date_trunc and date_part functions (the latter can be invoked by the standard EXTRACT) and interval data type. Let's build the solution step by step, the final idea is to have something like this (now a pseudo-code):

SELECT truncate_the_time_by_10_minutes(tscol) AS trunc10, count(*)
FROM your_table
GROUP BY trunc10
ORDER BY trunc10;

Now, if the problem was "aggregate by minute", then we could simple truncate the timestamp to the minute, which simple means zeroing seconds and microsseconds, which is exactly what date_trunc('minute', ...) does, so:

SELECT date_trunc('minute', tscol) AS trunc_minute, count(*)
FROM your_table
GROUP BY trunc_minute
ORDER BY trunc_minute;

Works, but it is not what you want, the next capability of date_trun is with 'hour', which would already loose the information we need, so we need something between 'minute' and 'hour'. Let's see how the above query works with some examples:

SELECT tscol, date_trunc('minute', tscol) AS trunc_minute
FROM your_table
ORDER BY tscol;

Which returns:

           tscol            |      trunc_minute      
----------------------------+------------------------
 2011-01-03 19:18:00.3-02   | 2011-01-03 19:18:00-02
 2011-01-03 19:18:00.503-02 | 2011-01-03 19:18:00-02
 2011-01-03 19:20:26.335-02 | 2011-01-03 19:20:00-02
 2011-01-03 19:29:54.289-02 | 2011-01-03 19:29:00-02
...

If you see 2011-01-03 19:18:00-02, now we just need to subtract 8 minutes, to do that we can:

  1. EXTRACT(MINUTE FROM tscol) will return 18
  2. As we want to truncate by 10 minutes, let's take the modulo of 18 and 10, so 18 % 10 which give us 8
  3. Now, we have the 8 minutes that we want to subtract, but as an integer, and to subtract from timestamp[tz] we need an interval, as the integer is representing minute, we can do: 8 * interval '1 minute', which will give us 00:08:00

Getting the 3 steps above in the last query, we have (I'll show each column to better understanding):

SELECT
    tscol,
    date_trunc('minute', tscol) AS trunc_minute,
    CAST(EXTRACT(MINUTE FROM tscol) AS integer) % 10 AS min_to_subtract,
    (CAST(EXTRACT(MINUTE FROM tscol) AS integer) % 10) * interval '1 minute' AS interval_to_subtract,
    date_trunc('minute', tscol) - (CAST(EXTRACT(MINUTE FROM tscol) AS integer) % 10) * interval '1 minute' AS solution
FROM your_table
ORDER BY tscol;

Which returns:

           tscol            |      trunc_minute      | min_to_subtract | interval_to_subtract |        solution        
----------------------------+------------------------+-----------------+----------------------+------------------------
 2011-01-03 19:18:00.3-02   | 2011-01-03 19:18:00-02 |               8 | 00:08:00             | 2011-01-03 19:10:00-02
 2011-01-03 19:18:00.503-02 | 2011-01-03 19:18:00-02 |               8 | 00:08:00             | 2011-01-03 19:10:00-02
 2011-01-03 19:20:26.335-02 | 2011-01-03 19:20:00-02 |               0 | 00:00:00             | 2011-01-03 19:20:00-02
 2011-01-03 19:29:54.289-02 | 2011-01-03 19:29:00-02 |               9 | 00:09:00             | 2011-01-03 19:20:00-02
...

Now, the last column is the solution we want, the timestamp truncated to its 10 minutes group, now we can simple aggregate and have our final solution:

SELECT
    date_trunc('minute', tscol) - (CAST(EXTRACT(MINUTE FROM tscol) AS integer) % 10) * interval '1 minute' AS trunc_10_minute,
    count(*)
FROM your_table
GROUP BY trunc_10_minute
ORDER BY trunc_10_minute;

Which returns:

    trunc_10_minute     | count 
------------------------+-------
 2011-01-03 19:10:00-02 |     2
 2011-01-03 19:20:00-02 |     2
 2011-01-04 14:40:00-02 |     1
 2011-01-04 14:50:00-02 |     5
 2011-01-04 15:00:00-02 |     5
(5 rows)

That is the exactly output you gave, but I believe it is what you actually expect, if not it is just a matter of small adjustment.

Implausibility answered 1/7, 2015 at 13:10 Comment(1)
Thank MatheusOl, Moreover it is not exactly what I need. Need group data by each line time interval, e.g., line 01 (2011/01/03 19:18:00.300) must add 10 minutes and collect all the line that are within this time interval, or , all records that are between 19:18:00.300 and 19:28:00.299. The result is 03 records.Sclerometer
H
1

This may be a bit sub-optimal, but it works. The recursive query detects the start- and stop- times of the intervals; the count(*) scalar subquery counts the number of original records within each interval.

WITH RECURSIVE rr AS (
        SELECT 1::integer AS num
                , MIN(tscol) AS starter
                , MIN(tscol) + '10 min'::INTERVAL AS stopper
        FROM your_table
        UNION ALL
        SELECT
                1+rr.num AS num
                , tscol AS starter
                , tscol + '10 min'::INTERVAL AS stopper
        FROM your_table yt
        JOIN rr ON yt.tscol > rr.stopper
                AND NOT EXISTS ( SELECT *
                  FROM your_table nx
                  WHERE nx.tscol > rr.stopper
                  AND nx.tscol < yt.tscol
                )
        )
SELECT num,starter,stopper
        , (SELECT COUNT(*) FROM your_table yt
                WHERE yt.tscol BETWEEN rr.starter AND rr.stopper
        ) AS cnt
FROM rr
        ;
Honey answered 1/7, 2015 at 14:13 Comment(1)
Perfect. You are very good! Perfectly met my need. includes only GROUP BY num,starter,stopper ORDER BY num. Again, thank you for helpSclerometer
S
0

Dynamic bounds (question asked)

The bounds of each group depend on the previous group. So it does not lend itself to set-based solutions. We have to walk the table in order to determine bounds.

WITH RECURSIVE cte AS (
   SELECT min(ts) AS ts
   FROM   tbl
   
   UNION ALL
   (
   SELECT t.ts
   FROM   cte c
   JOIN   tbl t ON t.ts > c.ts + interval '10 min'
   ORDER  BY t.ts
   LIMIT  1
   )
   )
SELECT row_number() OVER (ORDER BY c.ts) AS id  -- seems optional
     , c.ts, count(*)
FROM   cte c
JOIN   tbl t ON t.ts >= c.ts
            AND t.ts <= c.ts + interval '10 min'
GROUP  BY c.ts;

fiddle

Be sure to have a plain index on the timestamp column.
Basically another application of an emulated index skip scan. See:

wildplasser provided a working solution, but this is simpler and faster.

Fixed 10-min grid

Much simpler.

-- simpler (different!) solution with fixed 10-min slots using date_bin()
SELECT date_bin('10 min', ts, '2011-1-1') AS ts, count(*)
FROM   tbl
GROUP  BY 1
ORDER  BY 1;

fiddle

No index required. Results in a single sequential scan, hence faster.

date_bin() requires Postgres 14 or newer. See:

MatheusOl provided a working solution for older Postgres versions.

Spenser answered 6/11, 2023 at 7:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.