How to group timestamps into islands (based on arbitrary gap)?
Asked Answered
H

2

6

Consider this list of dates as timestamptz:

Postgres grouping dates

I grouped the dates by hand using colors: every group is separated from the next by a gap of at least 2 minutes.

I'm trying to measure how much a given user studied, by looking at when they performed an action (the data is when they finished studying a sentence.) e.g.: on the yellow block, I'd consider the user studied in one sitting, from 14:24 till 14:27, or roughly 3 minutes in a row.

I see how I could group these dates with a programming language by going through all of the dates and looking for the gap between two rows.

My question is: how would go about grouping dates in this way with Postgres?

(Looking for 'gaps' on Google or SO brings too many irrelevant results; I think I'm missing the vocabulary for what I'm trying to do here.)

Hoodlum answered 7/3, 2019 at 14:9 Comment(5)
have you looked at window functions? You can calculate the gap for each row using lead or lag postgresql.org/docs/8.4/functions-window.html.Agama
you may find this question and answers helpful #34339491.Simplicidentate
What you are looking for is gaps-and-islands.Marxmarxian
Thanks, I'm learning a lot. It's not easy to find basic examples of all three concepts (window function for dates / lag() / islands), but I'll reply to my own answer when I do.Hoodlum
Your problem is a classic SQL problem called "gaps and islands" writing it yourself is an exercise in algorithms because it involves SQL partitions over lagging rolling windows, best to copy and paste and unit test: https://mcmap.net/q/743349/-postgres-consecutive-days-gaps-and-islands-tabibitosanSerles
W
4
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
FROM  (
   SELECT done
        , lag(done) OVER (ORDER BY done) <= done - interval '2 min' AS step
   FROM   tbl
   ) sub
ORDER  BY done;

The subquery sub returns step = true if the previous row is at least 2 min away - sorted by the timestamp column done itself in this case.

The outer query adds a rolling count of steps, effectively the group number (grp) - combining the aggregate FILTER clause with another window function.

fiddle

Related:

About the aggregate FILTER clause:

Wendelina answered 8/3, 2019 at 14:16 Comment(0)
H
0

Building up on Erwin's answer, here is the full query for tallying up the amount of time people spent on those sessions/islands:

My data only shows when people finished reviewing something, not when they started, which means we don't know when a session truly started; and some islands only have one timestamp in them (leading to a 0-duration estimate.) I'm accounting for both by calculating the average review time and adding it to the total duration of islands.

This is likely very idiosyncratic to my use case, but I learned a thing or two in the process, so maybe this will help someone down the line.

-- Returns estimated total study time and average time per review, both in seconds
SELECT (EXTRACT( EPOCH FROM logged) + countofislands * avgreviewtime) as totalstudytime, avgreviewtime -- add total logged time to estimate for first-review-in-island and 1-review islands
FROM
    (
    SELECT -- get the three key values that will let us calculate total time spent
      sum(duration) as logged
      , count(island) as countofislands
      , EXTRACT( EPOCH FROM sum(duration) FILTER (WHERE duration != '00:00:00'::interval) )/( sum(reviews) FILTER (WHERE duration != '00:00:00'::interval) - count(reviews) FILTER (WHERE duration != '00:00:00'::interval))  as avgreviewtime
    FROM
        (
        SELECT island, age( max(done), min(done) ) as duration, count(island) as reviews -- calculate the duration of islands
        FROM
            (
            SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS island -- give a unique number to each island
            FROM (
                SELECT -- detect the beginning of islands
                    done,
                    (
                        lag(done) OVER (ORDER BY done) <= done - interval '2 min'
                    ) AS step
                FROM review
                WHERE clicker_id = 71 AND "done" > '2015-05-13' AND "done" < '2015-05-13 15:00:00' -- keep the queries small and fast for now
               ) sub
            ORDER BY done
            ) grouped
        GROUP BY island
        ) sessions
    ) summary
Hoodlum answered 12/3, 2019 at 9:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.