How to group consecutive rows by a non-unique value

Asked 16/6, 2015 at 20:50 Answered 16/6, 2015 at 23:50

Solved sql postgresql greatest-n-per-group window-functions gaps-and-islands

I have data like this:

table1
_____________
id way time
1  1   00:01
2  1   00:02
3  2   00:03
4  2   00:04
5  2   00:05
6  3   00:06
7  3   00:07
8  1   00:08
9  1   00:09

I would like to know in which time interval I was on which way:

desired output
_________________
id  way from   to    
1   1   00:01  00:02
3   2   00:03  00:05
6   3   00:06  00:07
8   1   00:08  00:09

I tried to use a window function:

SELECT DISTINCT
  first_value(id) OVER w AS id,
  first_value(way) OVER w AS way,
  first_value(time) OVER w AS "from",
  last_value(time) OVER w AS "to"
FROM table1
WINDOW w AS (
  PARTITION BY way ORDER BY id
  RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

What I get is:

ID  way from   to    
 1   1  00:01  00:09
 3   2  00:03  00:05
 6   3  00:06  00:07

This is not correct, because I was not on way 1 for the whole time from 00:01 to 00:09. Is it possible to partition according to the row order, that is, to group only consecutive rows that have the same value?

Mutualism answered 16/6, 2015 at 20:50 Comment(4)

How is way 2 00:03 - 00:05 and way 3 00:06-00:07 ? This is very confusing. – Cloots 16/6, 2015 at 20:59

It was wrong, I fixed it. Thx. – Mutualism 16/6, 2015 at 21:7

You make it seem like id and time both would be strictly ascending in parallel. Is that so? Are you sure? If id is a serial column, that's most probably not always the case. This would mean that the minimum id and minimum time for one time slice could be in different rows. What should be in the result then? – Hurl 16/6, 2015 at 22:56

No, they are not. The id is strictly ascending in the order I drove through the ways. However, the time doesn't need to be unique: some ways have the same start and end time (the end time is basically the start time of the next way, not shown in the example), and theoretically the start time of the following line could be earlier (and so could the end time). I got the data from a map match of GPS tracks to an OpenStreetMap network. The ways are in the right order. However, I assigned the timestamp to the edge by joining the nearest-neighbor GPS point, which could lead to errors. – Mutualism 17/6, 2015 at 7:30

If your case is as simple as your example suggests, @Giorgos' answer serves nicely.

However, that's typically not the case. With a serial id column you cannot assume that a row with an earlier time also has a smaller id.
Also, time values (timestamp, like you probably have) can easily contain duplicates, so you need to make the sort order unambiguous.

Assuming both can happen, and you want the id from the row with the earliest time per time slice (actually, the smallest id for the earliest time, there could be ties), this query would deal with the situation properly:

SELECT *
FROM  (
   SELECT DISTINCT ON (way, grp)
          id, way, time AS time_from
        , max(time) OVER (PARTITION BY way, grp) AS time_to
   FROM (
      SELECT *
           , row_number() OVER (ORDER BY time, id)  -- id as tie breaker
           - row_number() OVER (PARTITION BY way ORDER BY time, id) AS grp
      FROM   table1
      ) t
   ORDER  BY way, grp, time, id
   ) sub
ORDER  BY time_from, id;

ORDER BY time, id makes the sort order unambiguous. Since time may not be unique, the (assumed unique) id is added as a tie-breaker to avoid arbitrary results that could change between queries in sneaky ways.
max(time) OVER (PARTITION BY way, grp): without ORDER BY, the window frame spans all rows of the PARTITION, so we get the absolute maximum per time slice.
The outer query layer is only necessary to produce the desired sort order in the result, since we are bound to a different ORDER BY in the subquery sub by using DISTINCT ON. Details:
Select first row in each GROUP BY group?
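
To see what the inner subquery produces, here is a small sketch that runs the two row_number() calls against the sample rows from the question. The rows are inlined in a CTE so the statement runs on its own. The column name ts (instead of time) and the helper aliases rn_all and rn_way are assumptions made for illustration.

```sql
-- Sample rows from the question, inlined so the query is self-contained.
-- "ts" is an assumed column name standing in for the question's "time".
WITH table1 (id, way, ts) AS (
   VALUES
     (1, 1, time '00:01'), (2, 1, time '00:02')
   , (3, 2, time '00:03'), (4, 2, time '00:04'), (5, 2, time '00:05')
   , (6, 3, time '00:06'), (7, 3, time '00:07')
   , (8, 1, time '00:08'), (9, 1, time '00:09')
   )
SELECT id, way, ts
     , row_number() OVER (ORDER BY ts, id)                  AS rn_all  -- position over all rows
     , row_number() OVER (PARTITION BY way ORDER BY ts, id) AS rn_way  -- position within the same way
     , row_number() OVER (ORDER BY ts, id)
     - row_number() OVER (PARTITION BY way ORDER BY ts, id) AS grp     -- constant within one island
FROM   table1
ORDER  BY ts, id;
```

For these rows, grp comes out as 0, 0, 2, 2, 2, 5, 5, 5, 5 for ids 1 to 9. Way 3 (ids 6, 7) and the second stretch on way 1 (ids 8, 9) both get grp = 5, so grp only identifies a time slice in combination with way. That is why the query partitions by (way, grp).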

sqlfiddle (currently offline)

If you are looking to optimize performance, a PL/pgSQL function could be faster in such a case. See:

Group by repeating attribute
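
A minimal sketch of what such a function might look like. This is not the code from the linked answer. The function name f_way_slices, the column name ts, the column types and the NOT NULL constraints are all assumptions. It walks the rows once in (ts, id) order and emits one row whenever way changes.

```sql
-- Assumed schema: ts stands in for the question's time column; way is NOT NULL.
CREATE TEMP TABLE table1 (id int PRIMARY KEY, way int NOT NULL, ts time NOT NULL);

INSERT INTO table1 (id, way, ts) VALUES
  (1, 1, '00:01'), (2, 1, '00:02')
, (3, 2, '00:03'), (4, 2, '00:04'), (5, 2, '00:05')
, (6, 3, '00:06'), (7, 3, '00:07')
, (8, 1, '00:08'), (9, 1, '00:09');

-- Hypothetical function name; a single pass over the sorted rows.
CREATE OR REPLACE FUNCTION f_way_slices()
  RETURNS TABLE (id int, way int, time_from time, time_to time)
  LANGUAGE plpgsql AS
$func$
DECLARE
   r record;
BEGIN
   FOR r IN
      SELECT t.id, t.way, t.ts   -- table-qualified to avoid clashing with the OUT parameters
      FROM   table1 t
      ORDER  BY t.ts, t.id       -- id as tie-breaker
   LOOP
      IF r.way IS DISTINCT FROM way THEN   -- OUT parameter way is NULL before the first row
         IF way IS NOT NULL THEN
            RETURN NEXT;                   -- emit the slice that just ended
         END IF;
         id        := r.id;                -- start a new slice
         way       := r.way;
         time_from := r.ts;
      END IF;
      time_to := r.ts;                     -- rows arrive sorted by ts, so the latest row holds the maximum
   END LOOP;

   IF way IS NOT NULL THEN
      RETURN NEXT;                         -- emit the last open slice
   END IF;
END
$func$;

SELECT * FROM f_way_slices();
```

With the sample rows this returns four slices: (1, 1, 00:01, 00:02), (3, 2, 00:03, 00:05), (6, 3, 00:06, 00:07) and (8, 1, 00:08, 00:09). Whether it actually beats the window-function query depends on table size and indexes, so it is worth measuring with EXPLAIN ANALYZE on real data.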

Aside: don't use the basic type name time as identifier (also a reserved word in standard SQL).

Hurl answered 16/6, 2015 at 23:50 Comment(0)

I think you want something like this:

select min(id), way, 
       min(time), max(time)
from (
select id, way, time,
       ROW_NUMBER() OVER (ORDER BY id) - 
       ROW_NUMBER() OVER (PARTITION BY way ORDER BY time) AS grp
from table1 ) t
group by way, grp

grp identifies 'islands' of successive way values. Using this calculated field in an outer query, we can get start and end times of way intervals using MIN and MAX aggregate functions respectively.
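
As a quick check, the same idea can be run against the sample rows from the question. In this sketch the rows are inlined in a CTE. The column name ts (instead of time), the output aliases and the final ORDER BY are additions made for illustration. Like the query above, it relies on id and time rising together.

```sql
-- Sample rows from the question, inlined so the query is self-contained.
-- "ts" is an assumed column name standing in for the question's "time".
WITH table1 (id, way, ts) AS (
   VALUES
     (1, 1, time '00:01'), (2, 1, time '00:02')
   , (3, 2, time '00:03'), (4, 2, time '00:04'), (5, 2, time '00:05')
   , (6, 3, time '00:06'), (7, 3, time '00:07')
   , (8, 1, time '00:08'), (9, 1, time '00:09')
   )
SELECT min(id) AS id, way
     , min(ts) AS time_from
     , max(ts) AS time_to
FROM  (
   SELECT id, way, ts
        , row_number() OVER (ORDER BY id)
        - row_number() OVER (PARTITION BY way ORDER BY ts) AS grp  -- island marker
   FROM   table1
   ) t
GROUP  BY way, grp        -- grp alone is not unique across ways
ORDER  BY time_from;      -- added to get a deterministic row order
```

For the sample rows this yields the four intervals from the desired output: way 1 from 00:01 to 00:02, way 2 from 00:03 to 00:05, way 3 from 00:06 to 00:07, and way 1 again from 00:08 to 00:09.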

Demo here

Dimpledimwit answered 16/6, 2015 at 21:5 Comment(2)

@Nassim The OP wants to identify islands of successive way values. There are 4 of them in the sample data posted. Please have a look at the desired output, not at the "What I get" output. – Dimpledimwit 16/6, 2015 at 21:31

Yes, I misunderstood the question, so I deleted my answer. In that case your answer is more accurate. +1 – Vacillate 16/6, 2015 at 21:45
