Partially denormalising unicorn observations

There are a number of researchers observing the world's last remaining unicorns, on Easter Island¹. Each day the researchers record which unicorn they sighted, the date of the sighting, the number of babies each unicorn has and whether they were drunk when the sighting took place. These are individually uploaded to a central location, which then spits out a flat file to me of all new observations each day.

I have a table that looks like this to contain the information:

create table unicorn_observations (
     observer_id number not null
   , unicorn_id number not null
   , created date not null -- date the record was inserted into the database
   , lastseen date not null -- date the record was last seen
   , observation_date date not null
   , no_of_babies number not null
   , drunk varchar2(1) not null
   , constraint pk_uo primary key ( observer_id, unicorn_id, created )
   , constraint chk_uo_babies check ( no_of_babies >= 0 )
   , constraint chk_uo_drunk check ( drunk in ('y','n') )
     );

The table is separately unique on observer_id, unicorn_id and observation_date or lastseen.

Sometimes the Cobold [sic] managing the output of data gets it slightly wrong and re-outputs the same data twice. In this situation I update the lastseen instead of creating a new record. I only do this in situations where every column is the same.

Unfortunately, the researchers aren't fully aware of the third normal form. Each month they upload the previous month's observations for a few unicorns, even if no new observations have been made. They do this with a new observation_date, which means a new record gets inserted into the table.

I have a separate created and lastseen for full traceability as the researchers sometimes submit some observations late. These are created by the database and are not part of the submitted information.

Here is some sample data (with partially changed column names in order to make it fit without a scroll bar).

+--------+--------+-----------+-----------+-----------+---------+-------+
| OBS_ID | UNI_ID |  CREATED  | LASTSEEN  | OBS_DATE  | #BABIES | DRUNK |
+--------+--------+-----------+-----------+-----------+---------+-------+
|      1 |      1 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 |      10 | n     |
|      1 |      2 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 |      10 | n     |
|      1 |      3 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 |      10 | n     |
|      1 |      6 | 10-NOV-11 | 10-NOV-11 | 07-NOV-11 |       0 | n     |
|      1 |      1 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     |
|      1 |      2 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     |
|      1 |      3 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     |
|      1 |      6 | 17-NOV-11 | 17-NOV-11 | 17-NOV-11 |       0 | n     |
|      1 |      6 | 01-DEC-11 | 01-DEC-11 | 01-DEC-11 |       0 | n     |
|      1 |      6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 |       3 | n     |
|      1 |      6 | 01-FEB-12 | 01-FEB-12 | 01-FEB-12 |       0 | n     |
|      1 |      6 | 01-MAR-12 | 01-MAR-12 | 01-MAR-12 |       0 | n     |
|      1 |      6 | 01-APR-12 | 01-APR-12 | 01-APR-12 |       0 | n     |
|      1 |      1 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |
|      1 |      2 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |
|      1 |      3 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |
|      1 |      6 | 01-MAY-12 | 01-MAY-12 | 01-MAY-12 |       0 | n     |
+--------+--------+-----------+-----------+-----------+---------+-------+

I would like to partially denormalise these observations so that if a new record is received with the same observer_id, unicorn_id, no_of_babies and drunk (the payload) but with a newer observation_date I update a new column in the table, last_observation_date, instead of inserting a new record. I would still update the lastseen in this situation.

I need to do this as I have a number of complicated unicorn related queries that join to this table; the researchers upload old observations with new dates about 10m times a month and I receive approximately 9m genuinely new records a month. I've been running for a year and already have 225m unicorn observations. As I only need to know the last observation date for each payload combination I would rather massively reduce the size of the table and save myself a lot of time full-scanning it.

This means that the table would become:

create table unicorn_observations (
     observer_id number not null
   , unicorn_id number not null
   , created date not null -- date the record was inserted into the database
   , lastseen date not null -- date the record was last seen
   , observation_date date not null
   , no_of_babies number not null
   , drunk varchar2(1) not null
   , last_observation_date date
   , constraint pk_uo primary key ( observer_id, unicorn_id, created )
   , constraint chk_uo_babies check ( no_of_babies >= 0 )
   , constraint chk_uo_drunk check ( drunk in ('y','n') )
     );

and the data stored in the table would look like the below; it doesn't matter whether last_observation_date is null or not if the observation has only been "seen" once. I do not need help in loading the data, only in partially denormalising the current table to look like this.

+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
| OBS_ID | UNI_ID |  CREATED  | LASTSEEN  | OBS_DATE  | #BABIES | DRUNK | LAST_OBS_DT |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
|      1 |      6 | 10-NOV-11 | 01-DEC-11 | 07-NOV-11 |       0 | n     | 01-DEC-11   |
|      1 |      1 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      2 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      3 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 |       3 | n     |             |
|      1 |      6 | 01-FEB-12 | 01-MAY-12 | 01-FEB-12 |       0 | n     | 01-MAY-12   |
|      1 |      1 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |             |
|      1 |      2 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |             |
|      1 |      3 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |             |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+

The obvious answer

select observer_id as obs_id
     , unicorn_id as uni_id
     , min(created) as created
     , max(lastseen) as lastseen
     , min(observation_date) as obs_date
     , no_of_babies as "#BABIES"
     , drunk
     , max(observation_date) as last_obs_date
  from unicorn_observations
 group by observer_id
        , unicorn_id
        , no_of_babies
        , drunk

doesn't work as it ignores the single observation of 3 unicorn babies for unicorn 6 on the 1st January 2012; this in turn means that the lastseen for the record created on the 10th November is incorrect.

+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
| OBS_ID | UNI_ID |  CREATED  | LASTSEEN  | OBS_DATE  | #BABIES | DRUNK | LAST_OBS_DT |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
|      1 |      1 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      2 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      3 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      6 | 10-NOV-11 | 01-MAY-12 | 07-NOV-11 |       0 | n     | 01-MAY-12   |
|      1 |      6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 |       3 | n     | 01-JAN-12   |
|      1 |      1 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     | 19-APR-12   |
|      1 |      2 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     | 19-APR-12   |
|      1 |      3 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     | 19-APR-12   |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+

I do not currently see a way of doing this without some procedural logic, i.e. a loop. I would much rather avoid a loop in this situation as I would have to full-scan a 225m row table 260 times (the number of distinct created dates). Even using lag() and lead() would need to be recursive as there is an indeterminate number of observations per unicorn.

Is there a way of creating this data-set in a single SQL statement?

The table specification and sample data is also in a SQL Fiddle.

Attempted better explanation:

The problem is maintaining when something was true. On 01-Jan-2012 unicorn 6 had 3 babies.

Looking at just unicorn 6 in the "table" created by the GROUP BY; if I try to find the number of babies on the 1st of January I will get two records returned, which is a contradiction.

+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
| OBS_ID | UNI_ID |  CREATED  | LASTSEEN  | OBS_DATE  | #BABIES | DRUNK | LAST_OBS_DT |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
|      1 |      6 | 10-NOV-11 | 01-MAY-12 | 07-NOV-11 |       0 | n     | 01-MAY-12   |
|      1 |      6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 |       3 | n     | 01-JAN-12   |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+

However, I would want only one row, as in the second table. Here, for any point in time there is at most one "correct" value because the two periods of time where unicorn 6 had 0 babies have been separated into two rows by the day when it had 3.

+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
| OBS_ID | UNI_ID |  CREATED  | LASTSEEN  | OBS_DATE  | #BABIES | DRUNK | LAST_OBS_DT |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
|      1 |      6 | 10-NOV-11 | 01-DEC-11 | 07-NOV-11 |       0 | n     | 01-DEC-11   |
|      1 |      6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 |       3 | n     |             |
|      1 |      6 | 01-FEB-12 | 01-MAY-12 | 01-FEB-12 |       0 | n     | 01-MAY-12   |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
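As a plain statement of the rule (not the single SQL statement being asked for), the merge can be sketched procedurally in Python. This is only an illustration under some assumptions: rows for one observer and unicorn are ordered by observation_date, a "run" is a maximal sequence of consecutive rows sharing the same no_of_babies and drunk, and last_observation_date is left empty for single-row runs. The function name `collapse` is made up for this sketch.

```python
from datetime import date
from itertools import groupby

# Unicorn 6 from the sample data:
# (created, lastseen, observation_date, no_of_babies, drunk)
rows = [
    (date(2011, 11, 10), date(2011, 11, 10), date(2011, 11, 7), 0, "n"),
    (date(2011, 11, 17), date(2011, 11, 17), date(2011, 11, 17), 0, "n"),
    (date(2011, 12, 1), date(2011, 12, 1), date(2011, 12, 1), 0, "n"),
    (date(2012, 1, 1), date(2012, 1, 1), date(2012, 1, 1), 3, "n"),
    (date(2012, 2, 1), date(2012, 2, 1), date(2012, 2, 1), 0, "n"),
    (date(2012, 3, 1), date(2012, 3, 1), date(2012, 3, 1), 0, "n"),
    (date(2012, 4, 1), date(2012, 4, 1), date(2012, 4, 1), 0, "n"),
    (date(2012, 5, 1), date(2012, 5, 1), date(2012, 5, 1), 0, "n"),
]


def collapse(rows):
    """Merge consecutive rows (by observation_date) that share a payload."""
    ordered = sorted(rows, key=lambda r: r[2])
    merged = []
    # groupby only groups *adjacent* equal keys, so the two separate
    # "0 babies" periods stay separate, unlike a SQL GROUP BY.
    for (babies, drunk), run in groupby(ordered, key=lambda r: (r[3], r[4])):
        run = list(run)
        merged.append((
            min(r[0] for r in run),   # created
            max(r[1] for r in run),   # lastseen
            min(r[2] for r in run),   # observation_date
            babies,
            drunk,
            max(r[2] for r in run) if len(run) > 1 else None,  # last_observation_date
        ))
    return merged


result = collapse(rows)
for row in result:
    print(row)
```

The key difference from the GROUP BY is that only adjacent rows are merged, so the row with 3 babies splits the "0 babies" rows into two periods.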

1. grazing around the moai

Etherege answered 4/10, 2012 at 9:1 Comment(9)

So you're looking to identify changes in babies/drunkness for each unicorn over time? – Outshine 4/10, 2012 at 9:20

I have changes in babies/drunkenness over time; I'm looking to identify where one record is the same as the previous record. – Etherege 4/10, 2012 at 9:21

Only if you're very good :-). – Etherege 4/10, 2012 at 12:37

I'm confused tho. You say "if a new record is received with the same observer_id, unicorn_id, no_of_babies and drunk (the payload) but with a newer observation_date I update a new column in the table, last_observation_date". So this is exactly what you get from your GROUP BY. In other words, the last time obs 1 observed unicorn 6 with 3 babies and wasn't drunk was Jan 1 2012. – Resume 4/10, 2012 at 12:41

If you look at where there are 0 babies they are either side of the record with 3. The group by makes these concurrent at some point (which is wrong). – Etherege 4/10, 2012 at 12:50

But those 2 rows both have the same: observer_id, unicorn_id, no_babies, and drunk. Isn't this your payload, so don't you want 1 row with the greatest (last) observation date? – Resume 4/10, 2012 at 12:57

I do, but in a temporal manner. If you look at the difference between the required results and the group by, the group by has merged two "0 baby" records into one, which means that on 01-Jan-2012 there are two contradictory records. – Etherege 4/10, 2012 at 13:1

I agree with @tbone. I'm glad I'm not the only one who's confused. The records generated by your posted query match your business rules. So, if they aren't what you want you need to clarify your rules to highlight the problem. (I think I can guess but I'd rather you just explained the problem.) – Penelopa 4/10, 2012 at 13:28

@APC; thanks for taking an interest! Does my edit (at the bottom) make more sense? – Etherege 4/10, 2012 at 13:46

Based on what I think you're trying to do, largely on your update regarding the specific issues with unicorn 6, I think this gets the result you want. It doesn't need recursive lead and lag, but does need two levels.

select *
from (
    select observer_id, unicorn_id,
        case when first_obs_dt is null then created
            else lag(created) over (order by rn) end as created,
        case when last_obs_dt is null then lastseen
            else lead(lastseen) over (order by rn) end as lastseen,
        case when first_obs_dt is null then observation_date
            else lag(observation_date) over (order by rn)
            end as observation_date,
        no_of_babies,
        drunk,
        case when last_obs_dt is null then observation_date
            else null end as last_obs_dt
    from (
        select observer_id, unicorn_id, created, lastseen, 
            observation_date, no_of_babies, drunk,
            case when lag_no_babies != no_of_babies or lag_drunk != drunk
                or lag_obs_dt is null then null
                else lag_obs_dt end as first_obs_dt,
            case when lead_no_babies != no_of_babies or lead_drunk != drunk
                or lead_obs_dt is null then null
                else lead_obs_dt end as last_obs_dt,
            rownum rn
        from (
            select observer_id, unicorn_id, created, lastseen,
                observation_date, no_of_babies, drunk,
                lag(observation_date)
                    over (partition by observer_id, unicorn_id, no_of_babies,
                            drunk
                        order by observation_date) lag_obs_dt,
                lag(no_of_babies)
                    over (partition by observer_id, unicorn_id, drunk
                        order by observation_date) lag_no_babies,
                lag(drunk)
                    over (partition by observer_id, unicorn_id, no_of_babies
                        order by observation_date) lag_drunk,
                lead(observation_date)
                    over (partition by observer_id, unicorn_id, no_of_babies,
                        drunk
                    order by observation_date) lead_obs_dt,
                lead(no_of_babies)
                    over (partition by observer_id, unicorn_id, drunk
                        order by observation_date) lead_no_babies,
                lead(drunk)
                    over (partition by observer_id, unicorn_id, no_of_babies
                        order by observation_date) lead_drunk
            from unicorn_observations
            order by 1,2,5
        )
    )
    where first_obs_dt is null or last_obs_dt is null
)
where last_obs_dt is not null
order by 1,2,3,4;

Which gives:

OBSERVER_ID UNICORN_ID CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAST_OBS_
----------- ---------- --------- --------- --------- ------------ - ---------
          1          1 17-NOV-11 01-NOV-11 09-APR-11           10 n 31-OCT-11
          1          1 19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12
          1          2 17-NOV-11 01-NOV-11 09-APR-11           10 n 31-OCT-11
          1          2 19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12
          1          3 17-NOV-11 01-NOV-11 09-APR-11           10 n 31-OCT-11
          1          3 19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12
          1          6 10-NOV-11 01-DEC-11 07-NOV-11            0 n 01-DEC-11
          1          6 01-JAN-12 01-JAN-12 01-JAN-12            3 n 01-JAN-12
          1          6 01-FEB-12 01-MAY-12 01-FEB-12            0 n 01-MAY-12

9 rows selected.

It's got the three records for unicorn 6, but the lastseen and observation_date for the third are the opposite way round to your sample, so I'm not sure if I'm still not understanding that. I've assumed that you want to keep the earliest observation_date and latest lastseen within each grouping, on the grounds that it seems to be what would happen when adding new records, but I'm not sure...

So, the innermost query gets the raw data from the table and gets a lead and lag for the observation_date and the no_of_babies and drunk columns using slightly different partitions. The order by is there so that a rownum, assigned in the next step, can be used for ordering in the step after that. Just for unicorn 6 for brevity:

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAG_OBS_D LAG_NO_BABIES L LEAD_OBS_ LEAD_NO_BABIES L
--------- --------- --------- ------------ - --------- ------------- - --------- -------------- -
10-NOV-11 10-NOV-11 07-NOV-11            0 n                           17-NOV-11              0 n
17-NOV-11 17-NOV-11 17-NOV-11            0 n 07-NOV-11             0 n 01-DEC-11              0 n
01-DEC-11 01-DEC-11 01-DEC-11            0 n 17-NOV-11             0 n 01-FEB-12              3 n
01-JAN-12 01-JAN-12 01-JAN-12            3 n                       0                          0
01-FEB-12 01-FEB-12 01-FEB-12            0 n 01-DEC-11             3 n 01-MAR-12              0 n
01-MAR-12 01-MAR-12 01-MAR-12            0 n 01-FEB-12             0 n 01-APR-12              0 n
01-APR-12 01-APR-12 01-APR-12            0 n 01-MAR-12             0 n 01-MAY-12              0 n
01-MAY-12 01-MAY-12 01-MAY-12            0 n 01-APR-12             0 n

The next level blanks out the lead and lag values for observation_date if either the no_of_babies or drunk value has changed - you only specifically referred to splitting on the baby count, but I assume you want to split on sobriety too. After this, anything that has null for either first_obs_dt or last_obs_dt is the start or end of a mini-range.

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D FIRST_OBS LAST_OBS_         RN
--------- --------- --------- ------------ - --------- --------- ----------
10-NOV-11 10-NOV-11 07-NOV-11            0 n           17-NOV-11          1
17-NOV-11 17-NOV-11 17-NOV-11            0 n 07-NOV-11 01-DEC-11          2
01-DEC-11 01-DEC-11 01-DEC-11            0 n 17-NOV-11                    3
01-JAN-12 01-JAN-12 01-JAN-12            3 n                              4
01-FEB-12 01-FEB-12 01-FEB-12            0 n           01-MAR-12          5
01-MAR-12 01-MAR-12 01-MAR-12            0 n 01-FEB-12 01-APR-12          6
01-APR-12 01-APR-12 01-APR-12            0 n 01-MAR-12 01-MAY-12          7
01-MAY-12 01-MAY-12 01-MAY-12            0 n 01-APR-12                    8
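The start and end markers in this step can be mimicked outside the database to check the logic. The following is a rough Python sketch, not part of the query: it assumes the rows are already in observation_date order, looks only at no_of_babies (drunk is 'n' for every unicorn 6 row), and uses a made-up helper name, `boundary_flags`. A row counts as a start when the previous row differs or is missing, and as an end when the next row differs or is missing.

```python
# no_of_babies for unicorn 6 in observation_date order (drunk is 'n' throughout)
babies = [0, 0, 0, 3, 0, 0, 0, 0]


def boundary_flags(values):
    """Return (is_start, is_end) per row, comparing each row with its neighbours."""
    flags = []
    last = len(values) - 1
    for i, value in enumerate(values):
        is_start = i == 0 or values[i - 1] != value   # like first_obs_dt being null
        is_end = i == last or values[i + 1] != value  # like last_obs_dt being null
        flags.append((is_start, is_end))
    return flags


flags = boundary_flags(babies)

# 1-based row numbers (matching RN above) that are a start or an end of a mini-range
kept = [i + 1 for i, (is_start, is_end) in enumerate(flags) if is_start or is_end]
print(kept)  # [1, 3, 4, 5, 8]
```

Rows 2, 6 and 7 are neither a start nor an end, so however many such rows there are between two boundaries they can all be dropped without losing information.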

Anything that isn't the start or end of a mini-range can now be ignored, as the values are either the same as or are superseded by those before or after. This deals with the indeterminate number of observations problem - it doesn't matter how many you ignore at this point. So the next level eliminates those intermediate values by filtering out rows where both first_obs_dt and last_obs_dt are non-null. Within that filtered set there's a second layer of lead and lag to get the first or last value for each date - and that's the bit I'm not sure is right as it doesn't match one of your samples.

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAST_OBS_
--------- --------- --------- ------------ - ---------
10-NOV-11 01-DEC-11 07-NOV-11            0 n
10-NOV-11 01-DEC-11 07-NOV-11            0 n 01-DEC-11
01-JAN-12 01-JAN-12 01-JAN-12            3 n 01-JAN-12
01-FEB-12 01-MAY-12 01-FEB-12            0 n
01-FEB-12 01-MAY-12 01-FEB-12            0 n 01-MAY-12

Finally the remaining rows that don't have a last_obs_dt are filtered out.

Now I'll wait to see which bit(s) I've misunderstood... *8-)

Following correction to lead and lag ordering, the same info for each stage for unicorn 1:

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAG_OBS_D LAG_NO_BABIES L LEAD_OBS_ LEAD_NO_BABIES L
--------- --------- --------- ------------ - --------- ------------- - --------- -------------- -
17-NOV-11 17-NOV-11 09-APR-11           10 n                           31-OCT-11             10 n
01-NOV-11 01-NOV-11 31-OCT-11           10 n 09-APR-11            10 n
19-APR-12 19-APR-12 19-APR-12            7 y

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D FIRST_OBS LAST_OBS_         RN
--------- --------- --------- ------------ - --------- --------- ----------
17-NOV-11 17-NOV-11 09-APR-11           10 n           31-OCT-11          1
01-NOV-11 01-NOV-11 31-OCT-11           10 n 09-APR-11                    2
19-APR-12 19-APR-12 19-APR-12            7 y                              3

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAST_OBS_
--------- --------- --------- ------------ - ---------
17-NOV-11 17-NOV-11 09-APR-11           10 n 09-APR-11
19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12

I'm not sure what should happen with the preserved observation_date and lastseen when the original data was entered out of sequence like this, or what you'll do in that situation with new records added in the future.

Will answered 4/10, 2012 at 22:53 Comment(6)

I'm not sure you've misunderstood anything! You've identified an error in my sample data, which your query got correct. There is something weird happening with the first row for unicorns 1, 2 & 3, which could easily be fixed by a least and greatest but I'm going to try to understand why before just running it. Thank you very much. – Etherege 5/10, 2012 at 8:5

@Etherege - hmm, I was concentrating too much on getting it right for 6, and missed that it was out for the others. I think it's because the lead and lag in the innermost query are ordered by created, and for those the later observation_date (31-OCT-11) has an earlier created (01-Nov-11). Just ordering by the other date doesn't fix it, so I'll look into it some more. Not sure if the original observation date would be updated in your new world, if that happened? – Will 5/10, 2012 at 8:41

@Etherege - well maybe it does fix it, if I change the ordering used to generate the rownum too. I've adjusted it and shown my working for unicorn 1 too. I'm not sure what you want to happen to the created and lastseen values in this situation though. – Will 5/10, 2012 at 9:0

I'll investigate and report back to you; sorry ran out of time today. – Etherege 5/10, 2012 at 15:34

I'm so sorry, work abducted me for a month. I'm accepting this answer as you've got so close and I'm fairly sure that I can modify it as necessary (when I have the time) to do what I need. Thank you! – Etherege 11/11, 2012 at 18:38

@Etherege - work has a nasty habit of doing that. I'm starting to worry that I find this sort of thing far too much fun... – Will 11/11, 2012 at 18:52

Try this.

with cte as
(
    select v.*,  ROW_NUMBER() over (partition by grp, unicorn_id order by grp, unicorn_id) rn
    from
    (
        select u.*, 
            ROW_NUMBER() over (partition by unicorn_id order by no_of_babies, drunk, created )
            -ROW_NUMBER() over (partition by unicorn_id order by created) as grp
        from unicorn_observations u
    ) v
) 
    select 
        observer_id, cte.unicorn_id, mincreated,maxlastseen,minobsdate,no_of_babies,drunk,maxobsdate
    from cte 
        inner join 
        (    
            select 
                unicorn_id, grp, 
                min(created) as mincreated,
                max(lastseen) as maxlastseen, 
                min(observation_date) as minobsdate,
                max(observation_date) as maxobsdate
            from cte 
            group by unicorn_id, grp
        ) v
        on cte.grp = v.grp
        and cte.unicorn_id = v.unicorn_id
    where rn=1  
    order by created;
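For readers unfamiliar with the row-number subtraction, here is a small runnable sketch of the more common form of the trick. It is not the query above: it partitions the second ROW_NUMBER by the payload and orders by observation_date. It is run through Python's sqlite3 module purely for convenience, so it assumes an SQLite build with window functions (3.25 or later) rather than Oracle, and the cut-down table `obs` with ISO date strings is invented for the example. Within one unbroken run of identical payloads both row numbers increase together, so their difference (`grp`) is constant; when a different payload intervenes, the difference changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table obs (unicorn_id int, created text, lastseen text, "
    "observation_date text, no_of_babies int, drunk text)"
)
# Unicorn 6 from the sample data, with dates as ISO strings so they sort correctly
conn.executemany(
    "insert into obs values (?, ?, ?, ?, ?, ?)",
    [
        (6, "2011-11-10", "2011-11-10", "2011-11-07", 0, "n"),
        (6, "2011-11-17", "2011-11-17", "2011-11-17", 0, "n"),
        (6, "2011-12-01", "2011-12-01", "2011-12-01", 0, "n"),
        (6, "2012-01-01", "2012-01-01", "2012-01-01", 3, "n"),
        (6, "2012-02-01", "2012-02-01", "2012-02-01", 0, "n"),
        (6, "2012-03-01", "2012-03-01", "2012-03-01", 0, "n"),
        (6, "2012-04-01", "2012-04-01", "2012-04-01", 0, "n"),
        (6, "2012-05-01", "2012-05-01", "2012-05-01", 0, "n"),
    ],
)

query = """
select unicorn_id, no_of_babies, drunk,
       min(created)          as first_created,
       max(lastseen)         as last_seen,
       min(observation_date) as first_obs,
       max(observation_date) as last_obs
  from (
        select o.*,
               -- position among all rows for the unicorn ...
               row_number() over (partition by unicorn_id
                                  order by observation_date)
               -- ... minus position among rows with the same payload
             - row_number() over (partition by unicorn_id, no_of_babies, drunk
                                  order by observation_date) as grp
          from obs o
       )
 group by unicorn_id, no_of_babies, drunk, grp
 order by first_created
"""

islands = conn.execute(query).fetchall()
for row in islands:
    print(row)
```

Grouping by the payload columns as well as `grp` matters here, because two different payloads can happen to produce the same difference.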

Outshine answered 4/10, 2012 at 10:31 Comment(5)

It's not a comment - it's a minus - Subtracting one from the other. – Outshine 4/10, 2012 at 11:14

Sorry, didn't realise that. It's so close... but as I don't 100% understand what's going on here it's quite difficult to fix; it looks like the first record is affected strangely and for all the others it's the most recent. – Etherege 4/10, 2012 at 12:27

@ben not sure what you mean by the first record? It produces the output you were after for your test data, no? – Outshine 4/10, 2012 at 12:41

No, the lastseen and last observation dates are different and there seems to be something strange going on with how create is calculated for the first unicorn 3. – Etherege 4/10, 2012 at 13:4

@Etherege I've tweaked the query, re the anomalies above. – Outshine 5/10, 2012 at 20:21

W

1

Based on what I think you're trying to do, largely on your update regarding the specific issues with unicorn 6, I think this gets the result you want. It doesn't need recursive lead and lag, but does need two levels.

select *
from (
    select observer_id, unicorn_id,
        case when first_obs_dt is null then created
            else lag(created) over (order by rn) end as created,
        case when last_obs_dt is null then lastseen
            else lead(lastseen) over (order by rn) end as lastseen,
        case when first_obs_dt is null then observation_date
            else lag(observation_date) over (order by rn)
            end as observation_date,
        no_of_babies,
        drunk,
        case when last_obs_dt is null then observation_date
            else null end as last_obs_dt
    from (
        select observer_id, unicorn_id, created, lastseen, 
            observation_date, no_of_babies, drunk,
            case when lag_no_babies != no_of_babies or lag_drunk != drunk
                or lag_obs_dt is null then null
                else lag_obs_dt end as first_obs_dt,
            case when lead_no_babies != no_of_babies or lead_drunk != drunk
                or lead_obs_dt is null then null
                else lead_obs_dt end as last_obs_dt,
            rownum rn
        from (
            select observer_id, unicorn_id, created, lastseen,
                observation_date, no_of_babies, drunk,
                lag(observation_date)
                    over (partition by observer_id, unicorn_id, no_of_babies,
                            drunk
                        order by observation_date) lag_obs_dt,
                lag(no_of_babies)
                    over (partition by observer_id, unicorn_id, drunk
                        order by observation_date) lag_no_babies,
                lag(drunk)
                    over (partition by observer_id, unicorn_id, no_of_babies
                        order by observation_date) lag_drunk,
                lead(observation_date)
                    over (partition by observer_id, unicorn_id, no_of_babies,
                        drunk
                    order by observation_date) lead_obs_dt,
                lead(no_of_babies)
                    over (partition by observer_id, unicorn_id, drunk
                        order by observation_date) lead_no_babies,
                lead(drunk)
                    over (partition by observer_id, unicorn_id, no_of_babies
                        order by observation_date) lead_drunk
            from unicorn_observations
            order by 1,2,5
        )
    )
    where first_obs_dt is null or last_obs_dt is null
)
where last_obs_dt is not null
order by 1,2,3,4;

Which gives:

OBSERVER_ID UNICORN_ID CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAST_OBS_
----------- ---------- --------- --------- --------- ------------ - ---------
          1          1 17-NOV-11 01-NOV-11 09-APR-11           10 n 31-OCT-11
          1          1 19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12
          1          2 17-NOV-11 01-NOV-11 09-APR-11           10 n 31-OCT-11
          1          2 19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12
          1          3 17-NOV-11 01-NOV-11 09-APR-11           10 n 31-OCT-11
          1          3 19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12
          1          6 10-NOV-11 01-DEC-11 07-NOV-11            0 n 01-DEC-11
          1          6 01-JAN-12 01-JAN-12 01-JAN-12            3 n 01-JAN-12
          1          6 01-FEB-12 01-MAY-12 01-FEB-12            0 n 01-MAY-12

9 rows selected.

It's got the three records for unicorn 6, but the lastseen and observation_date for the third are the opposite way round to your sample, so I'm not sure if I'm still not understanding that. I've assumed that you want to keep the earliest observation_date and latest lastseen within each grouping, on the grounds that it seems to be what would happen when adding new records, but I'm not sure...

So, the innermost query get the raw data from the table and gets a lead and lag for the observation_date and the no_of_babies and drunk columns using slightly different partitions. The order by is so a rownum can be used later, obtained in the next step and used for ordering in the one after that. Just for unicorn 6 for brevity:

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAG_OBS_D LAG_NO_BABIES L LEAD_OBS_ LEAD_NO_BABIES L
--------- --------- --------- ------------ - --------- ------------- - --------- -------------- -
10-NOV-11 10-NOV-11 07-NOV-11            0 n                           17-NOV-11              0 n
17-NOV-11 17-NOV-11 17-NOV-11            0 n 07-NOV-11             0 n 01-DEC-11              0 n
01-DEC-11 01-DEC-11 01-DEC-11            0 n 17-NOV-11             0 n 01-FEB-12              3 n
01-JAN-12 01-JAN-12 01-JAN-12            3 n                       0                          0
01-FEB-12 01-FEB-12 01-FEB-12            0 n 01-DEC-11             3 n 01-MAR-12              0 n
01-MAR-12 01-MAR-12 01-MAR-12            0 n 01-FEB-12             0 n 01-APR-12              0 n
01-APR-12 01-APR-12 01-APR-12            0 n 01-MAR-12             0 n 01-MAY-12              0 n
01-MAY-12 01-MAY-12 01-MAY-12            0 n 01-APR-12             0 n

The next level blanks out the lead and lag values for observation_date if either the num_of_babies or drunk value has changed - you only specifically referred to splitting on the baby count, but I assume you want to split on sobriety too. After this, anything that has null for either first_obs_date or last_obs_date is the start or end of a mini-range.

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D FIRST_OBS LAST_OBS_         RN
--------- --------- --------- ------------ - --------- --------- ----------
10-NOV-11 10-NOV-11 07-NOV-11            0 n           17-NOV-11          1
17-NOV-11 17-NOV-11 17-NOV-11            0 n 07-NOV-11 01-DEC-11          2
01-DEC-11 01-DEC-11 01-DEC-11            0 n 17-NOV-11                    3
01-JAN-12 01-JAN-12 01-JAN-12            3 n                              4
01-FEB-12 01-FEB-12 01-FEB-12            0 n           01-MAR-12          5
01-MAR-12 01-MAR-12 01-MAR-12            0 n 01-FEB-12 01-APR-12          6
01-APR-12 01-APR-12 01-APR-12            0 n 01-MAR-12 01-MAY-12          7
01-MAY-12 01-MAY-12 01-MAY-12            0 n 01-APR-12                    8

Anything that isn't the start or end of a mini-range can now be ignored, as the values are either same as or are superseded by those before or after. This deals with the indeterminate number of observations problem - it doesn't matter how many you ignore at this point. So the next level eliminates those intermediate values by filtering rows where both first_obs_dt and last_obs_dt are non-null. Within that filtered set there's a second layer of lead and lag to get the first or last value for each date - and that's the bit I'm not sure is right as it doesn't match one of your samples.

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAST_OBS_
--------- --------- --------- ------------ - ---------
10-NOV-11 01-DEC-11 07-NOV-11            0 n
10-NOV-11 01-DEC-11 07-NOV-11            0 n 01-DEC-11
01-JAN-12 01-JAN-12 01-JAN-12            3 n 01-JAN-12
01-FEB-12 01-MAY-12 01-FEB-12            0 n
01-FEB-12 01-MAY-12 01-FEB-12            0 n 01-MAY-12

Finally the remaining rows that don't have a last_obs_dt are filtered out.

Now I'll wait to see which bit(s) I've misunderstood... *8-)

Following correction to lead and lag ordering, the same info for each stage for unicorn 1:

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAG_OBS_D LAG_NO_BABIES L LEAD_OBS_ LEAD_NO_BABIES L
--------- --------- --------- ------------ - --------- ------------- - --------- -------------- -
17-NOV-11 17-NOV-11 09-APR-11           10 n                           31-OCT-11             10 n
01-NOV-11 01-NOV-11 31-OCT-11           10 n 09-APR-11            10 n
19-APR-12 19-APR-12 19-APR-12            7 y

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D FIRST_OBS LAST_OBS_         RN
--------- --------- --------- ------------ - --------- --------- ----------
17-NOV-11 17-NOV-11 09-APR-11           10 n           31-OCT-11          1
01-NOV-11 01-NOV-11 31-OCT-11           10 n 09-APR-11                    2
19-APR-12 19-APR-12 19-APR-12            7 y                              3

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAST_OBS_
--------- --------- --------- ------------ - ---------
17-NOV-11 17-NOV-11 09-APR-11           10 n 09-APR-11
19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12

I'm not sure what should happen with the preserved observation_date and lastseen when the original data was entered out of sequence like this, or what you'll do in that situation with new records added in the future.

Will answered 4/10, 2012 at 22:53 Comment(6)

I'm not sure you've misunderstood anything! You've identified an error in my sample data, which your query got correct. There is something weird happening with the first row for unicorns 1, 2 & 3, which could easily be fixed by a least and greatest but I'm going to try to understand why before just running it. Thank you very much. – Etherege 5/10, 2012 at 8:5

@Etherege - hmm, I was concentrating too much on getting it right for 6, and missed that it was out for the others. I think it's because the lead and lag in the innermost query are ordered by created, and for those the later observation_date (31-OCT-11) has an earlier created (01-Nov-11). Just ordering by the other date doesn't fix it, so I'll look into it some more. Not sure if the original observation date would be updated in your new world, if that happened? – Will 5/10, 2012 at 8:41

@Etherege - well maybe it does fix it, if I change the ordering used to generate the rownum too. I've adjusted it and shown my working for unicorn 1 too. I'm not sure what you want to happen to the created and lastseen values in this situation though. – Will 5/10, 2012 at 9:0

I'll investigate and report back to you; sorry ran out of time today. – Etherege 5/10, 2012 at 15:34

I'm so sorry, work abducted me for a month. I'm accepting this answer as you've got so close and I'm fairly sure that I can modify it as necessary (when I have the time) to do what I need. Thank you! – Etherege 11/11, 2012 at 18:38

@Etherege - work has a nasty habit of doing that. I'm starting to worry that I find this sort of thing far too much fun... – Will 11/11, 2012 at 18:52

This type of problem can be solved by first creating some flags in a subquery, then using them.

with obs_flags as (
    select 
       observer_id as obs_id
     , unicorn_id as uni_id
     , case when lag(observation_date) over (
           partition by unicorn_id, no_of_babies, drunk
           order by unicorn_id, observation_date
       ) is null then 1 else 0 end as group_start
     , case when lead(observation_date) over (
           partition by unicorn_id, no_of_babies, drunk
           order by unicorn_id, observation_date
       ) is null then 1 else 0 end as group_end
     , observation_date
     , no_of_babies
     , drunk
     , lastseen
     , created
  from unicorn_observations
)
select obs_start.obs_id
     , obs_start.uni_id
     , obs_start.created
     , obs_end.lastseen as lastseen
     , obs_start.observation_date
     , obs_start.no_of_babies as "#BABIES"
     , obs_start.drunk
     , obs_end.observation_date as last_obs_date
  from obs_flags obs_start
  join obs_flags obs_end on 
      obs_start.group_start = 1 and
      obs_end.group_end = 1 and
      obs_start.uni_id = obs_end.uni_id and
      obs_start.no_of_babies = obs_end.no_of_babies and
      obs_start.drunk = obs_end.drunk and
      obs_start.observation_date <= obs_end.observation_date and
      --Only join with the first end point we find:
      not exists (
          select * from obs_flags f where
              obs_start.uni_id = f.uni_id and
              obs_start.no_of_babies = f.no_of_babies and
              obs_start.drunk = f.drunk and
              f.group_end = 1 and
              f.observation_date < obs_end.observation_date and
              f.observation_date >= obs_start.observation_date
      );

This is a complex problem; I may not have quite met your requirements, and there could be a typo in there (I don't have Oracle to test it). However, it should give you an idea of how it can be done.

Basically, you first find all start and end records of the periods you are interested in. Then you join each start record to the next end record within the same grouping.

Update: my original code didn't check that the end came after the start. I fixed that.

Update2: as Ben pointed out, the not exists clause will be slow here. An alternative that has helped me speed things up in the past is to do this in two steps: first find all potential pairings, then separately select only the correct pairings out of that.

In this case, in a temporary table or subquery, join each obs_start to every potentially correct obs_end.

Then, out of these pairings, select the one that has the earliest obs_end for each obs_start.
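
A rough, untested sketch of that two-step idea, written as two subqueries rather than a temporary table. The obs_flags subquery is taken from the query above, partition clause included; the pairings name and the candidate_no column are made up for illustration, and using row_number to pick the earliest end is just one way to do the second step.

```sql
-- Untested sketch: pairings and candidate_no are illustrative names.
with obs_flags as (
    select observer_id as obs_id
         , unicorn_id as uni_id
         , case when lag(observation_date) over (
               partition by unicorn_id, no_of_babies, drunk
               order by unicorn_id, observation_date
           ) is null then 1 else 0 end as group_start
         , case when lead(observation_date) over (
               partition by unicorn_id, no_of_babies, drunk
               order by unicorn_id, observation_date
           ) is null then 1 else 0 end as group_end
         , observation_date
         , no_of_babies
         , drunk
         , lastseen
         , created
      from unicorn_observations
)
-- Step 1: pair every start row with every end row that could close it,
-- numbering the candidates for each start from the earliest end date.
, pairings as (
    select s.obs_id
         , s.uni_id
         , s.created
         , e.lastseen
         , s.observation_date
         , s.no_of_babies
         , s.drunk
         , e.observation_date as last_obs_date
         , row_number() over (
               partition by s.obs_id, s.uni_id, s.observation_date
               order by e.observation_date
           ) as candidate_no
      from obs_flags s
      join obs_flags e
        on s.uni_id = e.uni_id
       and s.no_of_babies = e.no_of_babies
       and s.drunk = e.drunk
       and s.observation_date <= e.observation_date
     where s.group_start = 1
       and e.group_end = 1
)
-- Step 2: keep only the earliest end for each start.
select obs_id, uni_id, created, lastseen, observation_date
     , no_of_babies as "#BABIES", drunk, last_obs_date
  from pairings
 where candidate_no = 1;
```

One difference from the not exists version: if two end rows share the same observation_date, row_number picks one of them arbitrarily, whereas not exists would return both pairings.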

Puissant answered 4/10, 2012 at 9:46 Comment(2)

I've fixed your syntax errors; I think your partition clauses are incorrect, though I can see where you're going. My one concern would be the not exists, which is going to get very messy on a table this size though I have more than enough UNDO to do it. – Etherege 4/10, 2012 at 10:8

@Ben, sorry if there are syntax issues; my experience is with DB2 rather than Oracle. You're right that the not exists will be slow. I updated the answer with a potential alternative. – Puissant 4/10, 2012 at 11:1
