There are a number of researchers observing the world's last remaining unicorns, on Easter Island1. Each day the researchers record which unicorn they sighted, the date of the sighting, the number of babies each unicorn has and whether they were drunk when the sighting took place. These are individually uploaded to a central location, which then spits out a flat file to me of all new observations each day.
I have a table that looks like this to contain the information:
create table unicorn_observations (
observer_id number not null
, unicorn_id number not null
, created date not null -- date the record was inserted into the database
, lastseen date not null -- date the record was last seen
, observation_date date not null
, no_of_babies number not null
, drunk varchar2(1) not null
, constraint pk_uo primary key ( observer_id, unicorn_id, created )
, constraint chk_uo_babies check ( no_of_babies >= 0 )
, constraint chk_uo_drunk check ( drunk in ('y','n') )
);
The table is separately unique on observer_id
, unicorn_id
and observation_date
or lastseen
.
Sometimes the Cobold [sic] managing the output of data gets it slightly wrong and re-outputs the same data twice. In this situation I update the lastseen
instead of creating a new record. I only do this in situations where every column is the same
Unfortunately, the researchers aren't fully aware of the third normal form. Each month they upload the previous months observations for a few unicorns, even if no new observations have been made. They do this with a new observation_date
, which means a new record gets inserted into the table.
I have a separate created
and lastseen
for full traceability as the researchers sometimes submit some observations late. These are created by the database and are not part of the submitted information.
Here is some sample data (with partially changed column names in order to make it fit without a scroll bar).
+--------+--------+-----------+-----------+-----------+---------+-------+ | OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | +--------+--------+-----------+-----------+-----------+---------+-------+ | 1 | 1 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 | 10 | n | | 1 | 2 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 | 10 | n | | 1 | 3 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 | 10 | n | | 1 | 6 | 10-NOV-11 | 10-NOV-11 | 07-NOV-11 | 0 | n | | 1 | 1 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | | 1 | 2 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | | 1 | 3 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | | 1 | 6 | 17-NOV-11 | 17-NOV-11 | 17-NOV-11 | 0 | n | | 1 | 6 | 01-DEC-11 | 01-DEC-11 | 01-DEC-11 | 0 | n | | 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | | 1 | 6 | 01-FEB-12 | 01-FEB-12 | 01-FEB-12 | 0 | n | | 1 | 6 | 01-MAR-12 | 01-MAR-12 | 01-MAR-12 | 0 | n | | 1 | 6 | 01-APR-12 | 01-APR-12 | 01-APR-12 | 0 | n | | 1 | 1 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | 1 | 2 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | 1 | 3 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | 1 | 6 | 01-MAY-12 | 01-MAY-12 | 01-MAY-12 | 0 | n | +--------+--------+-----------+-----------+-----------+---------+-------+
I would like to partially denormalise these observations so that if a new record is received with the same observer_id
, unicorn_id
, no_of_babies
and drunk
(the payload) but with a newer observation_date
I update a new column in the table, last_observation_date
, instead of inserting a new record. I would still update thelastseen
in this situation.
I need to do this as I have a number of complicated unicorn related queries that join to this table; the researchers upload old observations with new dates about 10m times a month and I receive approximately 9m genuinely new records a month. I've been running for a year and already have 225m unicorn observations. As I only need to know the last observation date for each payload combination I would rather massively reduce the size of the table and save myself a lot of time full-scanning it.
This means that the table would become:
create table unicorn_observations (
observer_id number not null
, unicorn_id number not null
, created date not null -- date the record was inserted into the database
, lastseen date not null -- date the record was last seen
, observation_date date not null
, no_of_babies number not null
, drunk varchar2(1) not null
, last_observation_date date
, constraint pk_uo primary key ( observer_id, unicorn_id, created )
, constraint chk_uo_babies check ( no_of_babies >= 0 )
, constraint chk_uo_drunk check ( drunk in ('y','n') )
);
and the data stored in the table would look like the below; it doesn't matter whether last_observation_date
is null or not if the observation has only been "seen" once. I do not need help in loading the data, only in partially denormalising the current table to look like this.
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | 1 | 6 | 10-NOV-11 | 01-DEC-11 | 07-NOV-11 | 0 | n | 01-DEC-11 | | 1 | 1 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 2 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 3 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | | | 1 | 6 | 01-FEB-12 | 01-MAY-12 | 01-FEB-12 | 0 | n | 01-MAY-12 | | 1 | 1 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | | 1 | 2 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | | 1 | 3 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+
The obvious answer
select observer_id as obs_id
, unicorn_id as uni_id
, min(created) as created
, max(lastseen) as lastseen
, min(observation_date) as obs_date
, no_of_babies as "#BABIES"
, drunk
, max(observation_date) as last_obs_date
from unicorn_observations
group by observer_id
, unicorn_id
, no_of_babies
, drunk
doesn't work as it ignores the single observation of 3 unicorn babies for unicorn 6 on the 1st January 2012; this in turn means that the lastseen
for the record created on the 10th November is incorrect.
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | 1 | 1 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 2 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 3 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 6 | 10-NOV-11 | 01-MAY-12 | 07-NOV-11 | 0 | n | 01-MAY-12 | | 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | 01-JAN-12 | | 1 | 1 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | 19-APR-12 | | 1 | 2 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | 19-APR-12 | | 1 | 3 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | 19-APR-12 | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+
I do not currently see a way of doing this without some procedural logic, i.e. a loop. I would much rather avoid a loop in this situation as I would have to full-scan a 225m row table 260 times (number of distinct created
dates). Even using lag()
and lead()
would need to be recursive as there is an indeterminate amount of observations per unicorn.
Is there a way of creating this data-set in a single SQL statement?
The table specification and sample data is also in a SQL Fiddle.
Attempted better explanation:
The problem is maintaining when something was true. On 01-Jan-2012 unicorn 6 had 3 babies.
Looking at just unicorn 6 in the "table" created by the GROUP BY; if I try to find the number of babies on the 1st of January I will get two records returned, which is a contradiction.
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | 1 | 6 | 10-NOV-11 | 01-MAY-12 | 07-NOV-11 | 0 | n | 01-MAY-12 | | 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | 01-JAN-12 | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+
However, I would want only one row, as in the second table. Here, for any point in time there is at most one "correct" value because the two periods of time where unicorn 6 had 0 babies have been separated into two rows by the day when it had 3.
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | 1 | 6 | 10-NOV-11 | 01-DEC-11 | 07-NOV-11 | 0 | n | 01-DEC-11 | | 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | | | 1 | 6 | 01-FEB-12 | 01-MAY-12 | 01-FEB-12 | 0 | n | 01-MAY-12 | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+
1. grazing around the moai
group by
the group by has merged to "0 baby" records into one, which means that on 01-Jan-2012 there are two contradictory records. – Etherege