How to create TimescaleDB Hypertable with time partitioning on non unique timestamp?
Asked Answered
B

5

24

I have just started to use TimescaleDB and want to create a hypertable on a table with events. Originally I thought of following the conventional pattern of:

CREATE TABLE event (
  id serial PRIMARY KEY,
  ts timestamp with time zone NOT NULL,
  details varchar(255) NOT NULL
);

CREATE INDEX event_ts_idx on event(ts);

However, when I tried to create the hypertable with the following query:

SELECT create_hypertable('event', 'ts');

I got: ERROR: cannot create a unique index without the column "ts" (used in partitioning)

After doing some research, it seems that the timestamp itself needs to be the (or part of the) primary key.

However, I do not want the timestamp ts to be unique. It is very likely that these high frequency events will coincide in the same microsecond (the maximum resolution of the timestamp type). It is the whole reason why I am looking into TimescaleDB in the first place.

What is the best practice in this case?

I was thinking of maybe keeping the serial id as part of the primary key, and making it composite like this:

CREATE TABLE event_hyper (
  id serial,
  ts timestamp with time zone NOT NULL,
  details varchar(255) NOT NULL,
  PRIMARY KEY (id, ts)
);

SELECT create_hypertable('event_hyper', 'ts');

This sort of works, but I am unsure if it is the right approach, or if I am creating a complicated primary key which will slow down inserts or create other problems.

What is the right approach when you have possible collision in timestamps when using TimescaleDB hypertables?

Boykins answered 3/2, 2021 at 17:4 Comment(2)
If "ts" is only part of the primary key, then it does not need to be unique. You have answered your own questionWax
@Wax well i don't even want it to be the primary key. I only did it as a workaround because of timescaleDB's restriction that it needs to be part of the primary key. What i am asking is whether it makes sense.Boykins
B
24

How to create TimescaleDB Hypertable with time partitioning on non unique timestamp?

There is no need to create unique constraint on time dimension (unique constraints are not required). This works:

CREATE TABLE event (
  id serial,
  ts timestamp with time zone NOT NULL,
  details varchar(255) NOT NULL
);
SELECT create_hypertable('event', 'ts');

Note that the primary key on id is removed.

If you want to create unique constraint or primary key, then TimescaleDB requires that any unique constraint or primary key includes the time dimension. This is similar to limitation of PostgreSQL in declarative partitioning to include partition key into unique constraint:

Unique constraints (and hence primary keys) on partitioned tables must include all the partition key columns. This limitation exists because PostgreSQL can only enforce uniqueness in each partition individually.

TimescaleDB also enforces uniqueness in each chunk individually. Maintaining uniqueness across chunks can affect ingesting performance dramatically.

The most common approach to fix the issue with the primary key is to create a composite key and include the time dimension as proposed in the question. If the index on the time dimension is not needed (no queries only on time is expected), then the index on time dimension can be avoided:

CREATE TABLE event_hyper (
  id serial,
  ts timestamp with time zone NOT NULL,
  details varchar(255) NOT NULL,
  PRIMARY KEY (id, ts)
);

SELECT create_hypertable('event_hyper', 'ts', create_default_indexes => FALSE);

It is also possible to use an integer column as the time dimension. It is important that such column has time dimension properties: the value is increasing over time, which is important for insert performance, and queries will select a time range, which is critical for query performance over large database. The common case is for storing unix epoch.

Since id in event_hyper is SERIAL, it will increase with time. However, I doubt the queries will select the range on it. For completeness SQL will be:

CREATE TABLE event_hyper (
  id serial PRIMARY KEY,
  ts timestamp with time zone NOT NULL,
  details varchar(255) NOT NULL
);

SELECT create_hypertable('event_hyper', 'id', chunk_time_interval => 1000000);
Belgian answered 4/2, 2021 at 8:20 Comment(9)
So essentially, what you are suggesting is not to create a primary key at all? How can I use it then with software like ORM that require that a table has a primary key?Boykins
@Boykins I missed to elaborate this part. You are correct that composite primary key is a usual approach in such situation. I expanded my answer and also added another approach, but I don't think it is suitable.Belgian
Yeah, the unix time approach is interesting, but my queries really will be on time. The SERIAL is just there is the conventional auto increment primary key since you can have 2 events exact same time. This is such a bummer.Boykins
There is another issue, that JPA/Hibernate ORM doesn't support Generated Values on composite primary keys.Boykins
@Boykins Can you elaborate your question? In your table schema you use PostgreSQL type SERIAL, which works inside the database. I don't see how ORM is involved here.Belgian
@Boykins I see disconnection between what you need and your question. The question is focused on requiring uniqueness on time dimension, and it is not required as I answered. It might be valuable to ask another question where you describe the requirements of your ORM.Belgian
Well the database is only half the story. You need the client to work well with it too. Usually, with JPA/Hibernate what you typically do is set a GeneratedValue annotation to tell it what sequence to use. So what the ORM does underneath the hood is call the sequence generator to get the nextval(). With this composite key it just becomes a bit messier. I managed to work around it (I think).Boykins
Yeah, your answer is fine for the purposes of the question. Just ranting that unfortunately using TimescaleDB is not as clean as advertised.Boykins
@Boykins ORMs unfortunately don't count for partitioned databases in general. There were several issues discussed in TimescaleDB repo and support Slack, e.g., about Django. There are tradeoffs in the partitioning case. I hope I explained it in my answer.Belgian
A
1

To build on @k_rus 's answer, it seems like the generated primary key here is not actually what you're looking for. What meaning does that id have? Isn't it just identifying a unique details, ts combination? Or can there meaningfully be two values that have the same timestamp and the same details but different ids that actually has some sort of semantic meaning. It seems to me that that is somewhat nonsensical, in which case, I would do a primary key on (details, ts) which should provide you the uniqueness condition that you need. I do not know if your ORM will like this, they tend to be overly dependent on generated primary keys because, among other things, not all databases support composite primary keys. But in general, my advice for cases like this is to actually use a composite primary key with logical meaning.

Now if you actually care about multiple messages with the same details at the same timestamp, I might suggest a table structure something like

CREATE TABLE event_hyper (
  ts timestamp with time zone NOT NULL,
  details varchar(255) NOT NULL,
  count int,
  PRIMARY KEY (details, ts)
);

with which you can do an INSERT ON CONFLICT DO UPDATE in order to increment it.

I wish that ORMs were better about doing this sort of thing, but you can usually trick ORMs into reading from other tables (or a view over them because then they think they can't update records there etc, which is why they need to have the generated PK). Then it just means that there's a little bit of custom ingest code to write that inserts into the hypertable. It's often better to do this anyway because, in general, I've found that ORMs don't always follow best practices for high volume inserts, and often don't use bulk loading techniques.

So a table like that, with a view that just select's * from the table should then allow you to use the ORM for reads, write a very small amount of custom code to do ingest into the timeseries table and voila - it works. The rest of your relational model, which is the part that the ORM excels at doing can live in the ORM and then have a minor integration here with a bit of custom SQL and a few custom methods.

Agnes answered 6/2, 2021 at 20:44 Comment(8)
Just because you don't understand the use case doesn't mean it is nonsensical. My details was just there to create a simple minimal example. In reality I have around 10 columns. Creating a Primary Key on all details is impractical and just wrong. I don't care about the uniqueness of the details, I care that I have an entry for each individual event, inserted as fast as possible, and then have TimescaleDB do the aggregate tables. If I am going to do the counts myself at application level, then there is no point in using TimescaleDB, I could have just used normal Postgresql.Boykins
Okay, and there's no subset of columns that would serve as a good unique key with timestamp? Often a single device, or some other id will only send one record at a given timestamp and that's the composite key that we recommend using rather than using a generated key where every record has its own unique identifier...for instance, what keys do you group by in the continuous aggregate other than time bucket? That could be a good proxy for this.Agnes
(And I apologize if I came off as dismissive, I didn't mean to be. Also, if there's no unique identifier for these things and there's no reason to have that extra key other than for use with the ORM, the advice to just use it as a table you select from works just fine, and you don't have to have any primary key)Agnes
(I'm also happy to try to understand your use case better if you'd like to provide more info about what the other columns are and what you're trying to solve).Agnes
For instance a camera capturing a live feed of a road with vehicles being detected through object recognition. Two vehicles can easily pass together at the same time, especially in high traffic where vehicles move slowly in sync. If they are the same type there is nothing that distinguishes the entries.Boykins
Then it sounds like you simply don't have a unique constraint on the table...as long as you're not using the ORM to do updates, and this data seems like it should be write once, you should be able to just sort of use it as a view for the ORM and be okay with not having a unique constraint. If the unique constraint doesn't have a logical meaning, I probably just wouldn't use it. (Though, again, I would consider the on conflict do update count way of doing things as an option here, I'm not advocating doing it for fifteen minute buckets, just when 2 cars pass at exactly the same microsecond).Agnes
Well yes I am using an ORM for inserts. It is write once. But since it is a star schema with this table at the centre and a lot of foreign keys, using an ORM (in this case JPA/Hibernate with a Spring Data repository) is very convenient, especially to cache dimensions that are used repeatedly with Ehcache, and the same entities are reused for queries. Is the on conflict update slow?Boykins
Hmmm...yes. I don't know how that will all interact and the type of caching the ORM does, you'd have to play around with it and see if it'll work with that. It might still be able to do some of that here, as the foreign keys should all be coming from other tables. Anyway, the on conflict do update should still be quite fast, it's only going to fire if there's a conflict and it doesn't seem horribly common (ie probably less than 10% of inserts).Agnes
P
0

The limitation is:
Need to make all partition columns (primary & secondary, if any) as a unique key of table.

Refer: https://github.com/timescale/timescaledb/issues/447#issuecomment-369371441

2 choices in my opinion:

  1. partition by a single column, which is a unique key (e.g the primary key),
  2. partition with a 2nd space partition key, need to make the 2 columns a combined unique key,
Proponent answered 5/6, 2022 at 14:21 Comment(0)
V
0

I got the same problem.

The solution was to avoid this field: id: 'id'

Violone answered 11/8, 2022 at 11:46 Comment(0)
D
-1

I think I'm replying a little bit too late, but still. You can try something like this:

CREATE TABLE event_hyper (
  id serial,
  ts timestamp with time zone NOT NULL,
  details varchar(255) NOT NULL 
);                    
SELECT create_hypertable('event_hyper', 'ts', partitioning_column => 'id', number_partitions => X);

Where X is the desirable number of hash partitions by column 'id'.
https://docs.timescale.com/api/latest/hypertable/create_hypertable/#optional-arguments

As you can also notice there's no PRIMARY KEY constraint in table 'event_hyper'.

Output of create_hypertable() operation should be:

     create_hypertable     
---------------------------
 (1,public,event_hyper,t)
Doeskin answered 6/12, 2022 at 10:47 Comment(1)
What is the advantage of specifying the partitioning_column rather than leaving it default?Boykins

© 2022 - 2024 — McMap. All rights reserved.