serial in postgres is being increased even though I added on conflict do nothing

I'm using Postgres 9.5 and seeing some weird things here.

I have a cron job running every 5 minutes that fires a SQL statement to add a list of records if they don't already exist.

INSERT INTO
    sometable (customer, balance)
VALUES
    (:customer, :balance)
ON CONFLICT (customer) DO NOTHING

sometable.customer is a primary key (text)

sometable structure is:
id: serial
customer: text
balance: bigint

Now it seems like every time this job runs, the id sequence is silently incremented by 1. So the next time I really add a record, its id is thousands of numbers above my last value. I thought this query would check for conflicts and, if there is one, do nothing, but currently it seems like it tries to insert the record, increments the id and then stops.

Any suggestions?

Como answered 13/5, 2016 at 8:15 Comment(7)
This is by design: postgresql.nabble.com/… But you don't need to worry about those gaps, they are not a problem. The only thing a sequence guarantees is that it never generates the same number twice.Hards
If, for some reason, you actually need a gapless sequence (a rare requirement), then Postgres's SERIAL/SEQUENCE support is not what you should be using. See this answer from Craig Ringer.Loup
The gaps in the sequence are a problem if you are partitioning another table based on hashing that key. No way to guarantee a balanced set of partitions. I am facing this problem right now.Tetherball
I feel like someone should have pointed you toward UUID keys as an alternative to serial keys. They very likely would solve your problem.Ashlan
The gaps aren't a problem for me, but I'm wanting to use a smallserial and the insert conflicts are blowing out the max size of the smallint even though in reality there are way fewer than 32k rows. With 32 parallel processes loading data, the conflicts reach the max within a few seconds... (this table has billions of rows, hence the desire to convert a ~100 char varchar down to a smallint since there are only ~20k distinct values)Tinaret
@J.Dimeo: have you ever found a working solution? Having the exact same problemBiogenesis
@Biogenesis unfortunately no. IIRC, I just managed this lookup table of 20k distinct values in memory/in application code :-(Tinaret

The reason this feels weird to you is that you are thinking of the increment on the counter as part of the insert operation, and therefore the "DO NOTHING" ought to mean "don't increment anything". You're picturing this:

  1. Check values to insert against constraint
  2. If duplicate detected, abort
  3. Increment sequence
  4. Insert data

But in fact, the increment has to happen before the insert is attempted. A SERIAL column in Postgres is implemented as a DEFAULT which executes the nextval() function on a bound SEQUENCE. Before the DBMS can do anything with the data, it's got to have a complete set of columns, so the order of operations is like this:

  1. Resolve default values, including incrementing the sequence
  2. Check values to insert against constraint
  3. If duplicate detected, abort
  4. Insert data

This can be seen intuitively if the duplicate key is in the autoincrement field itself:

CREATE TABLE foo ( id SERIAL NOT NULL PRIMARY KEY, bar text );
-- Insert row 1
INSERT INTO foo ( bar ) VALUES ( 'test' );
-- Reset the sequence so the next generated id is 1 again
SELECT setval(pg_get_serial_sequence('foo', 'id'), 1, false);
-- Attempt to insert row 1 again
INSERT INTO foo ( bar ) VALUES ( 'test 2' )
     ON CONFLICT (id) DO NOTHING;

Clearly, this can't know if there's a conflict without incrementing the sequence, so the "do nothing" has to come after that increment.
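
For reference, the DEFAULT mentioned above means that id SERIAL is roughly shorthand for creating a sequence and binding it to the column, something like this (sketched against the question's table):

CREATE SEQUENCE sometable_id_seq;
CREATE TABLE sometable (
    id       integer NOT NULL DEFAULT nextval('sometable_id_seq'),
    customer text,
    balance  bigint
);
ALTER SEQUENCE sometable_id_seq OWNED BY sometable.id;

Computing the candidate row therefore already calls nextval(), before any constraint is checked.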

Loup answered 13/5, 2016 at 9:24 Comment(9)
Great explanation! Do you know if this is documented somewhere in PostgreSQL's doc? I searched but couldn't find anything.Intemerate
@Intemerate The Serial pseudo-type is documented here and explains that it's actually implemented as a default on the column, calling a function; but the documentation for ON CONFLICT clauses doesn't explicitly mention that defaults will be resolved first. Perhaps you could propose an edit.Loup
Great answer. I noticed there were large gaps in our primary keys. I wasn't sure if there was some error being made on our part, but we're making heavy use of ON CONFLICT. So this explains the observations. Also note that there is a CYCLE option for sequences that enables wrap-around behavior when it exceeds the max.Europe
This also happens with GENERATED ALWAYS AS IDENTITY columns too, which is annoying because it feels like you get nothing new with the extra verbose syntaxMesomorphic
@Mesomorphic That syntax mostly exists for compliance with the SQL standard, which is why it's so verbose; standard SQL loves long-winded syntax e.g. "FETCH FIRST 10 ROWS ONLY" instead of "LIMIT 10" or "TOP 10". The implementation of "identity" columns is basically the same as "serial" columns, and there's really no reason for it not to behave the same way I've described in this answer.Loup
I believe they added support for that syntax fairly recently, thus it feels like a missed opportunity to do things differently. The existing behavior might be fine for certain workloads, but it causes problems with others (like string deduplication/compression).Mesomorphic
@Mesomorphic Recently in the grand scheme of things, but almost as long ago as I wrote this answer. And as the answer explains, the behaviour is not so much a design decision as a logical necessity: the values of the new row need to be computed to know whether they cause a conflict. I imagine that calculating some values but not others would be a non-trivial thing to implement, and completely unrelated to how a sequence is attached to a column (which is how Postgres implements both Serial and Identity columns).Loup
It is a consequence of other design decisions, but some of those are so deeply ingrained (in Postgres or in SQL more generally) that changing them would be practically impossible. I don't think there's a general case solution, but it does feel like (perhaps with some other infrastructure, like additional statements or constraints) it should be possible to have mostly gapless identifiers as a special case.Mesomorphic
@Mesomorphic Gapless identifiers are a whole topic to themselves, and would require a lot more than a change to this small edge case. Sequences are deliberately independent of transactions, so you can receive unique values in simultaneous, uncommitted, transactions; a gapless sequence requires the opposite, a mandatory lock, so that you can't create a gap by rolling back a transaction. You've got to generate the value at some point, so what you'd want then probably wouldn't be "on conflict do nothing", it would be "on conflict roll back increment and release lock".Loup

As already said by @a_horse_with_no_name and @Serge Ballesta, serials are always incremented even if the INSERT fails.

You can try to "roll back" the serial value to the maximum id in use by resetting the corresponding sequence:

SELECT setval('sometable_id_seq', MAX(id), true) FROM sometable;
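
Note that if sometable is empty, MAX(id) is NULL and the call above doesn't reset anything. A variant that also handles that case, and uses pg_get_serial_sequence so the sequence name doesn't have to be hard-coded, could look like this:

SELECT setval(
    pg_get_serial_sequence('sometable', 'id'),
    COALESCE(MAX(id), 0) + 1,
    false
) FROM sometable;

With the third argument false, the next nextval() returns exactly the value passed in, i.e. MAX(id) + 1, or 1 for an empty table.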
Rickart answered 13/5, 2016 at 8:39 Comment(1)
Note that passing true as the third parameter to setval is the same as not passing a third parameter. setval('sometable_id_seq', MAX(id), true) is equivalent to setval('sometable_id_seq', MAX(id))Bedsore

As said by @a_horse_with_no_name, that is by design. Serial fields are implemented under the hood through sequences, and for obvious reasons, once you have obtained a new value from a sequence, you cannot roll it back. Imagine the following scenario:

  • sequence is at n
  • A requires a new value : got n+1
  • in a concurrent transaction B requires a new value: got n+2
  • for any reason A rolls back its transaction - would you feel safe resetting the sequence?

That is why the documentation for sequences (and serial fields) simply states that holes can occur in the returned values when transactions are rolled back. Only uniqueness is guaranteed.
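
You can see this non-transactional behaviour directly; the concrete numbers below are only illustrative:

BEGIN;
SELECT nextval('sometable_id_seq');  -- returns, say, 42
ROLLBACK;
SELECT nextval('sometable_id_seq');  -- returns 43; the rolled-back value is not reused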

Offspring answered 13/5, 2016 at 8:28 Comment(6)
Yeah thanks, I got that, but from my point of view a +1 doesn't make sense if DO NOTHING is provided in the query. It should check before actually adding whether it needs to add anything. Also I am worried about the gaps. This job runs every 5 mins...Como
so the only thing I can do is check myself whether this record needs to be added, by selecting all of them :-/Como
@Christian: there is nothing you need to do. Just ignore the gaps. There is nothing wrong with ID values having (large) gaps.Hards
@a_horse_with_no_name Unless you are inserting enough data to overflow a SERIAL in a "short" (purposefully vague) amount of time. I was going to do this for a query that runs ~50 times/second, but I would run out of unique values in roughly 1.36 years. I wish instead of being forced to use BIGSERIAL there was a command like SELECT id FROM table WHERE uniqueCol = 'val' ON EMPTY INSERT INTO table (uniqueCol) VALUES ('val') ON CONFLICT(uniqueCol) DO UPDATE SET uniqueCol = EXCLUDED.uniqueCol RETURNING id. That would significantly reduce the number of wasted id's.Duffie
This sounds like a use case for a UUID rather than a serial key.Ashlan
@Ashlan a UUID is twice as big as a BIGINT and four times as big as an INTEGER. If you have lots of rows, a UUID or even worse VARCHAR column will eat up a lot more space than INTEGER or BIGINT.Mesomorphic

I have run into this problem as well, where I was trying to use a UNIQUE column alongside an autoincrementing INTEGER or BIGINT to compress/deduplicate long string values that have low cardinality (many repeat values). Simply executing INSERT ... ON CONFLICT DO NOTHING every time was leading to the sequence having far more gaps than useful values, to the point of risking exhaustion/wraparound of sequence numbers.

I could think of two approaches, each with their own tradeoffs; rough SQL sketches of both follow the list:

  1. LOCK the table in EXCLUSIVE mode, then do a SELECT to check for an existing row, followed by a simple INSERT only if necessary. This will pretty much guarantee no gaps in the sequence, but will bottleneck throughput on that table. You might be able to combine this with a local in-memory cache of existing value->ID mappings to avoid the lock where possible, assuming you never modify the rows once inserted. You could also do a SELECT before locking to see if the value already exists and avoid the lock when rows are more likely than not to already exist (this is also known as "double-checked locking").

  2. Don't lock the table; do a SELECT to check for an existing row, then an INSERT .. ON CONFLICT DO NOTHING RETURNING id if necessary, then a final SELECT only if nothing was returned by the INSERT. This can greatly reduce the number of sequence increments, though lots of concurrently running attempts to insert the same value can still create gaps. This is best done in autocommit mode (not inside a transaction) since transactions (which can't go below READ COMMITTED isolation level in Postgres) can delay the visibility of the inserted row, causing more unnecessary sequence increments.
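
Neither sketch below uses the exact schema from the question; they assume a hypothetical lookup table labels(id, value) purely to illustrate the two approaches:

-- Hypothetical lookup table used by both sketches
CREATE TABLE IF NOT EXISTS labels (
    id    bigserial PRIMARY KEY,
    value text NOT NULL UNIQUE
);

-- Approach 1: serialize writers with a table lock (gapless, but a throughput bottleneck)
BEGIN;
LOCK TABLE labels IN EXCLUSIVE MODE;
INSERT INTO labels (value)
SELECT 'some long string'
WHERE NOT EXISTS (SELECT 1 FROM labels WHERE value = 'some long string');
COMMIT;

-- Approach 2: no table lock; gaps are still possible, but only when concurrent sessions race
-- Step 1: cheap read; in the common case the value already exists
SELECT id FROM labels WHERE value = 'some long string';
-- Step 2: only if step 1 found nothing, attempt the insert;
-- the sequence is consumed only when an insert is actually attempted
INSERT INTO labels (value) VALUES ('some long string')
ON CONFLICT (value) DO NOTHING
RETURNING id;
-- Step 3: if a concurrent session won the race and step 2 returned no row,
-- read back the id it created
SELECT id FROM labels WHERE value = 'some long string';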

Mesomorphic answered 4/6, 2024 at 19:18 Comment(0)

Well, there is a technique that allows you to do this kind of thing. It is called an insert mutex. It is old, old, old, but it works.

https://www.percona.com/blog/2011/11/29/avoiding-auto-increment-holes-on-innodb-with-insert-ignore/

The general idea is that you do an INSERT ... SELECT, and if your values are duplicates the SELECT returns no rows, which of course prevents the INSERT, so the sequence is never incremented. A bit mind-boggling, but perfectly valid and performant.
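
In Postgres, reusing the table from the question, the same idea looks roughly like this (note that without a lock, two concurrent sessions can still race between the existence check and the INSERT, in which case one of them fails on the unique constraint):

INSERT INTO sometable (customer, balance)
SELECT :customer, :balance
WHERE NOT EXISTS (
    SELECT 1 FROM sometable WHERE customer = :customer
);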

This of course completely bypasses ON CONFLICT, but you get back control over the sequence.

Lumberman answered 26/7, 2018 at 16:24 Comment(0)
