How to use RETURNING with ON CONFLICT in PostgreSQL?

I have the following UPSERT in PostgreSQL 9.5:

INSERT INTO chats ("user", "contact", "name") 
           VALUES ($1, $2, $3), 
                  ($2, $1, NULL) 
ON CONFLICT("user", "contact") DO NOTHING
RETURNING id;

If there are no conflicts it returns something like this:

----------
    | id |
----------
  1 | 50 |
----------
  2 | 51 |
----------

But if there are conflicts it doesn't return any rows:

----------
    | id |
----------

I want it to return the newly generated id values when there is no conflict, or the existing id values of the conflicting rows.
Can this be done? If so, how?

Tachograph answered 10/1, 2016 at 17:25 Comment(4)
Use ON CONFLICT UPDATE so there is a change to the row. Then RETURNING will capture it.Bax
@GordonLinoff What if there's nothing to update?Palocz
If there is nothing to update, it means there was no conflict, so it just inserts the new values and returns their idsTachograph
You'll find other ways here. I'd love to know the difference between the two in terms of performance though.Cush

I had exactly the same problem, and I solved it using 'do update' instead of 'do nothing', even though I had nothing to update. In your case it would be something like this:

INSERT INTO chats ("user", "contact", "name") 
       VALUES ($1, $2, $3), 
              ($2, $1, NULL) 
ON CONFLICT("user", "contact") 
DO UPDATE SET 
    name=EXCLUDED.name 
RETURNING id;

This query returns all the rows, regardless of whether they were just inserted or already existed.

Whitherward answered 31/5, 2016 at 10:21 Comment(5)
One problem with this approach is that the primary key's sequence number is incremented upon every conflict (bogus update), which basically means that you may end up with huge gaps in the sequence. Any ideas how to avoid that?Fort
@Mischa: so what? Sequences are never guaranteed to be gapless in the first place and gaps don't matter (and if they do, a sequence is the wrong thing to do)Journalism
I would not advise to use this in most cases. I added an answer why.Panelist
This answer doesn't appear to achieve the DO NOTHING aspect of the original question -- for me it appears to update the non-conflict field (here, "name") for all rows.Insulting
As discussed in the very long answer below, using "Do Update" for a field that has not changed is not a "clean" solution and can cause other problems.Atahualpa

The currently accepted answer seems ok for a single conflict target, few conflicts, small tuples and no triggers. It avoids concurrency issue 1 (see below) with brute force. The simple solution has its appeal; the side effects may be less important.

For all other cases, though, do not update identical rows without need. Even if you see no difference on the surface, there are various side effects:

  • It might fire triggers that should not be fired.

  • It write-locks "innocent" rows, possibly incurring costs for concurrent transactions.

  • It might make the row seem new, though it's old (transaction timestamp).

  • Most importantly, with PostgreSQL's MVCC model UPDATE writes a new row version for every target row, no matter whether the row data changed. This incurs a performance penalty for the UPSERT itself, table bloat, index bloat, performance penalty for subsequent operations on the table, VACUUM cost. A minor effect for few duplicates, but massive for mostly dupes.

Plus, sometimes it is not practical or even possible to use ON CONFLICT DO UPDATE. The manual:

For ON CONFLICT DO UPDATE, a conflict_target must be provided.

A single "conflict target" is not possible if multiple indexes / constraints are involved.
Related solution for multiple, mutually exclusive partial indexes:

Or a way to deal with multiple unique constraints:

Back on the topic, you can achieve (almost) the same without empty updates and side effects. Some of the following solutions also work with ON CONFLICT DO NOTHING (no "conflict target"), to catch all possible conflicts that might arise - which may or may not be desirable.

Without concurrent write load

WITH input_rows(usr, contact, name) AS (
   VALUES
      (text 'foo1', text 'bar1', text 'bob1')  -- type casts in first row
    , ('foo2', 'bar2', 'bob2')
    -- more?
   )
, ins AS (
   INSERT INTO chats (usr, contact, name) 
   SELECT * FROM input_rows
   ON CONFLICT (usr, contact) DO NOTHING
   RETURNING id  --, usr, contact              -- return more columns?
   )
SELECT 'i' AS source                           -- 'i' for 'inserted'
     , id  --, usr, contact                    -- return more columns?
FROM   ins
UNION  ALL
SELECT 's' AS source                           -- 's' for 'selected'
     , c.id  --, usr, contact                  -- return more columns?
FROM   input_rows
JOIN   chats c USING (usr, contact);           -- columns of unique index

The source column is an optional addition to demonstrate how this works. You may actually need it to tell the difference between both cases (another advantage over empty writes).

The final JOIN chats works because newly inserted rows from an attached data-modifying CTE are not yet visible in the underlying table. (All parts of the same SQL statement see the same snapshots of underlying tables.)
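
A tiny demonstration of that visibility rule (the temp table t is made up for this sketch):

CREATE TEMP TABLE t (id int);

WITH ins AS (
   INSERT INTO t VALUES (1)
   RETURNING id
   )
SELECT count(*) AS visible_rows FROM t;  -- returns 0: the row inserted by the CTE is not visible here

SELECT count(*) FROM t;                  -- a new statement sees it: returns 1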

Since the VALUES expression is free-standing (not directly attached to an INSERT) Postgres cannot derive data types from the target columns and you may have to add explicit type casts. The manual:

When VALUES is used in INSERT, the values are all automatically coerced to the data type of the corresponding destination column. When it's used in other contexts, it might be necessary to specify the correct data type. If the entries are all quoted literal constants, coercing the first is sufficient to determine the assumed type for all.

The query itself (not counting the side effects) may be a bit more expensive for few dupes, due to the overhead of the CTE and the additional SELECT (which should be cheap since the perfect index is there by definition - a unique constraint is implemented with an index).

May be (much) faster for many duplicates. The effective cost of additional writes depends on many factors.

But there are fewer side effects and hidden costs in any case. It's most probably cheaper overall.

Attached sequences are still advanced, since default values are filled in before testing for conflicts.
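
A quick demonstration of sequence advancement on skipped inserts (the temp table seq_demo is made up for this sketch):

CREATE TEMP TABLE seq_demo (id serial PRIMARY KEY, val text UNIQUE);

INSERT INTO seq_demo (val) VALUES ('a');                         -- gets id 1
INSERT INTO seq_demo (val) VALUES ('a') ON CONFLICT DO NOTHING;  -- inserts nothing, but the default nextval() already ran
INSERT INTO seq_demo (val) VALUES ('b');                         -- gets id 3, leaving a gap at 2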

About CTEs:

With concurrent write load

Assuming default READ COMMITTED transaction isolation. Related:

The best strategy to defend against race conditions depends on exact requirements, the number and size of rows in the table and in the UPSERTs, the number of concurrent transactions, the likelihood of conflicts, available resources and other factors ...

Concurrency issue 1

If a concurrent transaction has written to a row which your transaction now tries to UPSERT, your transaction has to wait for the other one to finish.

If the other transaction ends with ROLLBACK (or any error, i.e. automatic ROLLBACK), your transaction can proceed normally. Minor possible side effect: gaps in sequential numbers. But no missing rows.

If the other transaction ends normally (implicit or explicit COMMIT), your INSERT will detect a conflict (the UNIQUE index / constraint is absolute) and DO NOTHING, hence also not return the row. (Also cannot lock the row as demonstrated in concurrency issue 2 below, since it's not visible.) The SELECT sees the same snapshot from the start of the query and also cannot return the yet invisible row.

Any such rows are missing from the result set (even though they exist in the underlying table)!

This may be ok as is. Especially if you are not returning rows like in the example and are satisfied knowing the row is there. If that's not good enough, there are various ways around it.

You can check the row count of the output and repeat the statement if it does not match the row count of the input. May be good enough for the rare case. The point is to start a new query (can be in the same transaction), which will then see the newly committed rows.
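
A minimal plpgsql sketch of that retry condition (it only re-runs the statement until the row counts match; capturing the returned ids is left out). The staging table input_staging(usr, contact, name) is a made-up name, input rows are assumed distinct on (usr, contact), and a real implementation would cap the number of retries:

DO
$$
DECLARE
   _expected bigint := (SELECT count(*) FROM input_staging);
   _returned bigint;
BEGIN
   LOOP
      WITH ins AS (
         INSERT INTO chats (usr, contact, name)
         SELECT usr, contact, name FROM input_staging
         ON     CONFLICT (usr, contact) DO NOTHING
         RETURNING id, usr, contact
         )
      SELECT count(*) INTO _returned
      FROM  (
         SELECT id FROM ins
         UNION  ALL
         SELECT c.id
         FROM   input_staging
         JOIN   chats c USING (usr, contact)
         ) sub;

      EXIT WHEN _returned = _expected;  -- every input row accounted for
      -- else: some rows were swallowed by concurrent inserts not yet visible
      -- to this snapshot; the next iteration is a new statement with a fresh snapshot
   END LOOP;
END
$$;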

Or check for missing result rows within the same query and overwrite those with the brute force trick demonstrated in the accepted answer above.

WITH input_rows(usr, contact, name) AS ( ... )  -- see above
, ins AS (
   INSERT INTO chats AS c (usr, contact, name) 
   SELECT * FROM input_rows
   ON     CONFLICT (usr, contact) DO NOTHING
   RETURNING id, usr, contact                   -- we need unique columns for later join
   )
, sel AS (
   SELECT 'i'::"char" AS source                 -- 'i' for 'inserted'
        , id, usr, contact
   FROM   ins
   UNION  ALL
   SELECT 's'::"char" AS source                 -- 's' for 'selected'
        , c.id, usr, contact
   FROM   input_rows
   JOIN   chats c USING (usr, contact)
   )
, ups AS (                                      -- RARE corner case
   INSERT INTO chats AS c (usr, contact, name)  -- another UPSERT, not just UPDATE
   SELECT i.*
   FROM   input_rows i
   LEFT   JOIN sel   s USING (usr, contact)     -- columns of unique index
   WHERE  s.usr IS NULL                         -- missing!
   ON     CONFLICT (usr, contact) DO UPDATE     -- we've asked nicely the 1st time ...
   SET    name = c.name                         -- ... this time we overwrite with old value
   -- SET name = EXCLUDED.name                  -- alternatively overwrite with *new* value
   RETURNING 'u'::"char" AS source              -- 'u' for updated
           , id  --, usr, contact               -- return more columns?
   )
SELECT source, id FROM sel
UNION  ALL
TABLE  ups;

It's like the query above, but we add one more step with the CTE ups before we return the complete result set. That last CTE will do nothing most of the time. Only if rows go missing from the returned result do we use brute force.

Still more overhead. But the more conflicts there are with pre-existing rows, the more likely this is to outperform the simple approach.

One side effect: the 2nd UPSERT writes rows out of order, so it re-introduces the possibility of deadlocks (see below) if three or more transactions writing to the same rows overlap. If that's a problem, you need a different solution - like repeating the whole statement as mentioned above.

Concurrency issue 2

If concurrent transactions can write to involved columns of affected rows, and you have to make sure the rows you found are still there at a later stage in the same transaction, you can lock existing rows cheaply in the CTE ins (which would otherwise go unlocked) with:

...
ON CONFLICT (usr, contact) DO UPDATE
SET name = name WHERE FALSE  -- never executed, but still locks the row
...

And add a locking clause to the SELECT as well, like FOR UPDATE.

This makes competing write operations wait till the end of the transaction, when all locks are released. So be brief.
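
A sketch of what that can look like (the CTE name sel_existing is mine), locking the pre-existing rows with FOR UPDATE in their own CTE:

WITH input_rows(usr, contact, name) AS ( ... )  -- as above
, ins AS (
   INSERT INTO chats AS c (usr, contact, name)
   SELECT * FROM input_rows
   ON     CONFLICT (usr, contact) DO UPDATE
   SET    name = c.name WHERE false             -- never executed, but locks conflicting rows
   RETURNING id
   )
, sel_existing AS (
   SELECT c.id
   FROM   input_rows
   JOIN   chats c USING (usr, contact)
   FOR    UPDATE OF c                           -- lock pre-existing rows as well
   )
SELECT 'i' AS source, id FROM ins
UNION  ALL
SELECT 's' AS source, id FROM sel_existing;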

More details and explanation:

Deadlocks?

Defend against deadlocks by inserting rows in consistent order. See:
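
In the queries above, that can be as simple as giving the inserting SELECT a deterministic ORDER BY (a sketch of the ins CTE only):

, ins AS (
   INSERT INTO chats AS c (usr, contact, name)
   SELECT usr, contact, name
   FROM   input_rows
   ORDER  BY usr, contact                       -- consistent insert order across transactions
   ON     CONFLICT (usr, contact) DO NOTHING
   RETURNING id, usr, contact
   )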

Data types and casts

Existing table as template for data types ...

Explicit type casts for the first row of data in the free-standing VALUES expression may be inconvenient. There are ways around it. You can use any existing relation (table, view, ...) as row template. The target table is the obvious choice for the use case. Input data is coerced to appropriate types automatically, like in the VALUES clause of an INSERT:

WITH input_rows AS (
  (SELECT usr, contact, name FROM chats LIMIT 0)  -- only copies column names and types
   UNION ALL
   VALUES
      ('foo1', 'bar1', 'bob1')  -- no type casts here
    , ('foo2', 'bar2', 'bob2')
   )
   ...

This does not work for some data types. See:

... and names

This also works for all data types.

While inserting into all (leading) columns of the table, you can omit column names. Assuming table chats in the example only consists of the 3 columns used in the UPSERT:

WITH input_rows AS (
   SELECT * FROM (
      VALUES
      ((NULL::chats).*)         -- copies whole row definition
    , ('foo1', 'bar1', 'bob1')  -- no type casts needed
    , ('foo2', 'bar2', 'bob2')
      ) sub
   OFFSET 1
   )
   ...

Aside: don't use reserved words like "user" as identifier. That's a loaded footgun. Use legal, lower-case, unquoted identifiers. I replaced it with usr.
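
A quick illustration of the footgun:

SELECT user;               -- legal SQL: returns the name of the current role, not a column
SELECT "user" FROM chats;  -- quoted: this actually references the column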

Panelist answered 14/2, 2017 at 4:27 Comment(37)
Is this an efficient way to implement a "get or create" operation for a single entry, or should I only use this for inserting batches?Drop
@the_drow: It works for a single entry, too. But you can have it a bit simpler for the corner case. Follow this link I provided above.Panelist
You imply this method will not create gaps in the serials, but it does: INSERT ... ON CONFLICT DO NOTHING does increment the serial each time from what I can seeVicar
@harmic: I was wrong to imply that. Attached sequences are advanced either way. I removed the remark accordingly.Panelist
not that it matter that much, but why is it that serials are incremented? and is there no way to avoid this?Rinderpest
@salient: Like I added above: column default values are filled in before testing for conflicts and sequences are never rolled back, to avoid conflicts with concurrent writes.Panelist
I'm not sure how to remove the optional source part without breaking the code. I just want to return the id. Any ideas?Featherbedding
@sudosensei: Remove 'i' AS source, and 's' AS source, , that's all.Panelist
Incredible. Works like a charm and easy to understand once you look at it carefully. I still wish ON CONFLICT SELECT... where a thing though :)Ylem
@Roshambo: Yep, that would be a lot more elegant. (I added alternatives to explicit type casts while being here.)Panelist
@ErwinBrandstetter I get the following error ERROR: cannot perform INSERT RETURNING on relation "table" HINT: You need an unconditional ON INSERT DO INSTEAD rule with a RETURNING clause.Mckie
Solved: I happened to have a rule on that table. Dropped that and it worked. ThanksMckie
NOTE: Looks like it's not possible to avoid type casts for uuid type columns: ERROR: UNION types uuid and text cannot be matchedMckie
@AndreiStalbe: You are obviously using text as input, while my examples use untyped string literals. Subtly different things. Examples: https://mcmap.net/q/28907/-i-have-a-delete-insert-cte-that-fails-in-a-strange-manner, https://mcmap.net/q/28908/-plpgsql-using-dynamic-table-name-in-declare-statement, https://mcmap.net/q/28909/-interval-days-in-postgresql-with-two-parameters If you still need more explanation, consider a new question ...Panelist
@ErwinBrandstetter am I right in reading "CTE" in your answer as "Common Table Expression"? (If so, perhaps you could edit your answer to clarify that and link to some documentation?)Insulting
@ErwinBrandstetter awesome answer, thanks! Do you come up with any idea to dynamically cast data types in VALUES? I wish I don't have to explicitly cast the to the already know table cols, where I'm inserting into.Hamlen
@suricactus: Did you see the last chapter in the answer above? There are solutions for that issue precisely.Panelist
@ErwinBrandstetter I was too sleepy and missed that part, sorry. Perfect!Hamlen
Incredible. Postgres' creators seem to be torturing users. Why not just simply make returning clause always return values, regardless of whether there were inserts or not?Appropriation
When should the cte solution (posted here) be used over the loop solution (posted on dba.SE)?Exotoxin
@No_name: there are multiple CTE solutions here and I have posted a lot on dba.SE ... Maybe ask your question as question, with all relevant details of your setup and unambiguous references to solutions in question.Panelist
"Concurrency issue 2" in this answer, and dba.stackexchange.com/a/213625 . Unfortunately do not have any specifics on my setup. Just curious at a high level what the tradeoffs are. It seems this question and the question from dba.SE are very similar, and both solutions mention locking the row.Exotoxin
@No_name, the difference: here we make sure a row with the key value is there and stays there for the duration of the transaction. Over there we actually retrieve the row - which cannot be done in the same statement for the corner case.Panelist
Brilliant answer, only enhanced by the term "That's a loaded footgun.", which I shall be using. Thank you.Halcomb
Do I understand correctly that not only a CTE has the concurrency issues, but also running two separate statements in a transaction would have these issues?Hic
@Bergi: Yes. And two separate statements typically have a bigger problem, as the time frame for race conditions is typically bigger between statements than between CTEs and the main command. But a new statement sees a newer snapshot, that can help in certain situations, as described above.Panelist
@ErwinBrandstetter Hm. Doesn't your answer say "Start a new query (can be in the same transaction), which will then see the newly committed rows." though? That sounds to me as if a SELECT statement run after the INSERT … ON CONFLICT DO NOTHING will always see a row (possibly the one from the other transaction though).Hic
@Hic Yes, that's one of those "certain situations" I mentioned. But there are multitple distinct concurrency issues involved (and discussed above) and, generally speaking, two separate statements are more susceptible to race conditions. The cited bit overcomes one particular concurrency issue by starting a new command to see an updated snapshot of underlying tables. The matter is too complex to be discussed in comments ...Panelist
@ErwinBrandstetter …so I've asked #69052505 now. Looking forward to an elaboration in a full answer :-)Hic
See Paul's answer. It has a much simpler and more performant solution. https://mcmap.net/q/28882/-how-to-use-returning-with-on-conflict-in-postgresqlPlanetesimal
@ErwinBrandstetter what about Paul's answer? https://mcmap.net/q/28882/-how-to-use-returning-with-on-conflict-in-postgresqlPlanetesimal
@frc129: Paul's answer is good. If most input rows already exist, it should also perform well. If more input rows are actually inserted (the common case), it gets cheaper to return those directly instead of fetching them from the table again with the following SELECT.Panelist
What about João's answer? https://mcmap.net/q/28882/-how-to-use-returning-with-on-conflict-in-postgresql Is it safe?Crybaby
@Gili: No, it's not safe under concurrent write load. It tries to fetch conflicting rows from the table, but as I explained in detail, that fails within the same query as concurrently written rows are not visible, yet.Panelist
Just wanted to thank you for such an insight into the problem, really appreciate it!Solidify
@ErwinBrandstetter would it be possible to simplify this pattern using Postgres 15's new MERGE command? It seems very similar, but maybe that is wishful thinking on my part. Reason I ask is that repeating this pattern all over the place is painful and hard to read.Allowedly
@HansBrende: I expect not. MERGE cannot replace INSERT ... ON CONFLICT. Citing the manual: "There are a variety of differences and restrictions between the two statement types and they are not interchangeable."Panelist
WITH e AS(
    INSERT INTO chats ("user", "contact", "name") 
           VALUES ($1, $2, $3), 
                  ($2, $1, NULL) 
    ON CONFLICT("user", "contact") DO NOTHING
    RETURNING id
)
SELECT * FROM e
UNION
    SELECT id FROM chats WHERE "user" = $1 AND "contact" = $2;

The main purpose of using ON CONFLICT DO NOTHING is to avoid throwing an error, but it also returns no rows, so we need another SELECT to get the existing id.

In this SQL, if the insert hits a conflict it returns nothing and the second SELECT fetches the existing row; if it inserts successfully, UNION merges the two branches and removes any duplicate ids.

Surbased answered 4/6, 2020 at 22:11 Comment(5)
This solution works well and avoids doing an unnecessary write (update) to the DB!! Nice!Meridethmeridian
Woo... Thanks, mate. Thanks a ton. This worked like a charm. I had a dependency where I need the ids to be inserted in another CTE.Squalor
It doesn't deal with the technical details of the original question - but it actually gets the job done nicely. Simple and adaptable - tnx!Chromogenic
Why would I be getting no rows from this every now and then? Shouldn't it always return something?Euchromatin
@Euchromatin Have you tried running the original query? or a sample query to see why you'd want to use a CTE then a UNION select after?Aylward

Upsert, being an extension of the INSERT query, can be defined with two different behaviors in case of a constraint conflict: DO NOTHING or DO UPDATE.

INSERT INTO upsert_table VALUES (2, 6, 'upserted')
   ON CONFLICT DO NOTHING RETURNING *;

 id | sub_id | status
----+--------+--------
 (0 rows)

Note as well that RETURNING returns nothing, because no tuples have been inserted. Now with DO UPDATE, it is possible to perform operations on the tuple there is a conflict with. First, note that it is important to specify the constraint that will be used to detect the conflict.

INSERT INTO upsert_table VALUES (2, 2, 'inserted')
   ON CONFLICT ON CONSTRAINT upsert_table_sub_id_key
   DO UPDATE SET status = 'upserted' RETURNING *;

 id | sub_id |  status
----+--------+----------
  2 |      2 | upserted
(1 row)
Rave answered 25/5, 2016 at 0:13 Comment(2)
Nice way to always get the affected row id, and know whether it was an insert or upsert. Just what I needed.Challenging
This is still using the "Do Update", which the disadvantages have already been discussed.Atahualpa

For insertions of a single item, I would probably use a coalesce when returning the id:

WITH new_chats AS (
    INSERT INTO chats ("user", "contact", "name")
    VALUES ($1, $2, $3)
    ON CONFLICT("user", "contact") DO NOTHING
    RETURNING id
) SELECT COALESCE(
    (SELECT id FROM new_chats),
    (SELECT id FROM chats WHERE "user" = $1 AND contact = $2)
);

For insertions of multiple items, you can put the values in a temporary WITH clause and reference them later:

WITH chats_values("user", "contact", "name") AS (
    VALUES ($1, $2, $3),
           ($4, $5, $6)
), new_chats AS (
    INSERT INTO chats ("user", "contact", "name")
    SELECT * FROM chats_values
    ON CONFLICT("user", "contact") DO NOTHING
    RETURNING id
) SELECT id
    FROM new_chats
   UNION
  SELECT chats.id
    FROM chats, chats_values
   WHERE chats.user = chats_values.user
     AND chats.contact = chats_values.contact;

Note: as per Erwin's comment, on the off-chance your application will try to 'upsert' the same data concurrently (two workers trying to insert <unique_field> = 1 at the same time), and such data does not exist on the table yet, you should change the isolation level of your transaction before running the 'upsert':

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

In that specific case, one of the two transactions will be aborted. If this happens a lot in your application, you might want to just run two separate queries; otherwise, handling the error and re-running the query is easier and faster.

Adherent answered 28/2, 2020 at 0:14 Comment(6)
Important to rename the COALESCE to id. ... SELECT COALESCE ( ... ) AS idWilie
@Wilie agree that you should add it if you want to have a 100% 'compliant' version of 'INSERT ... RETURNING ...', but most of the time the result goes through a SQL client, which ignores column names. Leaving as is for simplicity.Wivern
Not only less DB impact (avoiding locks & writes), but this COALESCE approach notably boosted performance and is still easy to read. Great solution!Except
My favorite solutionMangonel
Not safe under concurrent write load. You would need to start a new query to see rows that have been committed concurrently.Panelist
Fair enough, but since that's such an edge case, I would rather just set the isolation mode to 'SERIALIZABLE' and handle the error as any other. I'll add a note to my response though.Wivern

If all you want is to upsert a single row

Then you can simplify things rather significantly by using a simple EXISTS check:

WITH
  extant AS (
    SELECT id FROM chats WHERE ("user", "contact") = ($1, $2)
  ),
  inserted AS (
    INSERT INTO chats ("user", "contact", "name")
    SELECT $1, $2, $3
    WHERE NOT EXISTS (SELECT FROM extant)
    RETURNING id
  )
SELECT id FROM inserted
UNION ALL
SELECT id FROM extant

Since there is no ON CONFLICT clause, there is no update – only an insert, and only if necessary. So no unnecessary updates, no unnecessary write locks, no unnecessary sequence increments. No casts required either.

If the write lock was a feature in your use case, you can use SELECT FOR UPDATE in the extant expression.
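
For example (sketch only, adding the locking clause to the CTE from the query above):

  extant AS (
    SELECT id FROM chats WHERE ("user", "contact") = ($1, $2)
    FOR UPDATE
  ),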

And if you need to know whether a new row was inserted, you can add a flag column in the top-level UNION:

SELECT id, TRUE AS inserted FROM inserted
UNION ALL
SELECT id, FALSE FROM extant
Hydroscope answered 11/12, 2021 at 5:58 Comment(7)
I'm getting the error: Caused by: org.postgresql.util.PSQLException: ERROR: INSERT has more target columns than expressions Hint: The insertion source is a row expression containing the same number of columns expected by the INSERT. Did you accidentally use extra parentheses?Suitor
The hint already said exactly what was wrong. Fixed; dunno why I put those parentheses in there.Hydroscope
Is this safe for concurrent writes? Will it fail if 2 workers try to insert the same row if the first query returns empty for both?Metacarpus
@Metacarpus I’m afraid I cannot answer, not for sure, though I am inclined to assume it is not safe (because I don’t see why it would be…). However, I can nevertheless say for sure that if indeed necessary then you can make it safe by doing what I mentioned: use SELECT FOR UPDATE in the extant CTE to take a write lock up front. That will cause a concurrent execution of the same query to block before the attempted select of the existing row, and only allow it to proceed after the earlier execution is done writing to the table.Hydroscope
It does not work for me with concurrent writes and serializable isolation level.Obie
@Obie Is that still true if you use SELECT FOR UPDATE?Hydroscope
I do not remember anymore. I think I tried with it as well.Obie

Building on Erwin's answer above (terrific answer btw, never would have gotten here without it!), this is where I ended up. It solves a couple additional potential problems - it allows for duplicates (which would otherwise throw an error) by doing a select distinct on the input set, and it ensures that the returned IDs exactly match the input set, including the same order and allowing duplicates.

Additionally, and one part that was important for me, it significantly reduces the number of unnecessary sequence advancements using the new_rows CTE to only try inserting the ones that aren't already in there. Considering the possibility of concurrent writes, it will still hit some conflicts in that reduced set, but the later steps will take care of that. In most cases, sequence gaps aren't a big deal, but when you're doing billions of upserts, with a high percentage of conflicts, it can make the difference between using an int or a bigint for the ID.

Despite being big and ugly, it performs extremely well. I tested it extensively with millions of upserts, high concurrency, high numbers of collisions. Rock solid.

I've packaged it as a function, but if that's not what you want it should be easy to see how to translate to pure SQL. I've also changed the example data to something simple.

CREATE TABLE foo
(
  bar varchar PRIMARY KEY,
  id  serial
);
CREATE TYPE ids_type AS (id integer);
CREATE TYPE bars_type AS (bar varchar);

CREATE OR REPLACE FUNCTION upsert_foobars(_vals bars_type[])
  RETURNS SETOF ids_type AS
$$
BEGIN
  RETURN QUERY
    WITH
      all_rows AS (
        SELECT bar, ordinality
        FROM UNNEST(_vals) WITH ORDINALITY
      ),
      dist_rows AS (
        SELECT DISTINCT bar
        FROM all_rows
      ),
      new_rows AS (
        SELECT d.bar
        FROM dist_rows d
             LEFT JOIN foo f USING (bar)
        WHERE f.bar IS NULL
      ),
      ins AS (
        INSERT INTO foo (bar)
          SELECT bar
          FROM new_rows
          ORDER BY bar
          ON CONFLICT DO NOTHING
          RETURNING bar, id
      ),
      sel AS (
        SELECT bar, id
        FROM ins
        UNION ALL
        SELECT f.bar, f.id
        FROM dist_rows
             JOIN foo f USING (bar)
      ),
      ups AS (
        INSERT INTO foo AS f (bar)
          SELECT d.bar
          FROM dist_rows d
               LEFT JOIN sel s USING (bar)
          WHERE s.bar IS NULL
          ORDER BY bar
          ON CONFLICT ON CONSTRAINT foo_pkey DO UPDATE
            SET bar = f.bar
          RETURNING bar, id
      ),
      fin AS (
        SELECT bar, id
        FROM sel
        UNION ALL
        TABLE ups
      )
    SELECT f.id
    FROM all_rows a
         JOIN fin f USING (bar)
    ORDER BY a.ordinality;
END
$$ LANGUAGE plpgsql;
Paction answered 19/3, 2021 at 7:32 Comment(0)

The simplest, most performant solution is

BEGIN;

INSERT INTO chats ("user", contact, name) 
    VALUES ($1, $2, $3), ($2, $1, NULL) 
ON CONFLICT ("user", contact) DO UPDATE
  SET name = excluded.name
  WHERE false
RETURNING id;

SELECT id
FROM chats
WHERE ("user", contact) IN (($1, $2), ($2, $1));

COMMIT;

The DO UPDATE WHERE false locks but does not update the row, which is a feature, not a bug, since it ensures that another transaction cannot delete the row.

Some commenters want to distinguish between updated and created rows.

In that case, simply add txid_current() = xmin AS created to the select (casting as needed, since xmin is of type xid while txid_current() returns bigint).

Grandfather answered 23/2, 2021 at 16:38 Comment(5)
Why do you even need the DO UPDATE..WHERE false and RETURNING clauses if you're just returning the insertion set ids in the SELECT? In PG 12 the RETURNING clause still returns nothing if there's no UPDATE (per the WHERE false clause)Miltiades
@BrDaHa, I explained that: "locks but does not update the row... it ensures that another transaction cannot delete the row"Grandfather
Yes, you said "DO UPDATE WHERE false locks but does not update the row", I get that part. I was asking why the RETURNING clause is there, when it doesn't actually return anything. Is the RETURNING clause is also needed to prevent deletions?Miltiades
@BrDaHa, oh, yes, it's been a long time since I've looked at this, but I think returning is unnecessary.Grandfather
See dba.stackexchange.com/a/212634/4719 if you have multiple unique constraintsCrybaby

I modified the amazing answer by Erwin Brandstetter, which won't increment the sequence, and also won't write-lock any rows. I'm relatively new to PostgreSQL, so please feel free to let me know if you see any drawbacks to this method:

WITH input_rows(usr, contact, name) AS (
   VALUES
      (text 'foo1', text 'bar1', text 'bob1')  -- type casts in first row
    , ('foo2', 'bar2', 'bob2')
    -- more?
   )
, new_rows AS (
   SELECT
     r.usr
     , r.contact
     , r.name
     , c.id IS NOT NULL AS row_exists
   FROM input_rows AS r
   LEFT JOIN chats AS c ON r.usr = c.usr AND r.contact = c.contact
   )
INSERT INTO chats (usr, contact, name)
SELECT usr, contact, name
FROM new_rows
WHERE NOT row_exists
RETURNING id, usr, contact, name

This assumes that the table chats has a unique constraint on columns (usr, contact).

Update: added the suggested revisions from spatar (below). Thanks!

Yet another update, per Revinand comment:

WITH input_rows(usr, contact, name) AS (
   VALUES
      (text 'foo1', text 'bar1', text 'bob1')  -- type casts in first row
    , ('foo2', 'bar2', 'bob2')
    -- more?
   )
, new_rows AS (
   INSERT INTO chats (usr, contact, name)
   SELECT
     r.usr
     , r.contact
     , r.name
   FROM input_rows AS r
   LEFT JOIN chats AS c ON r.usr = c.usr AND r.contact = c.contact
   WHERE c.id IS NULL
   RETURNING id, usr, contact, name
   )
SELECT id, usr, contact, name, 'new' as row_type
FROM new_rows
UNION ALL
SELECT c.id, c.usr, c.contact, c.name, 'update' AS row_type
FROM input_rows AS ir
INNER JOIN chats AS c ON ir.usr = c.usr AND ir.contact = c.contact

I haven't tested the above, but if you find that the newly inserted rows are being returned multiple times, you can either change the UNION ALL to just UNION, or (better) just remove the first query altogether.

Feer answered 8/5, 2020 at 19:50 Comment(6)
Instead of CASE WHEN r.id IS NULL THEN FALSE ELSE TRUE END AS row_exists just write r.id IS NOT NULL as row_exists. Instead of WHERE row_exists=FALSE just write WHERE NOT row_exists.Isoagglutination
Good solution, but it doesn't answer the question. Your solution returns only inserted rowsFlit
@Flit good point; added the full query beneath.Feer
A Drawback is the massive complexity. I'd sooner setup my data so that I do a query based on unique column values, or another record, which references.Kizer
@Kizer this requires another query to the database, which can slow things down. Also, querying for unique values can be extremely expensive, especially if the table isn't indexed well for your needs. That being said, my solution above is also highly dependent on the indexing strategy to perform well, but I would argue is less impacted than running a query to generate unique values.Feer
I've accepted an update to the rows as an alternative. Using a lot of these workarounds, I've cited as a reason why I'd accept the consequences. All I need is a unique constraint and on-conflict update, and my returning statement works. It's < 10 lines versus 40, but also among those 10, it's much easier to make mistakes in the 40-liner, whereas the < 10 line version has very few [if any] non-essential moving parts.Kizer

Update a field to its current value. In this case the conflicting row's data will not change at all, and it will return the id of the conflicting row instead of an empty set.

INSERT INTO chats AS c ("user", "contact", "name") 
       VALUES ($1, $2, $3), 
              ($2, $1, NULL) 
ON CONFLICT("user", "contact") DO UPDATE
SET "user" = c."user"
RETURNING id;
Transport answered 22/2, 2023 at 13:58 Comment(0)
