Postgres 9.3: Sharelock issue with simple INSERT

Update: Potential solution below

I have a large corpus of configuration files consisting of key/value pairs that I'm trying to push into a database. A lot of the keys and values are repeated across configuration files, so I'm storing the data using 3 tables: one for all unique keys, one for all unique values, and one listing the key/value pairs for each file.

Problem: I'm using multiple concurrent processes (and therefore connections) to add the raw data into the database. Unfortunately I get a lot of deadlocks when trying to add values to the keys and values tables. I have tried a few different methods of inserting the data (shown below), but always end up with a "deadlock detected" error:

TransactionRollbackError: deadlock detected
DETAIL: Process 26755 waits for ShareLock on transaction 689456; blocked by process 26754. Process 26754 waits for ShareLock on transaction 689467; blocked by process 26755.

I was wondering if someone could shed some light on exactly what could be causing these deadlocks, and possibly point me towards some way of fixing the issue. Looking at the SQL statements I'm using (listed below), I don't really see why there is any co-dependency at all.

Thanks for reading!

Example config file:

example_key this_is_the_value
other_example other_value
third example yet_another_value

Table definitions:

    CREATE TABLE keys (
        id SERIAL PRIMARY KEY,
        hash UUID UNIQUE NOT NULL,
        key TEXT);

    CREATE TABLE "values" (  -- quoted: VALUES is a reserved word in Postgres
        id SERIAL PRIMARY KEY,
        hash UUID UNIQUE NOT NULL,
        value TEXT);

    CREATE TABLE keyvalue_pairs (
        id SERIAL PRIMARY KEY,
        file_id INTEGER REFERENCES filenames,
        key_id INTEGER REFERENCES keys,
        value_id INTEGER REFERENCES "values");

SQL Statements:

Initially I was trying to use this statement to avoid any exceptions:

    WITH s AS (
        SELECT id, hash, key FROM keys
            WHERE hash = 'hash_value'
    ), i AS (
        INSERT INTO keys (hash, key)
        SELECT 'hash_value', 'key_value'
        WHERE NOT EXISTS (SELECT 1 FROM s)
        returning id, hash, key
    )
    SELECT id, hash, key FROM i
    UNION ALL
    SELECT id, hash, key FROM s;

But even something as simple as this causes the deadlocks:

    INSERT INTO keys (hash, key)
        VALUES ('hash_value', 'key_value')
        RETURNING id;
  • In both cases, if an exception is thrown because the inserted hash value is not unique, I use savepoints to roll back the change and a separate statement to select the id I'm after.
  • I'm using hashes for the unique field because some of the keys and values are too long to be indexed.

Full example of the Python code (using psycopg2) with savepoints:

key_value = 'this_key'
hash_val = generate_uuid(key_value)
try:
    cursor.execute(
        '''
        SAVEPOINT duplicate_hash_savepoint;
        INSERT INTO keys (hash, key)
            VALUES (%s, %s)
            RETURNING id;
        ''',
        (hash_val, key_value)
    )

    result = cursor.fetchone()[0]
    cursor.execute('''RELEASE SAVEPOINT duplicate_hash_savepoint''')
    return result
except psycopg2.IntegrityError as e:
    cursor.execute(
        '''
        ROLLBACK TO SAVEPOINT duplicate_hash_savepoint;
        '''
    )

    #TODO: Should ensure that values match and this isn't just
    #a hash collision

    cursor.execute(
        '''
        SELECT id FROM keys WHERE hash=%s LIMIT 1;
        ''',
        (hash_val,)
    )
    return cursor.fetchone()[0]

Update: So I believe I found a hint on another stackexchange site:

Specifically:

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the second updater can proceed with updating the originally found row. If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row.

While I'm still not exactly sure where the co-dependency is, it seems that processing a large number of key/value pairs without committing would likely result in something like this. Sure enough, if I commit after each individual configuration file is added, the deadlocks don't occur.
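
Roughly, the loop now commits once per file; a sketch only (config_files, parse_config, and insert_key_value are placeholders for my file list, parsing code, and the savepoint-guarded insert above):

    for path in config_files:
        for key, value in parse_config(path):        # placeholder parser
            insert_key_value(cursor, key, value)     # savepoint-guarded insert shown above
        connection.commit()                          # release this file's row locks before the next file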

Bowerman answered 6/2, 2015 at 14:1

It looks like you're in this situation:

  1. The table to INSERT into has a primary key (or unique index(es) of any sort).
  2. Several INSERTs into that table are performed within one transaction (as opposed to committing immediately after each one)
  3. The rows to insert come in random order (with regard to the primary key)
  4. The rows are inserted in concurrent transactions.

This situation creates the following opportunity for deadlock:

Assume there are two sessions, each of which has started a transaction.

  1. Session #1: insert row with PK 'A'
  2. Session #2: insert row with PK 'B'
  3. Session #1: try to insert row with PK 'B' => Session #1 is put to wait until Session #2 commits or rollbacks
  4. Session #2: try to insert row with PK 'A' => Session #2 is put to wait for Session #1.

Shortly thereafter, the deadlock detector becomes aware that both sessions are now waiting for each other, and terminates one of them with a fatal deadlock detected error.
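
For illustration, the cross-order scenario can be reproduced with two concurrent sessions. A minimal psycopg2 sketch, assuming a scratch table t(k text primary key) and a local test database (both hypothetical):

    import threading
    import psycopg2
    import psycopg2.extensions

    DSN = 'dbname=test'  # hypothetical scratch database

    def worker(first_key, second_key, barrier):
        conn = psycopg2.connect(DSN)
        cur = conn.cursor()
        try:
            cur.execute('INSERT INTO t (k) VALUES (%s)', (first_key,))
            barrier.wait()  # both sessions now hold the lock on their first key
            cur.execute('INSERT INTO t (k) VALUES (%s)', (second_key,))  # cross order
            conn.commit()
        except psycopg2.extensions.TransactionRollbackError as exc:
            print('deadlock detected:', exc)  # one of the two sessions gets aborted
            conn.rollback()
        finally:
            conn.close()

    barrier = threading.Barrier(2)
    t1 = threading.Thread(target=worker, args=('A', 'B', barrier))
    t2 = threading.Thread(target=worker, args=('B', 'A', barrier))
    t1.start(); t2.start(); t1.join(); t2.join()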

If you're in this scenario, the simplest solution is to COMMIT after a new entry is inserted, before attempting to insert any new row into the table.
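
A minimal sketch of that fix against the questioner's keys table, assuming a psycopg2 connection named conn (hypothetical):

    def insert_key(conn, hash_val, key_value):
        with conn.cursor() as cur:
            cur.execute(
                'INSERT INTO keys (hash, key) VALUES (%s, %s) RETURNING id',
                (hash_val, key_value),
            )
            key_id = cur.fetchone()[0]
        conn.commit()  # release the uniqueness lock right away, not at the end of a batch
        return key_id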

Upstream answered 6/2, 2015 at 15:9
Another alternative would be to insert the values always in the same order. – Antigorite
Could this happen with inserts using an auto-incremented primary key? If so, you can't order them, so you would have to either insert one at a time, or use serializable isolation and retry on database error. Is that right? – Andria
@ChristianLong: it cannot happen with a sequence-based autoincrementing PK because there cannot be conflicting values between sessions. It may still happen if another unique index exists aside from the PK and several sessions want to insert identical values into it in cross-order as shown in the answer. – Highoctane
@ChristianLong: and the solutions you mention do indeed work to circumvent deadlocks. I guess you may also retry-on-deadlock if the deadlocks occur rarely under normal workload. – Highoctane
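
A sketch of the alternatives mentioned in these comments: insert each batch in a consistent order so concurrent transactions acquire locks in the same sequence, and retry if a deadlock does slip through. The helper name is hypothetical; the table follows the question's schema:

    import psycopg2.extensions

    def insert_pairs(conn, pairs):
        """pairs: iterable of (hash_val, key_value) tuples for one batch."""
        ordered = sorted(pairs)  # deterministic lock-acquisition order across workers
        while True:
            try:
                with conn.cursor() as cur:
                    for hash_val, key_value in ordered:
                        cur.execute(
                            'INSERT INTO keys (hash, key) VALUES (%s, %s)',
                            (hash_val, key_value),
                        )
                conn.commit()
                return
            except psycopg2.extensions.TransactionRollbackError:
                conn.rollback()  # rare with consistent ordering; retry the whole batch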

Postgres is known for this type of deadlock, to be honest. I often encounter such problems when different workers update information about interleaving entities. Recently I had a task of importing a big list of scientific paper metadata from multiple json files. I was using parallel processes via joblib to read from several files at the same time. Deadlocks kept occurring on the authors(id bigint primary key, name text) table, because many files contained papers by the same authors and therefore produced inserts with oftentimes the same authors. I was using insert into authors (id,name) values %s on conflict(id) do nothing, but that was not helping. I tried sorting the tuples before sending them to the Postgres server, with little success. What really helped me was keeping a set of known authors in Redis (accessible to all processes):

if not rexecute("sismember", "known_authors", author_id):
   # your logic...
   rexecute("sadd", "known_authors", author_id)

I recommend this approach to everyone. Use Memurai if you are limited to Windows. Sad but true, there are not a lot of other options for Postgres.
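
For reference, a minimal sketch of the same pattern using redis-py alongside psycopg2 (the DSN and the shared Redis instance are assumptions; the authors table and the ON CONFLICT insert are as described above):

    import psycopg2
    import redis

    r = redis.Redis()                          # shared "known authors" set, visible to all workers
    conn = psycopg2.connect('dbname=papers')   # hypothetical DSN

    def add_author(author_id, name):
        # Skip the INSERT entirely if any worker has already stored this author.
        if r.sismember('known_authors', author_id):
            return
        with conn.cursor() as cur:
            cur.execute(
                'INSERT INTO authors (id, name) VALUES (%s, %s) '
                'ON CONFLICT (id) DO NOTHING',
                (author_id, name),
            )
        conn.commit()
        r.sadd('known_authors', author_id)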

Stairs answered 11/10, 2021 at 23:42
