Python-PostgreSQL psycopg2 interface --> executemany

I am currently analyzing a Wikipedia dump file; I am extracting a bunch of data from it with Python and persisting it into a PostgreSQL db. I am always trying to make things go faster, because this file is huge (18 GB). To interface with PostgreSQL I am using psycopg2, but this module seems to mimic many other such DBAPIs.

Anyway, I have a question concerning cursor.executemany(command, values); it seems to me like calling executemany once per 1000 values or so is better than calling cursor.execute(command % value) for each of these 5 million values (please confirm or correct me!).

But, you see, I am using executemany to INSERT 1000 rows into a table which has a UNIQUE integrity constraint; this constraint is not verified in Python beforehand, because that would either require me to SELECT all the time (which seems counterproductive) or require more than 3 GB of RAM. All this to say that I count on Postgres to warn me when my script tries to INSERT an already existing row, by catching the psycopg2.DatabaseError.

When my script detects such a non-UNIQUE INSERT, it calls connection.rollback() (which undoes up to 1000 rows every time, and kind of makes the executemany worthless) and then INSERTs all the values one by one.

Since psycopg2 is so poorly documented (as are so many great modules...), I cannot find an efficient and effective workaround. I have reduced the number of values INSERTed per executemany from 1000 to 100 in order to reduce the likelihood of a non-UNIQUE INSERT per executemany, but I am pretty certain there is a way to just tell psycopg2 to ignore these exceptions or to tell the cursor to continue the executemany.
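
To illustrate, here is a stripped-down sketch of what the script currently does (the table and column names here are made up):

import psycopg2

conn = psycopg2.connect("dbname=wiki")   # hypothetical connection string
cur = conn.cursor()
sql = "INSERT INTO pages (title, length) VALUES (%s, %s)"   # hypothetical table

def insert_batch(rows):
    """Try the whole batch at once; on failure, fall back to row-by-row."""
    try:
        cur.executemany(sql, rows)
        conn.commit()
    except psycopg2.DatabaseError:
        conn.rollback()              # the whole batch is undone here
        for row in rows:
            try:
                cur.execute(sql, row)
                conn.commit()
            except psycopg2.DatabaseError:
                conn.rollback()      # skip the duplicate row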

Basically, this seems like the kind of problem which has a solution so easy and popular, that all I can do is ask in order to learn about it.

Thanks again!

Firewarden answered 28/12, 2008 at 17:51 Comment(11)
I am not sure, but I think executemany just iterates over your list of dictionaries (rows) and calls "insert" on each. So it doesn't make a difference whether you call execute in a loop or call executemany. Just don't call "commit" inside the loop; call it once every 100 or 1000 rows, as appropriate.Ketcham
So it's like: outer loop -> gets the next 1000 rows from the list -> gives them to the inner loop -> execute for each -> inner loop exits -> commit -> outer loop continues until the data runs out. You can try it out on a 100,000-row dataset against executemany and check if it makes a difference.Ketcham
JV, so you are saying that executemany still makes an IPC round trip to PostgreSQL for each INSERT? It's the overhead inherent to IPC that I am hoping to eliminate by using executemany; if it doesn't eliminate this, I don't have enough reason to use it. Thanks, but I still need more convincing! -NickFirewarden
executemany prepares a SQL statement so that it can be executed faster if you need to call it often. And please don't use string interpolation (%) in a database query. Use cursor.execute(sql_stmt, (arg1, arg2, ...)); this escapes your data properly.Marissamarist
OK, thanks for the string interpolation info on cursor.execute. So you are saying that executemany prepares the SQL statement so that the next 5 million calls to it will be faster; it seems to me like cursor.execute does this as well. I am still wondering whether executemany will make just one IPC call?Firewarden
Hey, I was just wondering: if it is indeed the multiple IPC calls that are bothering me, it is only because they block (they do, right?). Do you think I should connect my parser to my db client via a queue and separate them into different threads? This would keep the main program from blocking, wouldn't it? Thanks.Firewarden
Hm, PEP 249 says that .executemany() can also be implemented by calling .execute() multiple times. So I guess it depends on the implementation of psycopg2.Marissamarist
On separating consumer and producer with a queue: this could help somewhat when your main thread has work to do while your db thread is inserting, so that you can parallelize the tasks. Watch out that the data in the queue doesn't grow too much.Marissamarist
psycopg2 may be poorly documented but the official mailing list is very active, very knowledgeable and the actual maintainer is often the first to reply.Antifederalist
@bortzmeyer: yeah, but it would still be nice to have access to a file. I am sure it would reduce the amount of stupid questions that are mailed to these guys!Firewarden
Postgres 9.5 adds upserts, so you could ignore duplicates with INSERT row ON CONFLICT DO NOTHING.Cauline
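
To illustrate that last comment: on PostgreSQL 9.5+ the duplicates can be skipped at insert time with something like this (a sketch reusing the hypothetical pages table from above, assuming a unique constraint on title):

# cur and conn as in the earlier sketch; batch is a list of (title, length) tuples
sql = """INSERT INTO pages (title, length)
         VALUES (%s, %s)
         ON CONFLICT (title) DO NOTHING"""
cur.executemany(sql, batch)   # rows with an already existing title are silently skipped
conn.commit()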

Just copy all the data into a scratch table with the psql \copy command, or use the psycopg2 cursor.copy_from() method. Then:

insert into mytable
select * from (
    select distinct * 
    from scratch
) uniq
where not exists (
    select 1 
    from mytable 
    where mytable.mykey = uniq.mykey
);

This will dedupe the data and run much faster than any combination of individual inserts.
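
A rough psycopg2 sketch of that copy step (the scratch table name, columns, and parsed records are assumptions; copy_from() expects tab-separated rows by default):

import io
import psycopg2

conn = psycopg2.connect("dbname=wiki")    # hypothetical connection string
cur = conn.cursor()

buf = io.StringIO()
for key, value in records:                # records: the (mykey, value) pairs parsed from the dump
    buf.write("%s\t%s\n" % (key, value))
buf.seek(0)                               # rewind before COPY reads it

cur.copy_from(buf, 'scratch', columns=('mykey', 'value'))
cur.execute("""
    INSERT INTO mytable
    SELECT * FROM (SELECT DISTINCT * FROM scratch) uniq
    WHERE NOT EXISTS (
        SELECT 1 FROM mytable WHERE mytable.mykey = uniq.mykey)""")
conn.commit()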

-dg

Jemie answered 15/2, 2009 at 13:13 Comment(0)

I had the same problem and searched here for many days to collect a lot of hints and form a complete solution. Even if the question is outdated, I hope this will be useful to others.

1) Forget about removing indexes/constraints and recreating them later; the benefits are marginal or worse.

2) executemany is better than execute, as it prepares the statement for you. You can get the same result yourself with a command like the following and gain 300% in speed:

# To run only once:
sqlCmd = """PREPARE myInsert (int, timestamp, real, text) AS
   INSERT INTO myBigTable (idNumber, date_obs, result, user)
     SELECT $1, $2, $3, $4 WHERE NOT EXISTS
     (SELECT 1 FROM myBigTable WHERE (idNumber, date_obs, user)=($1, $2, $4));"""
curPG.execute(sqlCmd)
cptInsert = 0   # To let you commit from time to time

#... inside the big loop:
curPG.execute("EXECUTE myInsert(%s,%s,%s,%s);", myNewRecord)
alreadyExists = (curPG.rowcount < 1)
if not alreadyExists:
   cptInsert += 1
   if cptInsert % 10000 == 0:
      conPG.commit()

This dummy table example has a unique constraint on (idNumber, date_obs, user).

3) The best solution is to use copy_from and a TRIGGER to manage the unique key BEFORE INSERT. This gave me a 36x speedup: I started with normal inserts at 500 records/sec, and with "copy" I got over 18,000 records/sec. Sample code in Python with psycopg2:

import StringIO  # Python 2; on Python 3 use io.StringIO instead

ioResult = StringIO.StringIO() # To use a virtual file as a buffer
cptInsert = 0 # To let you commit from time to time - Memory has limitations
#... inside the big loop:
   print >> ioResult, "\t".join(map(str, myNewRecord))
   cptInsert += 1
   if cptInsert % 10000 == 0:
      ioResult = flushCopyBuffer(ioResult, curPG)
#... after the loop:
ioResult = flushCopyBuffer(ioResult, curPG)

def flushCopyBuffer(bufferFile, cursorObj):
   bufferFile.seek(0)   # Crucial detail: rewind the buffer before copy_from reads it
   cursorObj.copy_from(bufferFile, 'myBigTable',
      columns=('idNumber', 'date_obs', 'result', 'user'))
   cursorObj.connection.commit()
   bufferFile.close()
   bufferFile = StringIO.StringIO()
   return bufferFile

That's it for the Python part. Now the PostgreSQL trigger, which avoids a psycopg2.IntegrityError that would otherwise cause all of the COPY command's records to be rejected:

CREATE OR REPLACE FUNCTION chk_exists()
  RETURNS trigger AS $BODY$
DECLARE
    curRec RECORD;
BEGIN
   -- Check whether the record's key already exists or is empty (the file's last line is)
   IF NEW.idNumber IS NULL THEN
      RETURN NULL;
   END IF;
   SELECT INTO curRec * FROM myBigTable
      WHERE (idNumber, date_obs, user) = (NEW.idNumber, NEW.date_obs, NEW.user);
   IF NOT FOUND THEN -- OK keep it
      RETURN NEW;
   ELSE    
      RETURN NULL; -- Oops: discard it (or update the current record instead)
   END IF;
END;
$BODY$ LANGUAGE plpgsql;

Now link this function to the trigger of your table:

CREATE TRIGGER chk_exists_before_insert
   BEFORE INSERT ON myBigTable FOR EACH ROW EXECUTE PROCEDURE chk_exists();

This seems like a lot of work, but PostgreSQL is a very fast beast when it doesn't have to interpret SQL over and over. Have fun.

Anson answered 15/6, 2012 at 23:24 Comment(0)

"When my script detects such a non-UNIQUE INSERT, it connection.rollback() (which makes ups to 1000 rows everytime, and kind of makes the executemany worthless) and then INSERTs all values one by one."

The question doesn't really make a lot of sense.

Does EVERY block of 1,000 rows fail due to non-unique rows?

Does 1 block of 1,000 rows fail (out of 5,000 such blocks)? If so, then executemany helps for 4,999 out of 5,000 and is far from "worthless".

Are you worried about this non-UNIQUE insert? Or do you have actual statistics on how often it happens?

If you've switched from 1,000-row blocks to 100-row blocks, you can -- obviously -- determine whether there's a performance advantage for 1,000-row blocks, 100-row blocks, or 1-row blocks.

Please run the actual program against the actual database with different block sizes and post the numbers.
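
Something along these lines would produce those numbers (a sketch against a hypothetical pages table with dummy data; each run starts from a truncated table so the comparison is fair):

import time
import psycopg2

conn = psycopg2.connect("dbname=wiki")   # hypothetical connection string
cur = conn.cursor()
sql = "INSERT INTO pages (title, length) VALUES (%s, %s)"   # hypothetical table
rows = [("page_%d" % i, i) for i in range(100000)]          # dummy data

for batch_size in (1, 100, 1000):
    cur.execute("TRUNCATE pages")        # start each run from an empty table
    conn.commit()
    start = time.time()
    for i in range(0, len(rows), batch_size):
        cur.executemany(sql, rows[i:i + batch_size])
        conn.commit()
    print("%d-row batches: %.1f s" % (batch_size, time.time() - start))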

Apgar answered 28/12, 2008 at 23:6 Comment(0)

Using a MERGE statement instead of an INSERT would solve your problem.

Ymir answered 24/3, 2009 at 1:32 Comment(1)
PostgreSQL does not support the MERGE statement at all, as of version 8.5 beta 2.Liborio
