Bulk insert with multiprocessing using peewee

N

2

6

I'm working on simple html scraper in Python 3.4, using peewee as ORM (great ORM btw!). My script takes a bunch of sites, extract necessary data and save them to the database, however every site is scraped in detached process, to improve performance and saved data should be unique. There can be duplicate data not only between sites, but also on particular site, so I want to store them only once.

Example: Post and Category - many-to-many relation. During scraping, same category appears multiple times in different posts. For the first time I want to save that category to database (create new row). If the same category shows up in different post, I want to bind that post with already created row in db.

My question is - do I have to use atomic updates/inserts (insert one post, save, get_or_create categories, save, insert new rows to many-to-many table, save) or can I use bulk insert somehow? What is the fastest solution to that problem? Maybe some temporary tables shared between processes, which will be bulk insert at the end of work? Im using MySQL db.

Thx for answers and your time

Northnorthwest answered 2/1, 2015 at 18:54 Comment(0)

P

1

You can rely on the database to enforce unique constraints by adding unique=True to fields or multi-column unique indexes. You can also check the docs on get/create and bulk inserts:

Perrin answered 2/1, 2015 at 19:35 Comment(4)

Ok, after couple of tests, i've ended up with normal save on post and category table + bulk insert on post_to_category table. Didn't expect help from peewee guru, thx m8 ;) – Afterworld 4/1, 2015 at 19:37

According to the FAQ, you shouldn't just put a bunch of links on an answer. – Sharl 13/8, 2016 at 16:0

Fail on my part. Multiple processes with sqlite must be handled differently than postgres or MySQL, so that's one consideration. The other would be to use the atomic context manager to wrap inserts in transactions, rolling back and/or retrying on failure. – Perrin 15/8, 2016 at 1:31

I am still not following the answer. If there are multiple records, which I have to insert and I need to check that the record should not exist. Which is the best method? As per what I am thinking, the only option is to get unique keys not matching from the database table and skip those records while inserting. Is there a better way around? – Creatural 24/10, 2016 at 19:30

P

4

Looked for this myself for a while, but found it!

you can use the on_conflict_replace() or on_conflict_ignore() functions to define behaviour for when a record exists in a table that has a uniqueness constraint.

PriceData.insert_many(values).on_conflict_replace().execute()

or

PriceData.insert_many(values).on_conflict_ignore().execute()

More info under "Upsert" here

Preemption answered 5/10, 2021 at 16:13 Comment(0)

P

1