mongodb: insert if not exists
S

10

194

Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.

  • I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
  • I don't want to have duplicate documents.
  • I don't want to remove a document which has previously been saved, but is not in my update.
  • 95% (estimated) of the records are unmodified from day to day.

I am using the Python driver (pymongo).

What I currently do is (pseudo-code):

for each document in update:
      existing_document = collection.find_one(document)
      if not existing_document:
           document['insertion_date'] = now
      else:
           document = existing_document
      document['last_update_date'] = now
      collection.save(document)

My problem is that it is very slow (40 minutes for fewer than 100,000 records, and I have millions of them in the update). I am pretty sure there is something built in for doing this, but the documentation for update() is a bit terse (http://www.mongodb.org/display/DOCS/Updating ).

Can someone advise how to do it faster?

Selfimportant answered 10/5, 2010 at 7:33 Comment(0)
K
193

Sounds like you want to do an upsert. MongoDB has built-in support for this. Pass an extra parameter to your update() call: {upsert:true}. For example:

key = {'key': 'value'}
data = {'key2': 'value2', 'key3': 'value3'}
# In Python, upsert must be passed as a keyword argument
coll.update(key, data, upsert=True)

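This replaces your if-find-else-update block entirely. It will insert if the key doesn't exist and will update if it does. Note that because data contains no update operators, an existing document is replaced by data as a whole (only _id is kept), so the key field itself is lost.

Before:

{"key":"value", "key2":"Ohai."}

After:

{"key2":"value2", "key3":"value3"}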

You can also specify what data you want to write:

data = {"$set":{"key2":"value2"}}

Now the update changes only the value of key2 in the matched document and leaves everything else untouched.
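As a rough sketch of how this maps onto the daily update in the question, the helper below splits one incoming document into a filter and a `$set` update. The function name `build_upsert_args` and the `key_fields` parameter are assumptions made for illustration; they are not part of pymongo. It also assumes the document has at least one field that is not a key field.

```python
def build_upsert_args(document, key_fields):
    """Split a document into (filter, update) arguments for an upsert.

    key_fields is a hypothetical list of the field names that identify a
    document; every other field is written with $set.
    """
    query = {name: document[name] for name in key_fields}
    changes = {name: value for name, value in document.items()
               if name not in key_fields}
    return query, {'$set': changes}


query, update = build_upsert_args(
    {'key': 'value', 'key2': 'value2', 'key3': 'value3'},
    key_fields=['key'],
)
print(query)   # {'key': 'value'}
print(update)  # {'$set': {'key2': 'value2', 'key3': 'value3'}}

# With a real pymongo collection (not created here), the call would be:
# coll.update_one(query, update, upsert=True)
```

Because the key stays in the filter and never appears in the update, the matched document keeps its key field.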

Koniology answered 27/5, 2010 at 18:17 Comment(13)
This is almost what I want! How can I avoid touching the insertion_date field if the object is already present? - Selfimportant
LeMiz: You can pass $set in the data variable of update to pick selectively what to update. - Koniology
Can you please give an example of setting a field only on first insert and not updating it if the document exists? @VanNguyen - Keesee
The first part of your answer is wrong, I think. coll.update will replace data unless you use $set. So After will actually be: {'key2':'value2', 'key3':'value3'} - Sambo
-1 This answer is dangerous. You find by the value of "key" and then you erase "key", so that subsequently you won't be able to find it again. This is a very unlikely use case. - Ahl
You should use the $setOnInsert operator! An upsert will still update the document if the query finds one. - Jacobina
Note: update with upsert: true will throw an error when used with dot notation in 3.0. Use updateOne or updateMany. - Adulteress
How can we achieve this in pymongo? - Billen
For Python, you should pass the upsert argument as a keyword, not a dictionary. Refer to #5056297. - Spermaceti
As of now you use updateOne and replaceOne; update is to be deprecated. - Kathlenekathlin
I want to reject if found, else insert. - Cowen
What are key and key2? - Kelvinkelwen
If the filtering key (query) doesn't have a unique constraint set in the DB, duplicate inserts are possible in case of concurrent upserts. See mongodb.com/docs/manual/reference/method/db.collection.update/… - Forseti
E
103

As of MongoDB 2.4, you can use $setOnInsert (http://docs.mongodb.org/manual/reference/operator/setOnInsert/)

Set insertion_date using $setOnInsert and last_update_date using $set in your upsert command.

To turn your pseudocode into a working example:

from datetime import datetime

now = datetime.utcnow()
for document in update:
    collection.update_one(
        filter={
            '_id': document['_id'],
        },
        update={
            '$setOnInsert': {
                'insertion_date': now,
            },
            '$set': {
                'last_update_date': now,
            },
        },
        upsert=True,
    )
Enrique answered 8/7, 2013 at 18:18 Comment(1)
This is correct: you can check for a document matching a filter, and insert something if it is not found, by using $setOnInsert. Note, though, that there was a bug where you couldn't use $setOnInsert with the _id field; it would say something like "can't Mod the _id field". This was fixed in v2.5.4 or thereabouts. If you see this message, just get the latest version. - Obsolescent
N
27

You could always make a unique index, which causes MongoDB to reject a conflicting save. Consider the following done using the mongodb shell:

> db.getCollection("test").insert ({a:1, b:2, c:3})
> db.getCollection("test").find()
{ "_id" : ObjectId("50c8e35adde18a44f284e7ac"), "a" : 1, "b" : 2, "c" : 3 }
> db.getCollection("test").ensureIndex ({"a" : 1}, {unique: true})
> db.getCollection("test").insert({a:2, b:12, c:13})      // This works
> db.getCollection("test").insert({a:1, b:12, c:13})      // This fails
E11000 duplicate key error index: foo.test.$a_1  dup key: { : 1.0 }
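
A rough pymongo equivalent of the shell session above might look like the sketch below. The two helper names are made up for this example; `create_index` with `unique=True`, `insert_one`, and `pymongo.errors.DuplicateKeyError` are the driver's own.

```python
def ensure_unique_index(collection, field):
    """Create an ascending unique index on `field` (1 means ascending)."""
    return collection.create_index([(field, 1)], unique=True)


def insert_unless_duplicate(collection, document):
    """Insert `document`; return False if the unique index rejects it."""
    # Imported here so that this sketch can be loaded without pymongo installed.
    from pymongo.errors import DuplicateKeyError

    try:
        collection.insert_one(document)
    except DuplicateKeyError:
        return False
    return True


# Usage, assuming `test` is a pymongo collection (not created here):
# ensure_unique_index(test, 'a')
# insert_unless_duplicate(test, {'a': 1, 'b': 2, 'c': 3})
# insert_unless_duplicate(test, {'a': 1, 'b': 12, 'c': 13})
```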
Nevanevada answered 12/12, 2012 at 20:10 Comment(2)
It is now createIndex. - Precarious
Does {"a" : 1}, {unique: true} mean that the content of field a has to be unique and no second item can have the same content? - Precarious
J
18

You may use Upsert with $setOnInsert operator.

db.Table.update({noExist: true}, {"$setOnInsert": {xxxYourDocumentxxx}}, {upsert: true})
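
In pymongo, the same idea can be sketched as follows. The helper name `insert_if_absent` and the `key_fields` parameter are illustrative assumptions; `update_one` and the `upserted_id` attribute of its result are part of the driver. The sketch assumes the document has at least one field besides the key fields, and without a unique index on the key fields, concurrent calls could still insert duplicates.

```python
def insert_if_absent(collection, key_fields, document):
    """Insert `document` unless one with the same key fields exists.

    Returns True if a new document was inserted and False if a match was
    found, in which case the stored document is left unchanged.
    """
    query = {name: document[name] for name in key_fields}
    # On insert, the server copies the filter's equality fields into the
    # new document, so only the remaining fields go into $setOnInsert.
    rest = {name: value for name, value in document.items()
            if name not in key_fields}
    result = collection.update_one(query, {'$setOnInsert': rest}, upsert=True)
    return result.upserted_id is not None


# Usage, assuming `coll` is a pymongo collection (not created here):
# insert_if_absent(coll, ['name'], {'name': 'abc', 'price': 10})
```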
Jacobina answered 14/9, 2014 at 15:53 Comment(3)
docs.mongodb.org/manual/reference/operator/update/setOnInsert/… - Glandular
For anyone querying with pymongo, the third param should just be true or upsert=True, and not a dict. - Heartbeat
setOnInsert is the key flag if you don't want to modify the previous entries. - Premonish
A
7

Summary

  • You have an existing collection of records.
  • You have a set of records that contain updates to the existing records.
  • Some of the updates don't really update anything; they duplicate what you already have.
  • All updates contain the same fields that are already there, just possibly with different values.
  • You want to track when a record was last changed, meaning when a value actually changed.

Note, I'm presuming PyMongo, change to suit your language of choice.

Instructions:

  1. Create the collection with an index with unique=true so you don't get duplicate records.

  2. Iterate over your input records, creating batches of them of 15,000 records or so. For each record in the batch, create a dict consisting of the data you want to insert, presuming each one is going to be a new record. Add the 'created' and 'updated' timestamps to these. Issue this as a batch insert command with the 'ContinueOnError' flag=true, so the insert of everything else happens even if there's a duplicate key in there (which it sounds like there will be). THIS WILL HAPPEN VERY FAST. Bulk inserts rock, I've gotten 15k/second performance levels. Further notes on ContinueOnError, see http://docs.mongodb.org/manual/core/write-operations/

    Record inserts happen VERY fast, so you'll be done with those inserts in no time. Now, it's time to update the relevant records. Do this with a batch retrieval, much faster than one at a time.

  3. Iterate over all your input records again, creating batches of 15K or so. Extract out the keys (best if there's one key, but can't be helped if there isn't). Retrieve this bunch of records from Mongo with a db.collectionNameBlah.find({ field: { $in: [1, 2, 3, ...] } }) query. For each of these records, determine if there's an update, and if so, issue the update, including updating the 'updated' timestamp.

    Unfortunately, we should note, MongoDB 2.4 and below do NOT include a bulk update operation. They're working on that.
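
Step 3 could be sketched in Python roughly as below. The helper names `chunked` and `find_changed`, and the assumption that a single field named by `key` identifies each record, are illustrative; only `collection.find` with an `$in` query comes from pymongo.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def find_changed(collection, records, key, batch_size=15000):
    """Yield (stored, incoming) pairs where an incoming field differs.

    `key` is the name of the single identifying field (an assumption).
    """
    for batch in chunked(records, batch_size):
        incoming = {record[key]: record for record in batch}
        # One round trip per batch instead of one per record.
        cursor = collection.find({key: {'$in': list(incoming)}})
        for stored in cursor:
            new = incoming[stored[key]]
            if any(stored.get(name) != value for name, value in new.items()):
                yield stored, new
```

Only the pairs it yields need an individual update; unchanged records cost no write at all.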

Key Optimization Points:

  • The inserts will vastly speed up your operations in bulk.
  • Retrieving records en masse will speed things up, too.
  • Individual updates are the only possible route now, but 10Gen is working on it. Presumably, this will be in 2.6, though I'm not sure if it will be finished by then, there's a lot of stuff to do (I've been following their Jira system).
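
Bulk write operations have since been added to MongoDB (in 2.6), and PyMongo 3.0 and later expose them through `Collection.bulk_write`. The following is a minimal sketch, not a tuned implementation. It assumes each record carries its own `_id` and does not itself contain `created` or `updated` fields; the helper names and the batch size are illustrative. Here `updated` is refreshed for every record seen, which matches the "last time I saw them" requirement in the question.

```python
from datetime import datetime, timezone


def upsert_spec(document, now):
    """Return (filter, update) for one record, assuming it has an '_id'."""
    fields = {k: v for k, v in document.items() if k != '_id'}
    fields['updated'] = now
    return (
        {'_id': document['_id']},
        {'$set': fields, '$setOnInsert': {'created': now}},
    )


def bulk_upsert(collection, documents, batch_size=1000):
    """Upsert `documents` in unordered batches; return how many were inserted."""
    # Imported here so upsert_spec can be used without pymongo installed.
    from pymongo import UpdateOne

    now = datetime.now(timezone.utc)
    inserted = 0
    batch = []
    for document in documents:
        query, update = upsert_spec(document, now)
        batch.append(UpdateOne(query, update, upsert=True))
        if len(batch) >= batch_size:
            inserted += collection.bulk_write(batch, ordered=False).upserted_count
            batch = []
    if batch:
        inserted += collection.bulk_write(batch, ordered=False).upserted_count
    return inserted
```

Passing `ordered=False` lets the server continue past individual failures, similar in spirit to the ContinueOnError flag mentioned above.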
Amery answered 9/6, 2013 at 16:55 Comment(0)
M
6

I don't think mongodb supports this type of selective upserting. I have the same problem as LeMiz, and using update(criteria, newObj, upsert, multi) doesn't work right when dealing with both a 'created' and 'updated' timestamp. Given the following upsert statement:

update( { "name": "abc" }, 
        { $set: { "created": "2010-07-14 11:11:11", 
                  "updated": "2010-07-14 11:11:11" }},
        true, true ) 

Scenario #1 - document with 'name' of 'abc' does not exist: New document is created with 'name' = 'abc', 'created' = 2010-07-14 11:11:11, and 'updated' = 2010-07-14 11:11:11.

Scenario #2 - document with 'name' of 'abc' already exists with the following: 'name' = 'abc', 'created' = 2010-07-12 09:09:09, and 'updated' = 2010-07-13 10:10:10. After the upsert, the document would now be the same as the result in scenario #1. There's no way to specify in an upsert which fields should be set when inserting and which fields should be left alone when updating.

My solution was to create a unique index on the criteria fields, perform an insert, and immediately afterward perform an update just on the 'updated' field.

Merill answered 15/7, 2010 at 21:38 Comment(0)
U
6

1. Use Update.

Drawing from Van Nguyen's answer above, use update instead of save. This gives you access to the upsert option.

NOTE: This method overrides the entire document when found (From the docs)

var conditions = { name: 'borne' },
    update = { $inc: { visits: 1 } },
    options = { multi: true };

Model.update(conditions, update, options, callback);

function callback(err, numAffected) {
  // numAffected is the number of updated documents
}

1.a. Use $set

If you want to update part of the document, but not the whole thing, you can use the $set operator with update (again, from the docs). So, if you want to set...

var query = { name: 'borne' };
Model.update(query, { name: 'jason borne' }, options, callback)

Send it as...

Model.update(query, { $set: { name: 'jason borne' }}, options, callback)

This helps prevent accidentally overwriting all of your document(s) with { name: 'jason borne' }.

Utilize answered 24/4, 2012 at 17:25 Comment(0)
C
4

In general, using update is better in MongoDB as it will just create the document if it doesn't exist yet, though I'm not sure how to work that with your python adapter.

Second, if you only need to know whether or not that document exists, count(), which returns only a number, will be a better option than find_one, which transfers the whole document from MongoDB and causes unnecessary traffic.
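
With pymongo, an existence check along these lines might be sketched as below. The helper name `exists` is made up for this example. It uses `count_documents`, which, as far as I know, is available from PyMongo 3.7 onward and replaces the older `count()` method.

```python
def exists(collection, query):
    """Return True if at least one document matches `query`."""
    # limit=1 lets the server stop counting after the first match.
    return collection.count_documents(query, limit=1) > 0


# Usage, assuming `coll` is a pymongo collection (not created here):
# if not exists(coll, {'name': 'abc'}):
#     coll.insert_one({'name': 'abc'})
```

Note that a check followed by an insert is still two round trips and is not atomic, so an upsert remains the safer choice when several writers are involved.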

Construction answered 10/5, 2010 at 9:34 Comment(0)
R
4

Method For Pymongo

The Official MongoDB Driver for Python

About 5% of the time you will want to update and overwrite an existing document, while at other times you will want to insert a new one. Both cases are handled by updateOne with the upsert option.

  • 95% (estimated) of the records are unmodified from day to day.

The following solution is taken from this core mongoDB function:

db.collection.updateOne(filter, update, options)

Updates a single document within the collection based on the filter.

In pymongo, the equivalent function is update_one(filter, new_values, upsert=True).

Code Example:

# importing pymongo's MongoClient
from pymongo import MongoClient
 
conn = MongoClient('localhost', 27017)
db = conn.databaseName
 
# Filter on the fields that identify the document
filter = { 'user_id': '4142480', 'question_id': '2801008' }
 
# Set the number of votes
new_values = { "$set": { 'votes': 1400 } }
 
# Using update_one() method for single update with upsert.
db.collectionName.update_one(filter, new_values, upsert=True)

What does upsert=True do?

  • Creates a new document if no documents match the filter.
  • Updates a single document that matches the filter.
Radar answered 15/2, 2022 at 13:54 Comment(0)
B
-1

I do propose the using of await now.

Benzel answered 5/10, 2022 at 0:37 Comment(2)
You know about the commenting privilege which you do not have, so well that you can even put it into words. You are aware of the rule meta.stackexchange.com/questions/214173/… . In that situation please do not decide to misuse a different mechanism (an answer) for something it is not meant for and which you are not allowed yet to do. - Dorita
This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review - Excrement
