MongoDB relationships: embed or reference?

Asked 21/3, 2011 at 2:19 Answered 2/9, 2020 at 13:50

635

I want to design a question structure with some comments. Which relationship should I use for comments: embed or reference?

A question with some comments, like stackoverflow, would have a structure like this:

Question
    title = 'aaa'
    content = 'bbb'
    comments = ???

At first, I thought of using embedded comments (I think embed is recommended in MongoDB), like this:

Question
    title = 'aaa'
    content = 'bbb'
    comments = [ { content = 'xxx', createdAt = 'yyy'}, 
                 { content = 'xxx', createdAt = 'yyy'}, 
                 { content = 'xxx', createdAt = 'yyy'} ]

It is clear, but I'm worried about this case: If I want to edit a specified comment, how do I get its content and its question? There is no _id to let me find one, nor question_ref to let me find its question. (Is there perhaps a way to do this without _id and question_ref?)

Do I have to use ref rather than embed? Do I then have to create a new collection for comments?

Endbrain answered 21/3, 2011 at 2:19 Comment(5)

All Mongo objects are created with an _ID, whether you create the field or not. So technically each comment will still have an ID. – Taimi 10/1, 2014 at 5:31

@RobbieGuilfoyle not true-- see https://mcmap.net/q/65175/-mongodb-embedded-objects-have-no-id-null-value – Beattie 14/5, 2014 at 20:57

What he maybe means is that all mongoose objects are created with an _id for those who use this framework – see mongoose subdocs – Coaster 9/12, 2016 at 22:31

A very good book for learning MongoDB relationships is "MongoDB Applied Design Patterns" (O'Reilly). Chapter one discusses this exact decision: to embed or to reference. – Douglass 25/3, 2019 at 2:45

Thanks a lot for the reference, @FelipeToledo. The book explains, with examples, how to weigh which approach leads to which advantages and drawbacks. Anyone new to NoSQL/MongoDB would do well to go through the book for clarity on these issues! – Ehrenberg 18/4, 2022 at 11:29

883

This is more an art than a science. The Mongo Documentation on Schemas is a good reference, but here are some things to consider:

Put as much in as possible

The joy of a Document database is that it eliminates lots of Joins. Your first instinct should be to place as much in a single document as you can. Because MongoDB documents have structure, and because you can efficiently query within that structure (this means that you can take the part of the document that you need, so document size shouldn't worry you much) there is no immediate need to normalize data like you would in SQL. In particular any data that is not useful apart from its parent document should be part of the same document.

Separate data that can be referred to from multiple places into its own collection

This is not so much a "storage space" issue as it is a "data consistency" issue. If many records will refer to the same data it is more efficient and less error prone to update a single record and keep references to it in other places.
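
A minimal sketch of this point, assuming hypothetical `users` and `questions` collections (the names and fields are illustrative and not from the original post). Plain JavaScript arrays stand in for the collections so the example runs without a server. Each comment stores only a reference to its author, so a rename touches one record:

```javascript
// Hypothetical in-memory stand-ins for two collections.
const users = [
  { _id: "u1", name: "alice", avatar: "/img/alice.png" },
];

const questions = [
  {
    _id: "q1",
    title: "aaa",
    comments: [
      // Each comment stores a reference (authorId), not a copy of the name.
      { content: "xxx", authorId: "u1" },
      { content: "yyy", authorId: "u1" },
    ],
  },
];

// Resolve the references at read time (an application-side "join").
function commentsWithAuthors(question, userList) {
  const byId = new Map(userList.map((u) => [u._id, u]));
  return question.comments.map((c) => ({
    content: c.content,
    author: byId.get(c.authorId).name,
  }));
}

// Renaming the user is a single change; every comment sees the new name.
users[0].name = "alice2";
const resolved = commentsWithAuthors(questions[0], users);
console.log(resolved);
```

Had the name been copied into every comment instead, the rename would need to find and rewrite each copy.
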

Document size considerations

MongoDB imposes a 4 MB (16 MB since version 1.8) size limit on a single document. In a world of gigabytes of data this sounds small, but it is also 30 thousand tweets, 250 typical Stack Overflow answers, or 20 Flickr photos. On the other hand, this is far more information than one might want to present at one time on a typical web page. First consider what will make your queries easier. In many cases concern about document sizes will be premature optimization.
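
The rough arithmetic behind those figures can be sketched as follows. The per-item sizes are assumptions chosen for illustration (560 bytes for a tweet, i.e. 140 characters at a worst case of 4 bytes each, and a hypothetical 64 KiB for a long answer), not measurements:

```javascript
// MongoDB's documented limit is 16 MiB per BSON document.
const MAX_DOC_BYTES = 16 * 1024 * 1024;

// Assumed average item sizes in bytes (illustrative guesses only).
const assumedSizes = {
  tweet: 560,        // 140 characters * 4 bytes, worst case
  answer: 64 * 1024, // hypothetical long answer
};

// How many items of a given size fit in one document, ignoring BSON overhead.
function itemsPerDocument(itemBytes) {
  return Math.floor(MAX_DOC_BYTES / itemBytes);
}

const tweetsPerDoc = itemsPerDocument(assumedSizes.tweet);
const answersPerDoc = itemsPerDocument(assumedSizes.answer);
console.log(tweetsPerDoc, answersPerDoc); // 29959 256
```

Field names and BSON type overhead are ignored here, so real capacity would be somewhat lower.
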

Complex data structures

MongoDB can store arbitrarily deep nested data structures, but cannot search them efficiently. If your data forms a tree, forest or graph, you effectively need to store each node and its edges in a separate document. (Note that there are data stores specifically designed for this type of data that one should consider as well.)
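
One common way to store each node in its own document is the parent-reference layout. The sketch below uses an in-memory array as a stand-in for a hypothetical `nodes` collection; the field name `parent` is an assumption made for this example:

```javascript
// Each node is its own document and stores a reference to its parent.
const nodes = [
  { _id: "root", parent: null },
  { _id: "a", parent: "root" },
  { _id: "b", parent: "root" },
  { _id: "a1", parent: "a" },
];

// Equivalent in spirit to db.nodes.find({ parent: id }).
function childrenOf(id) {
  return nodes.filter((n) => n.parent === id).map((n) => n._id);
}

// Breadth-first walk of the subtree below `id`.
function descendantsOf(id) {
  const found = [];
  const queue = [id];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const child of childrenOf(current)) {
      found.push(child);
      queue.push(child);
    }
  }
  return found;
}

const subtree = descendantsOf("root");
console.log(subtree); // [ 'a', 'b', 'a1' ]
```

Against a real database each `childrenOf` call would be a query, so an index on the parent field matters for larger trees.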

It has also been pointed out that it is impossible to return a subset of elements in a document. If you need to pick and choose a few bits of each document, it will be easier to separate them out.

Data Consistency

MongoDB makes a trade-off between efficiency and consistency. The rule is that changes to a single document are always atomic, while updates to multiple documents should never be assumed to be atomic. There is also no way to "lock" a record on the server (you can build this into the client's logic using, for example, a "lock" field). When you design your schema, consider how you will keep your data consistent. Generally, the more that you keep in a document the better.

For what you are describing, I would embed the comments and give each comment an id field holding an ObjectID. The ObjectID has a timestamp embedded in it, so you can use that instead of a createdAt field if you like.
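
With an id on each embedded comment, a single comment can be edited in place using the positional `$` update operator. The sketch below only builds the two arguments for `db.questions.updateOne(filter, update)`; the helper name `editCommentSpec` is hypothetical, and plain strings stand in for ObjectID values so the example runs without a driver:

```javascript
// Shape of the stored document (string ids stand in for ObjectIDs):
// {
//   _id: "q1", title: "aaa", content: "bbb",
//   comments: [ { _id: "c1", content: "xxx" }, { _id: "c2", content: "yyy" } ]
// }

// Hypothetical helper: builds the arguments for
//   db.questions.updateOne(filter, update)
function editCommentSpec(questionId, commentId, newContent) {
  return {
    // Match the question and the array element carrying this comment id.
    filter: { _id: questionId, "comments._id": commentId },
    // "$" refers to the first array element matched by the filter.
    update: { $set: { "comments.$.content": newContent } },
  };
}

const spec = editCommentSpec("q1", "c2", "edited text");
console.log(JSON.stringify(spec, null, 2));
```

Because the change stays inside one document, it is applied atomically, which fits the consistency rule described above.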

Fra answered 21/3, 2011 at 4:55 Comment(13)

I'd like to add to the OP question: My comments model contains the user name and link to his avatar. What would be the best approach, considering a user can modify his name/avatar? – Foretopsail 5/2, 2013 at 9:36

user, I'm not sure what "link" means in this context. I think I would embed if possible. – Fra 6/2, 2013 at 0:48

Regarding 'Complex data structures', it seems it is possible to return a subset of elements in a document using the aggregation framework (try $unwind). – Yardley 23/9, 2013 at 10:33

Errr, this technique was either not possible or not widely known in MongoDB at the beginning of 2012. Given the popularity of this question, I would encourage you to write your own updated answer. I'm afraid I've stepped away from active development on MongoDB and I am not in a good position to address your comment within my original post. – Fra 27/9, 2013 at 1:51

16MB = 30 million tweets? This means about 0.5 bytes per tweet?! – Sentimentality 15/11, 2014 at 19:13

It looks like they were off by a factor of 10^3. Assuming tweets are Unicode, worst case each character is 4 bytes. A tweet is 140 characters. This means a tweet is roughly 560 bytes. There are 16000000 bytes in 16 MB. 1.6e7/560 = 28,571.4286 – Adolescent 8/9, 2017 at 15:11

Yes, it appears I was off by a factor of 1000 and some people find this important. I will edit the post. WRT 560 bytes per tweet: when I wrote this in 2011, Twitter was still tied to text messages and Ruby 1.4 strings; in other words, still ASCII chars only. – Fra 8/9, 2017 at 18:52

Can I keep access tokens to the same user's collection, or should I create a new collection for keeping the access tokens. What is the good practice? – Torn 2/9, 2018 at 12:12

Thanks for your answer. Can you provide an example for the "Separate data that can be referred to from multiple places into its own collection" and the "Complex data structures" sections, please? – Onionskin 1/2, 2019 at 0:54

I don't understand why people even use NoSQL with so many limitations. You can never build anything fairly complex in NoSQL, and there will always be inefficiency in it if you do. – Awhile 12/7, 2021 at 5:59

@Abhishek Choudhary, relational databases impose their own inefficiencies. Their inability to leverage domain knowledge about the data they store can lead to highly complex and slow queries for irregular data. Also remember that underneath every SQL database is a non-structured datastore. It all just depends on what you want the datastore to do and what you want to handle yourself. – Fra 12/7, 2021 at 16:46

@John F. Miller That depends a lot on how you design the database, and even the worst SQL design will be better than the equivalent NoSQL one. Maybe not in efficiency, but there are other, more important requirements. The lack of foreign keys and joins makes it useless for most purposes: basic operations require multiple inserts or deletions, and data is often repeated. It is not useful for anything beyond basic key-value data. – Awhile 13/7, 2021 at 7:42

@JohnF.Miller That's a pretty awesome explanation considering all the use cases. Just curious about the Complex Data Structure point: what if we have a deeply nested document in which another collection (call it Collection B) is embedded at each node or level, and in the future we modify Collection B? How can we maintain consistency throughout, or what approach should one opt for in this case? Thanks in advance!! – Cannell 27/1, 2022 at 17:46

In general, embed is good if you have one-to-one or one-to-many relationships between entities, and reference is good if you have many-to-many relationships.

Cirillo answered 13/1, 2015 at 2:19 Comment(5)

can you please add a reference link? Thanks. – Neral 25/11, 2015 at 9:45

How do you find a specific comment with this design of one to many? – Raddled 9/8, 2019 at 1:58

coderwall.com/p/px3c7g/… – Hellenism 22/4, 2020 at 8:23

docs.mongodb.com/manual/tutorial/… @Neral – Homeomorphism 22/8, 2020 at 5:10

Embeddings are not the way to go in the one-to-many if the many in this case is a large number. In that case reference or partial embeddings should be used instead – Sonasonant 30/4, 2021 at 5:45

I came across this small presentation while researching this question on my own. I was surprised at how well it was laid out, both the info and the presentation of it.

http://openmymind.net/Multiple-Collections-Versus-Embedded-Documents

It summarized:

As a general rule, if you have a lot of [child documents] or if they are large, a separate collection might be best.

Smaller and/or fewer documents tend to be a natural fit for embedding.

Dicentra answered 2/6, 2016 at 15:4 Comment(4)

How much is a lot? 3? 10? 100? What's large? 1kb? 1MB? 3 fields? 20 fields? What is smaller / fewer? – Hadria 24/10, 2017 at 13:7

That's a good question, and one I don't have a specific answer for. The same presentation included a slide that said "A document, including all its embedded documents and arrays, cannot exceed 16MB", so that could be your cutoff, or just go with what seems reasonable/comfortable for your specific situation. In my current project, the majority of embedded documents are for 1:1 relationships, or 1:many where the embedded documents are really simple. – Dicentra 24/10, 2017 at 21:1

See also the current top comment by @john-f-miller, which while also not providing specific numbers for a threshold does contain some additional pointers that should help guide your decision. – Dicentra 24/10, 2017 at 21:5

Have a look at the below link from official Mongo website. It gives great, clear insight and describes more explicitly how much is 'a lot'. For example:

If there are more than a couple of hundred documents on the "many" side, don't embed them; if there are more than a few thousand documents on the "many" side, don't use an array of ObjectID references.

mongodb.com/developer/article/… – Comber 7/1, 2022 at 10:56
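
The quoted guideline can be sketched as a small decision function. The numeric cut-offs below are assumptions standing in for "a couple of hundred" and "a few thousand", since the guideline gives no exact figures, and the function name is hypothetical:

```javascript
// Assumed cut-offs; tune them for your own data and access patterns.
const EMBED_LIMIT = 200;      // "a couple of hundred"
const REF_ARRAY_LIMIT = 2000; // "a few thousand"

// Suggests a modelling approach from the expected size of the "many" side.
function suggestModel(expectedChildren) {
  if (expectedChildren <= EMBED_LIMIT) {
    return "embed";
  }
  if (expectedChildren <= REF_ARRAY_LIMIT) {
    return "array of references in parent";
  }
  // Beyond that, store the parent's id on each child instead.
  return "parent reference in each child";
}

console.log(suggestModel(50));      // embed
console.log(suggestModel(1000));    // array of references in parent
console.log(suggestModel(1000000)); // parent reference in each child
```

Count is only one input; how the data is read and how large each child is matter as well.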

Actually, I'm quite curious why nobody spoke about the UML specifications. A rule of thumb is that if you have an aggregation, then you should use references. But if it is a composition, then the coupling is stronger, and you should use embedded documents.

And you will quickly understand why it is logical. If an object can exist independently of the parent, then you will want to access it even if the parent doesn't exist. As you just can't embed it in a non-existing parent, you have to make it live in its own data structure. And if a parent exists, just link them together by adding a ref of the object in the parent.

Not sure what the difference between the two relationships is? Here is a link explaining them: Aggregation vs Composition in UML

Eoin answered 5/11, 2018 at 6:53 Comment(3)

Why -1 ? Please give an explanation that would clarify the reason – Eoin 18/2, 2019 at 16:19

Your view about embedded and references actually gave me one more strong point to defend my view in the future. But in some cases if you are using composition and embedding like you said, the memory usage will increase for large docs even if we use projections to limit the fields. So, it is not entirely based on relationships. To actually increase the performance of read queries by avoiding reading whole doc, we can use references even though the design has composition. Maybe that's why -1 I guess. – Adan 14/10, 2020 at 16:53

Yes, you're right, one should also base his strategy depending on how he's going to retrieve the data, and the size of the embedded documents, +1 – Eoin 17/10, 2020 at 21:29

Well, I'm a bit late but still would like to share my way of schema creation.

I have schemas for everything that can be described by a word, like you would do it in the classical OOP.

E.G.

Comment
Account
User
Blogpost
...

Every schema can be saved as a Document or Subdocument, so I declare this for each schema.

Document:

Can be used as a reference. (E.g. the user made a comment -> comment has a "made by" reference to user)
Is a "Root" in your application. (E.g. the blogpost -> there is a page about the blogpost)

Subdocument:

Can only be used once / is never a reference. (E.g. Comment is saved in the blogpost)
Is never a "Root" in your application. (The comment just shows up in the blogpost page but the page is still about the blogpost)

Insignificant answered 28/7, 2014 at 9:18 Comment(0)

If I want to edit a specified comment, how to get its content and its question?

You can query by sub-document: db.question.find({'comments.content' : 'xxx'}).

This will return the whole Question document. To edit the specified comment, you then have to find the comment on the client, make the edit and save that back to the DB.

In general, if your document contains an array of objects, you'll find that those sub-objects will need to be modified client side.

Calebcaledonia answered 21/3, 2011 at 17:19 Comment(4)

this won't work if two comments have identical contents. one might argue that we could also add author to the search query, which still wouldn't work if the author made two identical comments with same content – Boxer 24/7, 2015 at 22:45

@SteelBrain: if he had kept the comment index, dot notation might help. see https://mcmap.net/q/63959/-mongodb-relationships-embed-or-reference – Perren 22/10, 2015 at 15:11

I don't understand how this answer has 34 upvotes. The second multiple people comment the same thing, the whole system would break. This is an absolutely terrible design and should never be used. The way @user does it is the way to go – Gammadion 23/3, 2017 at 9:30

@Gammadion So what's the recommended way to fetch such comments? – Piper 8/4, 2021 at 9:23

Yes, we can use references in a document and populate the referenced documents, much like joins in SQL. MongoDB has no joins for mapping one-to-many relationships between documents; instead, we can use Mongoose's populate to cover this scenario.

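const mongoose = require('mongoose');
const Schema = mongoose.Schema;

// A person holds references to the stories they wrote.
const personSchema = new Schema({
  _id     : Number,
  name    : String,
  age     : Number,
  stories : [{ type: Schema.Types.ObjectId, ref: 'Story' }]
});

// A story points back at its creator and its fans by Person _id (a Number).
const storySchema = new Schema({
  _creator : { type: Number, ref: 'Person' },
  title    : String,
  fans     : [{ type: Number, ref: 'Person' }]
});

// The model names must match the `ref` values used above.
const Story  = mongoose.model('Story', storySchema);
const Person = mongoose.model('Person', personSchema);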

Population is the process of automatically replacing the specified paths in the document with the document(s) from other collection(s). We may populate a single document, multiple documents, plain objects, multiple plain objects, or all objects returned from a query.
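
As a conceptual sketch of what population does (not Mongoose's actual implementation), the code below uses in-memory arrays in place of the Person and Story collections and swaps a stored `_creator` id for the document it refers to. The sample data and the helper name `populateCreator` are made up for illustration. With Mongoose itself, the equivalent call would be along the lines of `Story.findOne({ title: ... }).populate('_creator')`:

```javascript
// In-memory stand-ins for the Person and Story collections.
const people = [
  { _id: 0, name: "Aaron", age: 100 },
  { _id: 1, name: "Fan", age: 30 },
];
const stories = [
  { _id: "s1", title: "Once upon a time", _creator: 0, fans: [1] },
];

// Look up the referenced person and substitute it for the stored id.
// Returns a new object; the stored story is left unchanged.
function populateCreator(story, personList) {
  const creator = personList.find((p) => p._id === story._creator) || null;
  return { ...story, _creator: creator };
}

const populated = populateCreator(stories[0], people);
console.log(populated._creator.name); // Aaron
```

The lookup is a second read against another collection, which is why population costs an extra round trip compared with an embedded document.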

For more information, see: http://mongoosejs.com/docs/populate.html

Iota answered 18/9, 2014 at 8:45 Comment(1)

Mongoose will issue a separate request for each populated field. This is different from SQL JOINs, which are performed on the server. It incurs extra traffic between the app server and the MongoDB server. Again, you might consider this when you're optimizing. Nevertheless, your answer is still correct. – Carboniferous 2/12, 2015 at 14:46

I know this is quite old, but if you are looking for the answer to the OP's question on how to return only the specified comment, you can use the positional $ projection operator like this:

db.question.find({'comments.content': 'xxx'}, {'comments.$': true})

Burkhard answered 25/9, 2013 at 20:16 Comment(2)

@SteelBrain: Well played sir, well played. – Monostrophe 7/8, 2018 at 19:53

MongoDB gives you the freedom to be schema-less, and this feature can cause pain in the long term if it is not thought through or planned well.

There are two options: embed or reference. I will not go through the definitions, as the answers above have defined them well.

When embedding, you should answer one question: is your embedded document going to grow, and if so, by how much? (Remember there is a limit of 16 MB per document.) If you have something like comments on a post, what is the limit on the comment count if that post goes viral and people start adding comments? In such cases, a reference could be a better option (but even an array of references can grow and reach the 16 MB limit).

So how do you balance it? The answer is a combination of different patterns. Check these links, and create your own mix and match based on your use case.

https://www.mongodb.com/blog/post/building-with-patterns-a-summary

https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1

Aspire answered 2/9, 2020 at 13:50 Comment(1)

That's a good rule of thumb, +1. If you have a lot of related data like comments, there can be millions of them and you don't want to show them all, so it's better to store them in a post_comments collection or something like that. – Nolly 23/3, 2021 at 9:8

If I want to edit a specified comment, how do I get its content and its question?

If you had kept track of the number of comments and the index of the comment you wanted to alter, you could use the dot operator (SO example).

For example:

db.questions.update(
    {
        "title": "aaa"
    },
    {
        // $set changes only this field; without it the update would not
        // target the single comment
        "$set": { "comments.0.content": "new text" }
    }
)

(as another way to edit the comments inside the question)
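
When the index is only known at run time, the dotted path can be built with a computed key. This is a sketch under the same assumptions as the update above (a `comments` array whose elements have a `content` field); the helper name `setCommentByIndex` is hypothetical:

```javascript
// Hypothetical helper: builds the update document for
//   db.questions.updateOne({ title: "aaa" }, update)
function setCommentByIndex(index, newContent) {
  if (!Number.isInteger(index) || index < 0) {
    throw new RangeError("index must be a non-negative integer");
  }
  // The computed key yields a path such as "comments.0.content".
  return { $set: { [`comments.${index}.content`]: newContent } };
}

const update = setCommentByIndex(0, "new text");
console.log(JSON.stringify(update));
```

Array positions can shift if another client inserts or removes comments in the meantime, so an index-based edit is safest when comments are only ever appended.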

Perren answered 22/10, 2015 at 15:10 Comment(0)
