MongoDB database schema design

I have a website with 500k users (running on SQL Server 2008), and I now want to add activity streams for users and their friends. After testing a few approaches on SQL Server it became apparent that an RDBMS is not a good fit for this kind of feature: it was slow even after I heavily de-normalized my data. So, after looking at other NoSQL solutions, I've figured that I can use MongoDB for this. I'll base the data structure on the activitystrea.ms JSON specification for activity streams.

So my question is: what would be the best schema design for an activity stream in MongoDB? With this many users you can pretty much predict that the workload will be very write-heavy, hence my choice of MongoDB and its great write performance. I've thought about three types of structures; please tell me if these make sense or whether I should use other schema patterns.

1 - Store each activity with all friends/followers in this pattern:

 

    {
        _id: 'activ123',
        actor: {
            id: 'person1'
        },
        verb: 'follow',
        object: {
            objecttype: 'person',
            id: 'person2'
        },
        updatedon: new Date(),
        consumers: [
            'person3', 'person4', 'person5', 'person6' // ... and so on
        ]
    }
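
For what it's worth, here is a minimal sketch of what the feed read could look like under this first schema. The collection name activities and the index are my assumptions, not something given in the question:

    // Compound multikey index so a consumer's feed can be read and sorted efficiently.
    db.activities.createIndex({ consumers: 1, updatedon: -1 });

    // Latest 20 feed items for person3, newest first.
    db.activities.find({ consumers: 'person3' })
                 .sort({ updatedon: -1 })
                 .limit(20);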

2 - Second design: a fan-out collection named activity_stream_fanout, with one document per consumer:


    {
        _id: 'activ_fanout_123',
        personId: 'person3',
        activities: [
            {
                _id: 'activ123',
                actor: {
                    id: 'person1'
                },
                verb: 'follow',
                object: {
                    objecttype: 'person',
                    id: 'person2'
                },
                updatedon: new Date()
            }
            // , { ... second activity for this person ... }
        ]
    }
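
As a rough sketch of how writes and reads could work against this fan-out collection (the update shape and the upsert are my assumptions, not part of the question):

    // The activity to be fanned out to each consumer's document.
    var activity = {
        _id: 'activ123',
        actor: { id: 'person1' },
        verb: 'follow',
        object: { objecttype: 'person', id: 'person2' },
        updatedon: new Date()
    };

    // Append the activity to every consumer's fan-out document, creating it if needed.
    ['person3', 'person4', 'person5'].forEach(function (consumerId) {
        db.activity_stream_fanout.update(
            { personId: consumerId },
            { $push: { activities: activity } },
            { upsert: true }
        );
    });

    // Reading person3's feed is then a single document fetch.
    db.activity_stream_fanout.findOne({ personId: 'person3' });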


3 - This approach would be to store the activity items in one collection, and the consumers in another. In activities, you might have a document like:


    {
        _id: "123",
        actor: { person: "UserABC" },
        verb: "follow",
        object: { person: "someone_else" },
        updatedOn: Date(...)
    }

And then, for followers, I would have the following "notifications" documents:


    { activityId: "123", consumer: "someguy", updatedOn: Date(...) }
    { activityId: "123", consumer: "otherguy", updatedOn: Date(...) }
    { activityId: "123", consumer: "thirdguy", updatedOn: Date(...) } 
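
Assembling a feed under this third layout takes two queries, one against the notifications collection and one against the activities collection. A sketch, with collection names assumed from the description above:

    // 1) Latest notification rows for one consumer.
    var notes = db.notifications.find({ consumer: 'someguy' })
                                .sort({ updatedOn: -1 })
                                .limit(20)
                                .toArray();

    // 2) Fetch the referenced activity documents in a single round trip.
    var ids = notes.map(function (n) { return n.activityId; });
    db.activities.find({ _id: { $in: ids } });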

Your answers are greatly appreciated.

Lentiginous answered 6/6, 2012 at 17:13 Comment(0)

I'd go with the following structure:

  1. Use one collection, Actions, for all actions that happened.

  2. Use another collection, Subscribers, for who follows whom.

  3. Use a third collection, Newsfeed, for each user's news feed; its items are fanned out from the Actions collection (sketches of the document shapes follow below).
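
By way of illustration, and purely as a sketch (all field names here are my assumptions), the three collections could hold documents shaped like this:

    // Actions: one document per thing that happened.
    db.actions.insert({
        _id: 'action123',
        actor: 'person1',
        verb: 'follow',
        object: 'person2',
        createdAt: new Date()
    });

    // Subscribers: who follows whom; indexed by followee for the fan-out lookup.
    db.subscribers.insert({ follower: 'person3', followee: 'person1' });
    db.subscribers.createIndex({ followee: 1 });

    // Newsfeed: one small entry per follower per action, written by the worker.
    db.newsfeed.insert({
        consumer: 'person3',
        actionId: 'action123',
        verb: 'follow',
        createdAt: new Date()
    });
    db.newsfeed.createIndex({ consumer: 1, createdAt: -1 });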

The Newsfeed collection will be populated by a worker process that asynchronously processes new Actions, so news feeds won't populate in real time. I disagree with Geert-Jan that real time is important; I believe most users don't mind even a minute of delay in most (though not all) applications (for real time, I'd choose a completely different architecture).
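
A sketch of what that worker could do for each new action it picks up (how new actions are queued or tailed is an implementation detail left out here):

    // Fan one action out to the news feeds of everyone who follows its actor.
    function fanOut(action) {
        db.subscribers.find({ followee: action.actor }).forEach(function (sub) {
            db.newsfeed.insert({
                consumer:  sub.follower,
                actionId:  action._id,
                verb:      action.verb,
                object:    action.object,
                createdAt: action.createdAt
            });
        });
    }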

If you have a very large number of consumers, the fan-out can take a while, true. On the other hand, putting the consumers right into the object won't work with very large follower counts either, and it will create overly large objects that take up a lot of index space.

Most importantly, however, the fan-out design is much more flexible and allows relevancy scoring, filtering, etc. I have just recently written a blog post about news feed schema design with MongoDB where I explain some of that flexibility in greater detail.

Speaking of flexibility, I'd be careful with the activitystrea.ms spec. It makes sense as a specification for interop between different providers, but I wouldn't store all that verbose information in the database unless you intend to aggregate activities from several applications.

Preciosa answered 7/6, 2012 at 10:52 Comment(6)
Great suggestions. With real-time I didn't mean sub-second; I just meant fast enough that you wouldn't gain a lot from 'batching' multiple user activities as in scenario 2 from the OP. Then again, I'm not familiar with the term 'fan-out' (which the OP's second option seems to refer to, and which you mention as well), so I may not have understood the intentions of option 2 completely. Btw: going to read that blog post, always good to see architectural posts on MongoDB schema design. – Fend
Great read. I've left a comment on your blog with a related question that you might want to read. Thanks. – Fend
Guys, thanks a lot for the suggestions. I've marked @Preciosa's post as the answer as it does make sense. I'll read your blog and see where it takes me. Again, thanks a lot for all your suggestions. – Lentiginous
@Preciosa: What scares me in this design is the fact that the fanned-out "Newsfeed" collection will grow extremely fast. Say we have just 1,000 registered users, each followed by 10 users and each making 10 actions a day; then every 10 days the "Newsfeed" collection grows by 1,000,000 records. Please tell me how to deal with that. – Chadchadabe
@mnemosyn: I'd also be interested in your take on the potential "huge-collection" problem pointed out above by oyatek. Any experiences/observations you can share in the meantime? – Mensurable
Guys, it seems I've missed your comments :-( It makes sense to remove old posts - most news feeds don't let you go back very far. Facebook uses a lot of caching for its timeline feature. You can't keep a detailed log of all activity of all users in RAM; the hard part is efficiently deleting old stuff. Another approach is to use a pre-allocated, fixed-size list per user instead, again keeping the maximum number of entries per user constant. But even at 1M entries of 80 bytes each on a two-server replica set, that is 160 MB of RAM, or roughly 0.07 US cents per user per month, to keep the last 1000 entries in RAM. – Preciosa
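
To make the size-capping ideas from the comment above concrete, here is a rough sketch of two ways to bound the fanned-out data; both rely on standard MongoDB features (TTL indexes, and $push with $slice, which requires MongoDB 2.4 or newer), while the field names remain my assumptions:

    // Option A: per-entry newsfeed documents that expire automatically after 30 days.
    db.newsfeed.createIndex({ createdAt: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 30 });

    // Option B: one bucket document per user, capped at the latest 1000 entries.
    var newEntry = { actionId: 'action123', verb: 'follow', createdAt: new Date() };
    db.newsfeed.update(
        { consumer: 'person3' },
        { $push: { entries: { $each: [newEntry], $slice: -1000 } } },
        { upsert: true }
    );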

I believe you should look at your access patterns: what queries are you likely to perform most on this data, etc.

To me, the use case that needs to be fastest is pushing a given activity to the 'wall' (in Facebook terms) of each of the 'activity consumers', and doing it immediately when the activity comes in.

From this standpoint (I haven't given it much thought) I'd go with 1, since 2 seems to batch activities for a certain user before processing them, and thereby fails the 'immediate' need for updates. Moreover, I don't see the advantage of 3 over 1 for this use case.

Some enhancements on 1? Ask yourself whether you really need the flexibility of defining an array of consumers for every activity. Is there really a need to specify this at such a fine-grained level? Wouldn't a reference to the 'friends' of the 'actor' suffice instead? (This would save a lot of space in the long run, since I see the consumers array being the bulk of each activity document when consumers typically number in the hundreds.)
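
If you drop the per-activity consumers array as suggested, the feed read turns into a two-step lookup. A rough sketch, assuming a hypothetical friendships collection that is not part of the original question:

    // 1) Which actors does person3 follow?
    var followed = db.friendships.find({ follower: 'person3' })
                                 .map(function (f) { return f.followee; });

    // 2) Pull the latest activities from those actors.
    db.activities.find({ 'actor.id': { $in: followed } })
                 .sort({ updatedon: -1 })
                 .limit(20);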

On a somewhat related note: depending on how you want to implement real-time notifications for these activity streams, it might be worth looking at Pusher (http://pusher.com/) and similar solutions.

hth

Fend answered 6/6, 2012 at 22:14 Comment(0)
