Strategies for fast searches of billions of small documents in MongoDB

I need to store several billion small data structures (around 200 bytes each). So far, storing each element as a separate document is working well, with Mongo providing around 10,000 results per second. I'm using a 20-byte hash as the _id for each document, and a single index on the _id field. In testing, this is working for data sets with 5,000,000 documents.
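For concreteness, a minimal pymongo sketch of the layout described above; the SHA-1 digest as the 20-byte _id and the database/collection names are assumptions for illustration:

```python
# Minimal sketch of the layout above: one ~200-byte element per document,
# keyed by a 20-byte hash. SHA-1 and the names 'mydb'/'elements' are
# assumptions, not part of the question.
import hashlib
from pymongo import MongoClient

client = MongoClient()
coll = client.mydb.elements

payload = b"x" * 200                            # stand-in for one element
doc_id = hashlib.sha1(payload).digest()         # 20-byte binary key
coll.insert_one({"_id": doc_id, "d": payload})  # _id is indexed by default

found = coll.find_one({"_id": doc_id})          # point lookup on the _id index
```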

In operation, we will be making around 10,000 requests per second, updating existing documents about 1,000 times per second, and inserting new documents maybe 100 times per second or less.

How can we manage larger data sets, when we cannot store an entire index in RAM? Will MongoDB perform better if we combine several elements into each document -- giving a faster search through the index, at the cost of more data returned by each query?
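As a sketch of what "combining several elements into each document" might look like, here is a hypothetical bucketing scheme keyed on a shared hash prefix (field and collection names are illustrative, not something the question specifies):

```python
# Hypothetical bucketing sketch: several elements stored under one parent
# document keyed by the first 4 bytes of the element hash.
from pymongo import MongoClient

client = MongoClient()
buckets = client.mydb.buckets

def upsert_element(element_hash: bytes, payload: bytes) -> None:
    # One _id index entry per bucket instead of one per element; a query
    # fetches the whole bucket and the client picks out the element.
    buckets.update_one(
        {"_id": element_hash[:4]},
        {"$set": {"e." + element_hash.hex(): payload}},
        upsert=True,
    )

def get_element(element_hash: bytes):
    bucket = buckets.find_one({"_id": element_hash[:4]})
    return (bucket or {}).get("e", {}).get(element_hash.hex())
```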

Unlike other questions on SO, I'm not only interested in how much data we can stuff into Mongo. It can clearly manage the amount of data we're looking at. My concern is how we can maximize the speed of find operations on huge collections, given constrained RAM.

Our searches will tend to be clustered; around 50,000 elements will satisfy about 50% of the queries, but the remaining 50% of queries will be randomly distributed across all of the data. Can we expect a performance gain by moving those 50,000 elements into their own collection, in order to keep a smaller index of the most-used data always in RAM?

Would reducing the size of the _id field from 20 bytes to 8 bytes have a significant impact on MongoDB's indexing speed?

Acoustician answered 19/7, 2013 at 17:58 Comment(5)
As it sounds like you'll have far more documents than RAM, I'd shrink the documents as much as possible to increase the amount of data that can fit in RAM. Make sure field names are just one or two characters, for example. Are you planning on sharding? Moving data into a distinct collection on the same server won't change RAM usage -- as it's OS managed anyway. – Disposure
We will be sharding as the data grows. – Acoustician
Putting the most-used records into a different collection is just an idea, in order to keep the index for this smaller collection in RAM and try to keep it from being swapped out. I think this might be naive, but I'm not certain why or why not. – Acoustician
The memory for indexes is not managed independently of the memory needed for the working set. It's all managed by the OS. If the indexes are frequently used (more so than the data), they should remain in memory. – Disposure
Frankly, it's difficult to tell without a lot more detail, but it seems to me that optimizing MongoDB's large-N query performance isn't the right problem to be solving here. Your question about moving the less-used data into a separate collection is a step in that direction, but I'd take it further: keep the full dataset in mongo, and have independent copies of those 50k high-volume records closer to where you're handling requests. But at this point we're into what your consistency requirements are... aka, "fun territory". – Lulita

A few strategies come to mind:

1) Use a distinct collection/database for the 'hot' documents.

If you know which documents are in the hot set then, yes, moving them into a separate collection will help. This will ensure that the hot documents are co-resident on the same extents/pages. It will also make the index for those documents more likely to be completely in memory, because that index is smaller and is used more often (perhaps in its entirety).

If the hot documents are randomly mixed in with other documents, then you will likely have to fault in more of the leaf elements of the B-Tree index when loading a document, since the probability that another document has recently loaded or accessed the same index block is small.
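A rough sketch of that split; the collection names and the lookup fallback are assumptions:

```python
# Rough sketch of moving the hot set into its own collection so its index
# and extents stay memory-resident. Names are illustrative.
from pymongo import MongoClient

client = MongoClient()
cold = client.mydb.elements
hot = client.mydb.elements_hot

def promote(hot_ids):
    """Copy the frequently-hit documents into the small hot collection."""
    for doc in cold.find({"_id": {"$in": list(hot_ids)}}):
        hot.replace_one({"_id": doc["_id"]}, doc, upsert=True)

def lookup(doc_id):
    # Try the small, likely-resident collection first, then fall back.
    return hot.find_one({"_id": doc_id}) or cold.find_one({"_id": doc_id})
```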

2) Shorten the indexed values.

The shorter the index value, the more values fit into a single B-Tree block. (Note: the field names, i.e. the keys, are not stored in the index entries; only the values are.) More entries per bucket means fewer buckets and less total memory needed for the index, which translates into a higher probability, and a longer lifetime, of those blocks staying in memory. In your example, a 20-byte to 8-byte reduction is better than a 50% saving. If you can convert those 8 bytes to a long, you save a little more, since a long does not carry a string's length prefix (4 bytes) or trailing null (5 bytes of overhead in total).
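One way the 20-to-8-byte conversion could look, storing the truncated hash as a BSON long; truncating the digest raises the (still small) collision probability, which is an assumption to verify against your key space:

```python
# Sketch: derive an 8-byte BSON long from the existing 20-byte hash.
# Truncating the digest is an assumption; check the collision risk for
# your key space before relying on it.
import hashlib
import struct
from bson.int64 import Int64

def short_id(payload: bytes) -> Int64:
    digest = hashlib.sha1(payload).digest()           # full 20-byte hash
    return Int64(struct.unpack(">q", digest[:8])[0])  # first 8 bytes as a signed 64-bit long
```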

3) Shorten the key names.

The shorter the field names the less space each document takes. This has the unfortunate side effect of decreasing readability.
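A quick illustration of the difference; the values are placeholders and bson.encode assumes a reasonably recent pymongo:

```python
# Illustration of the bytes saved by one-character field names.
import bson
from datetime import datetime, timezone

key, data, ts = b"\x00" * 8, b"x" * 200, datetime.now(timezone.utc)

doc_long = {"_id": key, "payload": data, "updated_at": ts}
doc_short = {"_id": key, "p": data, "u": ts}  # same content, smaller BSON

print(len(bson.encode(doc_long)), len(bson.encode(doc_short)))
```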

4) Shard

This is really the only way to keep performance up when reads across the entire corpus exhaust memory and, eventually, disk bandwidth. If you do shard, you will still want to shard the 'hot' collection.
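A sketch of sharding the collection from the earlier examples; the mongos address and the hashed _id shard key are assumptions (a ranged key may fit better if you also adopt monotonically increasing keys, below):

```python
# Sketch: shard the collection on a hashed _id so lookups spread across
# shards. The mongos address and the shard-key choice are assumptions.
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos-host:27017")
mongos.admin.command("enableSharding", "mydb")
mongos.admin.command("shardCollection", "mydb.elements", key={"_id": "hashed"})
```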

5) Adjust the read-ahead on disk to a small value.

Since the 'non-hot' reads load a random document from disk, we really only want to read/fault into memory that document and as few of the documents around it as possible. Most systems will try to read ahead a large block of data once a user reads from a portion of a file, which is exactly the opposite of what we want.

If you see your system faulting a lot, but the resident memory of the mongod process does not approach the system's available memory, you are likely seeing the effect of the OS reading useless data.
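A small, Linux-specific sketch for checking the current readahead of the device backing the data files; the device name is an assumption, and lowering the value itself is typically done with blockdev as root:

```python
# Linux-specific sketch: read the current readahead (in KB) for the block
# device backing the MongoDB data files. The device name is an assumption.
from pathlib import Path

def read_ahead_kb(device: str = "sda") -> int:
    return int(Path(f"/sys/block/{device}/queue/read_ahead_kb").read_text())

# Lowering it (as root) could look like:  blockdev --setra 32 /dev/sda
print(read_ahead_kb())
```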

6) Try to use monotonically increasing values for the keys.

This triggers an optimization (for ObjectId-based indexes): when an index block splits, it does so at 90/10 instead of 50/50. The result is that most of the blocks in your index will be near capacity and you will need fewer of them.

If you only know the 'hot' 50,000 documents after the fact then adding them to the separate collection in index order will also trigger this optimization.
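One possible way to get monotonically increasing keys: a millisecond timestamp in the high bits plus a small per-process counter in the low bits. This is an illustrative scheme, not the ObjectId format itself:

```python
# Illustrative monotonically increasing key: millisecond timestamp in the
# high bits, per-process counter in the low 20 bits. Just a sketch of the
# 'always increasing' property, not ObjectId.
import itertools
import time
from bson.int64 import Int64

_counter = itertools.count()

def next_id() -> Int64:
    ms = int(time.time() * 1000)
    return Int64((ms << 20) | (next(_counter) & 0xFFFFF))
```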

Rob.

Marketplace answered 22/7, 2013 at 1:05 Comment(0)
