MongoDB Aggregation with $sample very slow

Asked 7/6, 2016 at 12:54 Answered 5/12, 2020 at 7:47

There are many ways to select a random document from a MongoDB collection (as discussed in this answer). Comments point out that with MongoDB version >= 3.2, using $sample in the aggregation framework is preferred. However, on a collection with many small documents this seems to be extremely slow.

The following code uses mongoengine to simulate the issue and compare it to the "skip random" method:

import timeit
from random import randint

import mongoengine as mdb

mdb.connect("test-agg")


class ACollection(mdb.Document):
    name = mdb.StringField(unique=True)

    meta = {'indexes': ['name']}


ACollection.drop_collection()

ACollection.objects.insert([ACollection(name="Document {}".format(n)) for n in range(50000)])


def agg():
    doc = list(ACollection.objects.aggregate({"$sample": {'size': 1}}))[0]
    print(doc['name'])

def skip_random():
    n = ACollection.objects.count()
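    # randint is inclusive on both ends, so skip at most n - 1 documents;
    # skipping n would leave nothing to return and raise IndexError.
    doc = ACollection.objects.skip(randint(0, n - 1)).limit(1)[0]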
    print(doc['name'])


if __name__ == '__main__':
    print("agg took {:2.2f}s".format(timeit.timeit(agg, number=1)))
    print("skip_random took {:2.2f}s".format(timeit.timeit(skip_random, number=1)))

The result is:

Document 44551
agg took 21.89s
Document 25800
skip_random took 0.01s

Whenever I've had performance issues with MongoDB in the past, my answer has always been to use the aggregation framework, so I'm surprised $sample is so slow.

Am I missing something here? What is it about this example that is causing the aggregation to take so long?

Colis answered 7/6, 2016 at 12:54 Comment(3)

What MongoDB version are you running? I found that $sample was very slow in 3.2.5, but basically instantaneous in 3.2.7. – Unveiling 7/6, 2016 at 13:46

ah, 3.2.0 - that's going to be it then. yes, this shows that it was a known bug. – Colis 7/6, 2016 at 13:51

Right, but I'm not sure why it was still slow for me with 3.2.5 with a new collection of 1M docs as that was marked as fixed in 3.2.3. – Unveiling 7/6, 2016 at 13:54

This is a result of a known bug in the WiredTiger engine in versions of mongodb < 3.2.3. Upgrading to the latest version should solve this.

Colis answered 13/6, 2016 at 10:47 Comment(1)

We are using MongoDB 4.2.1. $sample is still very slow: a sample size of 87K on a collection of 500 million documents takes 20 minutes. Server config is 16c/240 GB. – Drear 28/11, 2019 at 16:56

I can confirm that nothing has changed in 3.6. The slow $sample problem persists.

A collection of ~40 million small documents, no indexes, Windows Server 2012 x64.

storage:
  wiredTiger.engineConfig.journalCompressor: zlib
  wiredTiger.collectionConfig.blockCompressor: zlib

2018-04-02T02:27:27.743-0700 I COMMAND [conn4] command maps.places command: aggregate { aggregate: "places", pipeline: [ { $sample: { size: 10 } } ], cursor: {}, lsid: { id: UUID("0e846097-eecd-40bb-b47c-d77f1484dd7e") }, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: MULTI_ITERATOR keysExamined:0 docsExamined:0 cursorExhausted:1 numYields:3967 nreturned:10 reslen:550 locks:{ Global: { acquireCount: { r: 7942 } }, Database: { acquireCount: { r: 3971 } }, Collection: { acquireCount: { r: 3971 } } } protocol:op_query 72609ms

I have installed Mongo to try this "modern and performant DBMS" in a serious project. How deeply I am frustrated.

Explain plan is here:

db.command('aggregate', 'places', pipeline=[{"$sample":{"size":10}}], explain=True)

 {'ok': 1.0,
  'stages': [{'$cursor': {'query': {},
    'queryPlanner': {'indexFilterSet': False,
     'namespace': 'maps.places',
     'plannerVersion': 1,
     'rejectedPlans': [],
     'winningPlan': {'stage': 'MULTI_ITERATOR'}}}},
  {'$sampleFromRandomCursor': {'size': 10}}]}
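
An explain result like this one can also be checked programmatically. The following is a minimal sketch, assuming the result has the shape shown above (a 'stages' list in which each entry is a dict keyed by the stage name). `uses_random_cursor` is a hypothetical helper name, not a pymongo function, and stage names may differ between server versions.

```python
def uses_random_cursor(explain_result):
    """Return True if any pipeline stage is $sampleFromRandomCursor.

    Assumes the explain output has a 'stages' list in which each entry
    is a dict keyed by the stage name.
    """
    stages = explain_result.get("stages", [])
    return any("$sampleFromRandomCursor" in stage for stage in stages)


# The explain output from this answer, reduced to the relevant keys.
explain_result = {
    "ok": 1.0,
    "stages": [
        {"$cursor": {"query": {},
                     "queryPlanner": {"winningPlan": {"stage": "MULTI_ITERATOR"}}}},
        {"$sampleFromRandomCursor": {"size": 10}},
    ],
}

print(uses_random_cursor(explain_result))
```

For this output the check prints True, which suggests the query was already on the random-cursor path rather than a collection scan followed by a sort.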

Siliculose answered 2/4, 2018 at 9:44 Comment(0)

For those who are confused by $sample: it is efficient only when all of the following conditions are met:

$sample is the first stage of the pipeline
N is less than 5% of the total documents in the collection
The collection contains more than 100 documents

If any of the above conditions are NOT met, $sample performs a collection scan followed by a random sort to select N documents.
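
The three conditions can be written as a small predicate and applied to the numbers reported in this thread. This is only a sketch: `meets_random_cursor_conditions` is a hypothetical helper, not part of any MongoDB driver, and the server makes this decision internally. It also assumes $sample is the first stage in every case, which the 4.2.1 comment does not state.

```python
def meets_random_cursor_conditions(sample_size, total_docs, is_first_stage=True):
    """Check the three documented conditions for the pseudo-random cursor."""
    return (
        is_first_stage
        and sample_size < 0.05 * total_docs  # N is less than 5% of the collection
        and total_docs > 100                 # more than 100 documents
    )


# Figures taken from the posts in this thread
# (assumption: $sample is the first pipeline stage in each case).
cases = {
    "question: 1 of 50,000": (1, 50_000),
    "3.6 answer: 10 of ~40 million": (10, 40_000_000),
    "4.2.1 comment: 87K of 500 million": (87_000, 500_000_000),
}

for label, (size, total) in cases.items():
    print(label, meets_random_cursor_conditions(size, total))
```

Under that assumption all three cases satisfy the conditions, so the slowness reported in them may have a cause other than the fallback to a collection scan.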

More on: https://docs.mongodb.com/manual/reference/operator/aggregation/sample/

Bayonet answered 8/6, 2020 at 5:4 Comment(0)

The MongoDB documentation states:

If all the following conditions are met, $sample uses a pseudo-random cursor to select documents:

$sample is the first stage of the pipeline
N is less than 5% of the total documents in the collection
The collection contains more than 100 documents

If any of the above conditions are NOT met, $sample performs a collection scan followed by a random sort to select N documents. In this case, the $sample stage is subject to the sort memory restrictions.

I believe that in your case MongoDB performs a full collection scan.

Reference: https://docs.mongodb.com/manual/reference/operator/aggregation/sample/

Queston answered 5/12, 2020 at 7:47 Comment(0)
