Manually Setting the Seed for MongoDB $sample
K

3

24

I am using the $sample stage in a MongoDB aggregation pipeline, in the following manner:

db.col.aggregate([
    {$match: {topic: topic}},
    {$sample: {size: 10}},
    {$project: {_id: 1}}
])

My question is: is there a way to set the 'seed' for the sampling, so that every time I run this command I get the same result?

For example, in python I do it like the following:

import random
list_of_items = [...]

# set the seed to 0 
random.seed(0)

# get sample 
samples = random.sample(list_of_items, 10)

By manually defining the seed, I make sure that the result is the same every time I do this operation.

Kroon answered 18/4, 2016 at 9:56 Comment(4)
No, there is not. Otherwise it would not be a "random sample". If you want a list of the same things all the time, then store the selected _id values and supply those with an $in query instead. - Bedspring
@Kroon did you find anything to fix your problem? I haven't found any reasonable solution to my problem. - Hilaire
@NeilLunn: there are a number of use cases for seeding the aggregation operation, particularly for reproducibility (e.g. testing purposes, machine learning, and so on). - Piccalilli
@Piccalilli is there a term I can look up for these use cases? I tried googling (for example "mongo aggregate $sample random seed") but I haven't been able to find anything. - Ipomoea
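
The first comment's suggestion (fix the set of _id values once, then query with $in) can be combined with the seeded sampling from the question. The following is only a sketch: it samples on the client rather than with $sample, and the helper names `freeze_sample` and `in_query` are hypothetical. The pymongo calls appear only in comments and assume a collection object named `col`.

```python
import random

def freeze_sample(all_ids, size, seed=0):
    """Pick `size` ids reproducibly on the client side (not via $sample)."""
    rng = random.Random(seed)
    # Sort first so the result does not depend on the order
    # in which the server happened to return the ids.
    return rng.sample(sorted(all_ids), size)

def in_query(ids):
    """Build the filter that fetches exactly the frozen ids."""
    return {"_id": {"$in": list(ids)}}

# Assumed usage with pymongo (requires a running server):
#   all_ids = [d["_id"] for d in col.find({"topic": topic}, {"_id": 1})]
#   docs = col.find(in_query(freeze_sample(all_ids, 10, seed=0)))
```

This pulls every matching _id to the client, so it is only reasonable when the matched set is small enough to hold in memory. The result is stable only as long as the set of matching documents does not change.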
E
3

One workaround we have used for similar issues is to add an $out stage after $sample to write the sampled documents to a 'snapshot' collection. We then run our experiments against the 'snapshot' collection, which gives reproducible behavior.

Another advantage is that we can create indexes on the 'snapshot' collection to speed up our experiments as needed.
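
A minimal sketch of the snapshot idea, reusing the pipeline from the question. The function name `snapshot_pipeline` and the collection name `sample_snapshot` are made up for illustration; the pymongo calls are shown only in comments.

```python
def snapshot_pipeline(topic, size, out_name):
    """Sample once and write the result to `out_name`.

    $out must be the last stage, and it replaces the target
    collection if that collection already exists.
    """
    return [
        {"$match": {"topic": topic}},
        {"$sample": {"size": size}},
        {"$out": out_name},
    ]

# Assumed usage with pymongo (requires a running server):
#   db.col.aggregate(snapshot_pipeline(topic, 10, "sample_snapshot"))
#   docs = list(db["sample_snapshot"].find())   # same documents on every read
```

Run the pipeline once and read from the snapshot afterwards; running it again draws a new sample and overwrites the old one.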

Evesham answered 16/2, 2021 at 2:41 Comment(0)
F
2

You can use a workaround until the MongoDB team implements this feature.

Assign each document a random value in [0, 1], stored in its own field, then sort by that field and limit the result to the sample size.
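
One way this could look, as a sketch only. The field name `rand_key` and both helper names are assumptions, and the keys here come from Python's seeded generator, which yields values in [0, 1). The extra sort on _id is an added tie-breaker so that two documents with equal keys still come back in a fixed order.

```python
import random

def random_keys(ids, seed=0):
    """Map each id to a reproducible pseudo-random key in [0, 1)."""
    rng = random.Random(seed)
    # Sorted iteration makes the assignment independent of input order.
    return {doc_id: rng.random() for doc_id in sorted(ids)}

def seeded_sample_pipeline(topic, size, field="rand_key"):
    """Take the `size` documents with the smallest stored keys."""
    return [
        {"$match": {"topic": topic}},
        {"$sort": {field: 1, "_id": 1}},  # _id breaks ties between equal keys
        {"$limit": size},
        {"$project": {"_id": 1}},
    ]

# Assumed usage with pymongo (requires a running server):
#   for doc_id, key in random_keys(all_ids).items():
#       col.update_one({"_id": doc_id}, {"$set": {"rand_key": key}})
#   docs = list(col.aggregate(seeded_sample_pipeline(topic, 10)))
```

Because the keys are stored, the same pipeline returns the same documents until the keys are reassigned or the matching documents change.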

Fisher answered 20/11, 2023 at 10:44 Comment(2)
I would suggest adding a unique index on the new random field too, as multiple documents could share the same random value and still produce a nondeterministic result set. - Evesham
@Evesham Well spotted, Ray! :) - Fisher
H
0

This is not currently possible but you may request this feature via https://feedback.mongodb.com/.

Heterosporous answered 27/11, 2020 at 2:5 Comment(1)
jira.mongodb.org/browse/SERVER-32928 - Fisher
