Efficient paging in MongoDB using mgo
Asked Answered
S

1

10

I've searched and found no Go solution to the problem, not with or without using mgo.v2, not on StackOverflow and not on any other site. This Q&A is in the spirit of knowledge sharing / documenting.


Let's say we have a users collection in MongoDB modeled with this Go struct:

type User struct {
    ID      bson.ObjectId `bson:"_id"`
    Name    string        `bson:"name"`
    Country string        `bson:"country"`
}

We want to sort and list users based on some criteria, but have paging implemented due to the expected long result list.

To achieve paging of the results of some query, MongoDB and the mgo.v2 driver package has built-in support in the form of Query.Skip() and Query.Limit(), e.g.:

session, err := mgo.Dial(url) // Acquire Mongo session, handle error!

c := session.DB("").C("users")
q := c.Find(bson.M{"country" : "USA"}).Sort("name", "_id").Limit(10)

// To get the nth page:
q = q.Skip((n-1)*10)

var users []*User
err = q.All(&users)

This however becomes slow if the page number increases, as MongoDB can't just "magically" jump to the xth document in the result, it has to iterate over all the result documents and omit (not return) the first x that need to be skipped.

MongoDB provides the right solution: If the query operates on an index (it has to work on an index), cursor.min() can be used to specify the first index entry to start listing results from.

This Stack Overflow answer shows how it can be done using a mongo client: How to do pagination using range queries in MongoDB?

Note: the required index for the above query would be:

db.users.createIndex(
    {
        country: 1,
        name: 1,
        _id: 1
    }
)

There is one problem though: the mgo.v2 package has no support specifying this min().

How can we achieve efficient paging that uses MongoDB's cursor.min() feature using the mgo.v2 driver?

Sonyasoo answered 16/11, 2016 at 14:38 Comment(0)
S
21

Unfortunately the mgo.v2 driver does not provide API calls to specify cursor.min().

But there is a solution. The mgo.Database type provides a Database.Run() method to run any MongoDB commands. The available commands and their documentation can be found here: Database commands

Starting with MongoDB 3.2, a new find command is available which can be used to execute queries, and it supports specifying the min argument that denotes the first index entry to start listing results from.

Good. What we need to do is after each batch (documents of a page) generate the min document from the last document of the query result, which must contain the values of the index entry that was used to execute the query, and then the next batch (the documents of the next page) can be acquired by setting this min index entry prior to executing the query.

This index entry –let's call it cursor from now on– may be encoded to a string and sent to the client along with the results, and when the client wants the next page, he sends back the cursor saying he wants results starting after this cursor.

Doing it manually (the "hard" way)

The command to be executed can be in different forms, but the command name (find) must be first in the marshaled result, so we'll use bson.D (which preserves order in contrast to bson.M):

limit := 10
cmd := bson.D{
    {Name: "find", Value: "users"},
    {Name: "filter", Value: bson.M{"country": "USA"}},
    {Name: "sort", Value: []bson.D{
        {Name: "name", Value: 1},
        {Name: "_id", Value: 1},
    },
    {Name: "limit", Value: limit},
    {Name: "batchSize", Value: limit},
    {Name: "singleBatch", Value: true},
}
if min != nil {
    // min is inclusive, must skip first (which is the previous last)
    cmd = append(cmd,
        bson.DocElem{Name: "skip", Value: 1},
        bson.DocElem{Name: "min", Value: min},
    )
}

The result of executing a MongoDB find command with Database.Run() can be captured with the following type:

var res struct {
    OK       int `bson:"ok"`
    WaitedMS int `bson:"waitedMS"`
    Cursor   struct {
        ID         interface{} `bson:"id"`
        NS         string      `bson:"ns"`
        FirstBatch []bson.Raw  `bson:"firstBatch"`
    } `bson:"cursor"`
}

db := session.DB("")
if err := db.Run(cmd, &res); err != nil {
    // Handle error (abort)
}

We now have the results, but in a slice of type []bson.Raw. But we want it in a slice of type []*User. This is where Collection.NewIter() comes handy. It can transform (unmarshal) a value of type []bson.Raw into any type we usually pass to Query.All() or Iter.All(). Good. Let's see it:

firstBatch := res.Cursor.FirstBatch
var users []*User
err = db.C("users").NewIter(nil, firstBatch, 0, nil).All(&users)

We now have the users of the next page. Only one thing left: generating the cursor to be used to get the subsequent page should we ever need it:

if len(users) > 0 {
    lastUser := users[len(users)-1]
    cursorData := []bson.D{
        {Name: "country", Value: lastUser.Country},
        {Name: "name", Value: lastUser.Name},
        {Name: "_id", Value: lastUser.ID},
    }
} else {
    // No more users found, use the last cursor
}

This is all good, but how do we convert a cursorData to string and vice versa? We may use bson.Marshal() and bson.Unmarshal() combined with base64 encoding; the use of base64.RawURLEncoding will give us a web-safe cursor string, one that can be added to URL queries without escaping.

Here's an example implementation:

// CreateCursor returns a web-safe cursor string from the specified fields.
// The returned cursor string is safe to include in URL queries without escaping.
func CreateCursor(cursorData bson.D) (string, error) {
    // bson.Marshal() never returns error, so I skip a check and early return
    // (but I do return the error if it would ever happen)
    data, err := bson.Marshal(cursorData)
    return base64.RawURLEncoding.EncodeToString(data), err
}

// ParseCursor parses the cursor string and returns the cursor data.
func ParseCursor(c string) (cursorData bson.D, err error) {
    var data []byte
    if data, err = base64.RawURLEncoding.DecodeString(c); err != nil {
        return
    }

    err = bson.Unmarshal(data, &cursorData)
    return
}

And we finally have our efficient, but not so short MongoDB mgo paging functionality. Read on...

Using github.com/icza/minquery (the "easy" way)

The manual way is quite lengthy; it can be made general and automated. This is where github.com/icza/minquery comes into the picture (disclosure: I'm the author). It provides a wrapper to configure and execute a MongoDB find command, allowing you to specify a cursor, and after executing the query, it gives you back the new cursor to be used to query the next batch of results. The wrapper is the MinQuery type which is very similar to mgo.Query but it supports specifying MongoDB's min via the MinQuery.Cursor() method.

The above solution using minquery looks like this:

q := minquery.New(session.DB(""), "users", bson.M{"country" : "USA"}).
    Sort("name", "_id").Limit(10)
// If this is not the first page, set cursor:
// getLastCursor() represents your logic how you acquire the last cursor.
if cursor := getLastCursor(); cursor != "" {
    q = q.Cursor(cursor)
}

var users []*User
newCursor, err := q.All(&users, "country", "name", "_id")

And that's all. newCursor is the cursor to be used to fetch the next batch.

Note #1: When calling MinQuery.All(), you have to provide the names of the cursor fields, this will be used to build the cursor data (and ultimately the cursor string) from.

Note #2: If you're retrieving partial results (by using MinQuery.Select()), you have to include all the fields that are part of the cursor (the index entry) even if you don't intend to use them directly, else MinQuery.All() will not have all the values of the cursor fields, and so it will not be able to create the proper cursor value.

Check out the package doc of minquery here: https://godoc.org/github.com/icza/minquery, it is rather short and hopefully clean.

Sonyasoo answered 16/11, 2016 at 14:38 Comment(17)
Why var users []*User and not var users []User ?Lenis
@EzequielMoreno Both works; I usually use a slice of pointers because I can easily post-process the slice with a for range, and it's more efficient to pass around individual users.Sonyasoo
@Sonyasoo can you provide an example of the getLastCursor() function? There is no example in doc and I'm not sure what cursor should be initialized to.Matabele
@Matabele getLastCursor() is a non-existing function, it is supposed to represent your logic how you acquire the last cursor. When you query the first page, cursor should be the empty string (you don't have to call Minquery.Cursor() to get the first page). MinQuery.All() returns you the docs of the first page and a cursor. To get the second (next) page, use this cursor. In general, to get the next page, use the cursor returned by the previous MinQuery.All() call.Sonyasoo
@Sonyasoo thank you. I took a bit of a shortcut and just initialized a global variable, which starts as an empty string, then takes the contents of newCursor as the cursor changes. This is a fantastic package. Makes my life much easier.Matabele
@Sonyasoo I had some problems with skipping results using your package. Setting the skip value back to 0 fixed it for me, but I was using a simple global variable instead of writing a function to set cursors. I created a humble pull request, feel free to reject it if you disagree.Matabele
@Matabele The skip(1) is the right way as the entry / document specified by the cursor is implicit, meaning the cursor is generated from the last result, and using it without skip would again return the last result, hence the skip(1) to start listing elements after the last result. I've tested and it works correctly. The error is most likely in your code which I don't know. You should post it as a new question.Sonyasoo
Ok, I closed the pull request. Anyways thank you. This is a really useful package.Matabele
@Matabele I posted a full working example on your pull request in a comment which demonstrates the correct working. Please check it: full working example.Sonyasoo
Let us continue this discussion in chat.Matabele
is this able to handle pipelines?Lenis
@EzequielMoreno MinQuery utilizes / operates on a MongoDB index, so it supports what an index supports: sorting, projecting and limiting the results. Plus setting the index entry of the first element to retrieve.Sonyasoo
I just wanted to stop in and say this is one of the best answers I've ever read on this site.Lennyleno
how do you return the total count using minquery?Pack
@EdwinIkechukwu You don't need MinQuery to query the total number of records. Simply execute a count-query with the query you pass to minquery.New(), something like: coll.Find(query).Count().Sonyasoo
@Sonyasoo hi, have a question about having cursorFields in MinQuery.All(...). Why this fields are not constructed from filter and sort? What is the reason to keep this explicit?Wrung
@valichek Filter and sorting fields may be completely different than the cursor fields. To allow this kind of flexibility, you have to be explicit about the cursor fields.Sonyasoo

© 2022 - 2024 — McMap. All rights reserved.