MongoDB, MapReduce and sorting
Asked Answered
M

2

11

I might be a bit in over my head on this as I'm still learning the ins and outs of MongoDB, but here goes.

Right now I'm working on a tool to search/filter through a dataset, sort it by an arbitrary datapoint (eg. popularity) and then group it by an id. The only way I see I can do this is through Mongo's MapReduce functionality.

I can't use .group() because I'm working with more than 10,000 keys and I also need to be able to sort the dataset.

My MapReduce code is working just fine, except for one thing: sorting. Sorting just doesn't want to work at all.

db.runCommand({
  'mapreduce': 'products',
  'map': function() {
    emit({
      product_id: this.product_id,
      popularity: this.popularity
    }, 1);
  },
  'reduce': function(key, values) {
    var sum = 0;
    values.forEach(function(v) {
      sum += v;
    });

    return sum;  
  },
  'query': {category_id: 20},
  'out': {inline: 1},
  'sort': {popularity: -1}
});

I already have a descending index on the popularity datapoint, so it's definitely not working because of a lack of that:

{ 
  "v" : 1, 
  "key" : { "popularity" : -1 }, 
  "ns" : "app.products", 
  "name" : "popularity_-1" 
}

I just cannot figure out why it doesn't want to sort.

Instead of inlining the result set, I can't output it to another collection and then run a .find().sort({popularity: -1}) on that because of the way this feature is going to work.

Munda answered 18/8, 2012 at 1:25 Comment(5)
have you considered using 2.2 and aggregation framework?Asepsis
Sorting currently isn't supported for an inline Map/Reduce (see SERVER-3973 to watch/vote on this feature). An inline Map/Reduce (or an aggregate query) is also limited to the max document size (currently 16Mb in MongoDB 2.0). What is the issue preventing output to a temporary collection which could be queried in sort order? Another option would be to sort the result in your client app, eg using usort().Wabble
@Aysa: I have, but 2.2 isn't production ready yet.Munda
@Stennie: Seems to be the case. The issue preventing me from outputting it to a temporary collection is that I can't have this feature creating a new collection for every request that comes into the app. The Mongo docs don't read like I can create a "temporary" collection anymore anyways. Of course I can delete it after I'm done, but that doesn't seem like a great solution. And I'm not worried about the 16MB document size limit at all.Munda
@JonUrsenbach: The second release candidate for MongoDB 2.2 came out this week (rc1, following rc0 in mid-July). I expect a production release isn't far off; would be worth testing.Wabble
P
15

First of all, Mongo map/reduce are not designed to be used in as a query tool (as it is in CouchDB), it is design for you to run background tasks. I use it at work to analyze traffic data.

What you are doing wrong however is that you're applying the sort() to your input, but it is useless because when the map() stage is done the intermediate documents are sorted by each keys. Because your key is a document, it is being sort by product_id, popularity.

This is how I generated my dataset

function generate_dummy_data() {
    for (i=2; i < 1000000; i++) { 
        db.foobar.save({
          _id: i, 
         category_id: parseInt(Math.random() * 30), 
         popularity:    parseInt(Math.random() * 50)
        }) 
    }
}

And this my map/reduce task:

var data = db.runCommand({
  'mapreduce': 'foobar',
  'map': function() {
    emit({
      sorting: this.popularity * -1,
      product_id: this._id,
      popularity: this.popularity,
    }, 1);
  },
  'reduce': function(key, values) {
    var sum = 0;
    values.forEach(function(v) {
      sum += v;
    });

    return sum;  
  },
  'query': {category_id: 20},
  'out': {inline: 1},
});

And this is the end result (very long to paste it here):

http://cesarodas.com/results.txt

This works because now we're sorting by sorting, product_id, popularity. You can play with the sorting how ever you like just remember that the final sorting is by key regardless of you how your input is sorted.

Anyway as I said before you should avoid doing queries with Map/Reduce it was designed for background processing. If I were you I would design my data in such a way I could access it with simple queries, there is always a trade-off in this case complex insert/updates to have simple queries (that's how I see MongoDB).

Pleuro answered 18/8, 2012 at 6:16 Comment(4)
Nice workaround .. I didn't realise the emit keys would maintain sort order even if no explicit sort set :). Wonder if this is an accidental side effect; the Map/Reduce docs actually mention sort key as "Useful for optimization, like sorting by the emit key for fewer reduces".Wabble
This is not a Mongo feature, it is a map/reduce feature that applies to the algorithm, it must be sorted by key (so it applies to CouchDB, Hadoop, etc). I've learned watching some video of Cloudera. As far as I see mongo's sort is only to sort the input not the outputPleuro
This works beautifully. Not sure if restructuring this data, again, is possible so it looks like I'll just have to make due with M/R until 2.2 is stable and production ready and revisit with the new aggregation framework. Thanks again, crodas.Munda
@Pleuro Actually sorting the input could be beneficial to MR performance this post discuss about it, and I tested in my analytics DB and it does help indeed (maybe not 6x as the article states). Cheers.Ranit
W
10

As noted in discussion on the original question:

  • Map/Reduce with inline output currently cannot use an explicit sort key (see SERVER-3973). Possible workarounds include relying on the emitted key order (see @crodas's answer); outputting to a collection and querying that collection with sort order; or sorting the results in your application using something like usort().

  • OP's preference is for inline results rather than creating/deleting temporary collections.

  • The Aggregation Framework in MongoDB 2.2 (currently a production release candidate) would provide a suitable solution.

Here's an example of a similar query to the original Map/Reduce, but instead using the Aggregation Framework:

db.products.aggregate(
  { $match: { category_id: 20 }},
  { $group : {
     _id : "$product_id",
     'popularity' : { $sum : "$popularity" },
  }},
  { $sort: { 'popularity': -1 }}
)

.. and sample output:

{
    "result" : [
        {
            "_id" : 50,
            "popularity" : 139
        },
        {
            "_id" : 150,
            "popularity" : 99
        },
        {
            "_id" : 123,
            "popularity" : 55
        }
    ],
    "ok" : 1
}
Wabble answered 18/8, 2012 at 6:2 Comment(1)
Going to go with Crodas' solution for now until 2.2 is production ready and then revisit with the aggregation framework. Thanks again.Munda

© 2022 - 2024 — McMap. All rights reserved.