Finding mongoDB records in batches (using mongoid ruby adapter)

Asked 12/8, 2011 at 14:3 Answered 25/7, 2014 at 11:22

Using rails 3 and mongoDB with the mongoid adapter, how can I batch finds to the mongo DB? I need to grab all the records in a particular mongo DB collection and index them in solr (initial index of data for searching).

The problem I'm having is that doing Model.all grabs all the records and stores them into memory. Then when I process over them and index in solr, my memory gets eaten up and the process dies.

What I'm trying to do is batch the find in mongo so that I can iterate over 1,000 records at a time, pass them to solr to index, and then process the next 1,000, etc...

The code I currently have does this:

Model.all.each do |r|
  Sunspot.index(r)
end

For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk up the queries into manageable batches that keeps the memory from getting out of control. However, I can't seem to find anything like this for mongoDB/mongoid.

I would LIKE to be able to do something like this:

Model.all.in_batches_of(1000) do |batch|
  Sunpot.index(batch)
end

That would alleviate my memory problems and query difficulties by only doing a manageable problem set each time. The documentation is sparse, however, on doing batch finds in mongoDB. I see lots of documentation on doing batch inserts but not batch finds.

Ginetteginevra answered 12/8, 2011 at 14:3 Comment(2)

Are you sure you're seeing memory issues with this? Mongoid and the underlying Mongo driver already batch queries with a cursor. This keeps the memory footprint small. – Preferential 23/12, 2011 at 2:30

By the way, you should change the accepted answer to @RyanMcGeary's one - then all future visitos of your question will see the correct one and nobody won't implement manual optimization that is already done by the driver . – Abomination 11/10, 2013 at 11:28

With Mongoid, you don't need to manually batch the query.

In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying Mongo driver cursor already batches all records. By default the batch_size is 100.

For more information on this topic, read this comment from the Mongoid author and maintainer.

In summary, you can just do this:

Model.all.each do |r|
  Sunspot.index(r)
end

Preferential answered 23/12, 2011 at 2:29 Comment(9)

thanks for the info @RyanMcGeary, god how have i missed the cursor thing,,, in the link durran specified about batch_size, how can we specify that externally...? – Dungaree 23/12, 2011 at 7:59

@RameshVel, I'm not sure if Mongoid exposes the ability to change the batch_size per query. That might be a worthy patch if it isn't already an option. – Preferential 28/12, 2011 at 15:50

Nice, And what about other Enumerable methods like map or collect? – Rondure 7/10, 2013 at 12:39

so does that mean by default, the database is hit ~ n / 100 times everytime? – Refuel 21/7, 2015 at 19:20

@Edmund "Hit" probably isn't the best word to use here, because it implies re-running the query each time. It's a database cursor. Think of it more like streaming the data across in batches of 100. – Preferential 21/7, 2015 at 23:5

@RyanMcGeary link inside your answer is broken. Can you edit/correct? – Mistrustful 4/5, 2016 at 8:0

@Mistrustful Took me a while to find the same comment. It's been almost 5 years, and Mongoid has since switched from GitHub Issues to JIRA. I think I found the appropriate comment. – Preferential 8/5, 2016 at 21:51

For the record in recent versions the batch size internally usually starts at 100 but then increases to reduce the number of calls to the database. What's great about this also is that it works with all enumerable methods so if you want to gets your records in actual ruby batches (like arrays of 100), you can do: Model.all.each_slice(100) { |array| ... } – Roadwork 5/3, 2020 at 10:11

One MAJOR catch with this behavior that gets me over and over is that it doesn't work on relations, because the relation will store an "IdentityMap" of all the loaded records. For instance: person.purchases.each { ... } will load all "purchases" into memory, attached to the person instance. Instead, you have to call Purchase.where(person: person).each to avoid storing all the returned records in memory. – Tribadism 10/4, 2020 at 23:9

If you are iterating over a collection where each record requires a lot of processing (i.e querying an external API for each item) it is possible for the cursor to timeout. In this case you need to perform multiple queries in order to not leave the cursor open.

require 'mongoid'

module Mongoid
  class Criteria
    def in_batches_of(count = 100)
      Enumerator.new do |y|
        total = 0

        loop do
          batch = 0

          self.limit(count).skip(total).each do |item|
            total += 1
            batch += 1
            y << item
          end

          break if batch == 0
        end
      end
    end
  end
end

Here is a helper method you can use to add the batching functionality. It can be used like so:

Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
  # call external slow API
end

Just make sure you ALWAYS have an order_by on your query. Otherwise the paging might not do what you want it to. Also I would stick with batches of 100 or less. As said in the accepted answer Mongoid queries in batches of 100 so you never want to leave the cursor open while doing the processing.

Device answered 16/5, 2014 at 17:26 Comment(2)

The .no_timeout method on criteria saves you from having to manually reconnect: Post.all.order_by(:id => 1).batch_size(7).no_timeout.each_with_index do ... – Doublequick 14/2, 2015 at 13:37

@Doublequick In some cases, this doesn't work; even with no_timeout, it will timeout regardless. I don't know what the limit was, but from what I observed, it will timeout if you iterate over the collection for around 2-3 hours. – Zaccaria 2/5, 2022 at 9:33

It is faster to send batches to sunspot as well. This is how I do it:

records = []
Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
  records << r
  if records.size > 1000
    Sunspot.index! records
    records.clear
  end
end
Sunspot.index! records

no_timeout: prevents the cursor to disconnect (after 10 min, by default)

only: selects only the id and the fields, which are actually indexed

batch_size: fetch 1000 entries instead of 100

Enclose answered 4/3, 2013 at 19:33 Comment(2)

remember to 'Sunspot.index! records' after the loop or you won't index the last group of < 1000 I believe – Noletta 5/6, 2014 at 4:22

Correct. I forgot to copy this part. – Enclose 16/6, 2014 at 19:55

I am not sure about the batch processing, but you can do this way

current_page = 0
item_count = Model.count
while item_count > 0
  Model.all.skip(current_page * 1000).limit(1000).each do |item|
    Sunpot.index(item)
  end
  item_count-=1000
  current_page+=1
end

But if you are looking for a perfect long time solution i wouldn't recommend this. Let me explain how i handled the same scenario in my app. Instead of doing batch jobs,

i have created a resque job which updates the solr index

class SolrUpdator
 @queue = :solr_updator

 def self.perform(item_id)
   item = Model.find(item_id)
   #i have used RSolr, u can change the below code to handle sunspot
   solr = RSolr.connect :url => Rails.application.config.solr_path
   js = JSON.parse(item.to_json)
   solr.add js         
 end

end

After adding the item, i just put an entry to the resque queue
```
Resque.enqueue(SolrUpdator, item.id.to_s)
```
Thats all, start the resque and it will take care of everything

Dungaree answered 12/8, 2011 at 14:39 Comment(2)

Ramesh, the first block of code you provided works very well for my use case. It's just a one-time load and index of the data using a script file, so using resque may be overkill for my particular case. But the batching ability works perfectly! – Ginetteginevra 12/8, 2011 at 17:0

This isn't necessary. Mongoid and the underlying Mongo driver already batch queries with a cursor. This keeps the memory footprint small. – Preferential 23/12, 2011 at 2:30

-3

As @RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much much slower than batching them.

Model.all.to_a.in_groups_of(1000, false) do |records|
  Sunspot.index! records
end

Pittance answered 28/2, 2012 at 21:5 Comment(3)

Model.all.to_a would load the entire collection into memory. – Astrahan 20/12, 2012 at 23:52

That's right, please don't do this: When we're talking about large datasets avoid converting entire collection to array at once: use Model.find_each or batch in any way but never Model.all.to_a – Gottlieb 7/10, 2013 at 20:1

Model.find_each is not a Mongoid method. You would use Model.all.each instead. – Fylfot 28/1, 2015 at 23:8

-3

The following will work for you , just try it

Model.all.in_groups_of(1000, false) do |r|
  Sunspot.index! r
end

Durrett answered 25/7, 2014 at 11:22 Comment(3)

Loading all the database in memory... duh. The whole point of this is to be able to query documents in batches, if you have 4 million documents you will kill your server by first loading them into a single array, and then another array of groups. – Doublequick 14/2, 2015 at 13:39

@Doublequick please check the above solution , the same what I given he explained. thanks for explanation Ryan McGeary. – Durrett 17/3, 2015 at 15:57

in_groups_of is a Rails Array method, to be used you should convert Model.all to an array, which is not recommended at all. The -1 is to warn people to not do that. – Peony 19/3, 2020 at 14:30

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags