How do I use a Rails cache to store Nokogiri objects?
N

2

9

I'm using Rails 5 and want to use a Rails cache to store Nokogiri objects.

I created this in config/initializers/cache.rb:

$cache = ActiveSupport::Cache::MemoryStore.new

and I wanted to store documents like:

$cache.fetch(url) {
  result = get_content(url, headers, follow_redirects)
}

but I'm getting this error:

Error during processing: (TypeError) no _dump_data is defined for class Nokogiri::HTML::Document
/Users/davea/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.2/lib/active_support/cache.rb:671:in `dump'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.2/lib/active_support/cache.rb:671:in `dup_value!'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.2/lib/active_support/cache/memory_store.rb:128:in `write_entry'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.2/lib/active_support/cache.rb:398:in `block in write'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.2/lib/active_support/cache.rb:562:in `block in instrument'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.2/lib/active_support/notifications.rb:166:in `instrument'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.2/lib/active_support/cache.rb:562:in `instrument'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.2/lib/active_support/cache.rb:396:in `write'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.2/lib/active_support/cache.rb:596:in `save_block_result_to_cache'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.2/lib/active_support/cache.rb:300:in `fetch'
/Users/davea/Documents/workspace/myproject/app/helpers/webpage_helper.rb:116:in `get_cached_content'
/Users/davea/Documents/workspace/myproject/app/helpers/webpage_helper.rb:73:in `get_url'
/Users/davea/Documents/workspace/myproject/app/services/abstract_my_object_finder_service.rb:29:in `process_data'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:26:in `block (2 levels) in run_all_crawlers'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/ruby_thread_pool_executor.rb:348:in `run_task'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/ruby_thread_pool_executor.rb:337:in `block (3 levels) in create_worker'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/ruby_thread_pool_executor.rb:320:in `loop'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/ruby_thread_pool_executor.rb:320:in `block (2 levels) in create_worker'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/ruby_thread_pool_executor.rb:319:in `catch'
/Users/davea/.rvm/gems/ruby-2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/ruby_thread_pool_executor.rb:319:in `block in create_worker'

What do I need to do in order to be able to store these objects in a cache?

Nynorsk answered 19/4, 2017 at 15:4 Comment(3)
Apparently, not. Cache is good for storing strings, though.Contretemps
Why would you want to store an object? Store a serialized hash or array containing information you scraped from HTML or XML using Nokogiri. If you need to store objects then look into memoization.Complex
Thinking about this more, a memory cache is good for things you need to access immediately, but, if the machine goes down, can also be recreated quickly. If you're using Nokogiri, odds are good you're scraping a page, which implies you're loading that page, and the load, parse, scrape process adds latency you don't want (hence the idea to use a cache), but instead you should gather your (meta)data and store it in your database where it's permanently available. The DBM will cache internally. It's not as fast as an in-memory cache but it's better than recreating on request or at app startup.Complex
C
2

Use Nokogiri's serialize functionality:

$cache = ActiveSupport::Cache::MemoryStore.new
noko_object = Nokogiri::HTML::Document.new

# Serialize the document to a string and cache the string, not the object.
serial = noko_object.serialize
$cache.write(url, serial)
# The serialized Nokogiri document is now in the store under the URL key.
result = $cache.read(url)

# Parse the cached string to get a document back.
noko_object = Nokogiri::HTML(result)
# noko_object is now equivalent to the original document again :)

Check out the documentation here for more information.

Clinical answered 27/4, 2017 at 0:43 Comment(7)
Thanks but what is the code for "Store serialized object in cache"? I thought the body of the "$cache.fetch(url) {" would take care of storing and then retrieving things?Nynorsk
You very well may not need anything there, I was thinking you may be doing something additional there. Simply skip it, what you are looking for is serialize.Clinical
yes but this is still failing because "get_content" returns a Nokogiri Document (the method cannot be altered, sadly) and as such it is causing the outer "$cache.fetch" to fail with the error I listed. Assume I know nothing (which is basically true) and please spell it out for me. How do I write a method that returns a Nokogiri Document that utilizes my Rails cache?Nynorsk
Dave, did you check Edit3? fetch and write are two different calls. It looks like somewhere the Nokogiri object, not the string, is being saved to the cache.Lipps
@Nynorsk I have updated my answer, this should be all you need to do to store the serialized Nokogiri object and retrieve it. :)Clinical
Oh ok, so to tie it all together, if I call $cache.fetch(url) by itself, should that just return the serialized string or nil if there is no entry in the $cache?Nynorsk
@Dave, that is correct - api.rubyonrails.org/classes/ActiveSupport/Cache/…Clinical
L
3

Store the XML as a string, not the object, and parse it once you get it out of the cache.
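
A minimal sketch of why a string works where the document does not. The trace in the question fails inside dup_value!, in a call to dump, which suggests the store copies each value with Marshal before keeping it. The example below uses only the standard library; the Proc is a stand-in chosen because, in CRuby, it is backed by native data much like a Nokogiri document, and that equivalence is an assumption rather than something taken from the Nokogiri source.

```ruby
# A plain String of markup survives a Marshal round trip unchanged.
markup = "<html><body><p>hello</p></body></html>"
copy = Marshal.load(Marshal.dump(markup))
puts copy == markup

# An object backed by native data cannot be dumped. A Proc is used here
# as a stand-in for the Nokogiri document so that no extra gems are needed.
begin
  Marshal.dump(proc {})
rescue TypeError => e
  puts "#{e.class}: #{e.message}"
end
```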

Edit: response to comment

Cache this instead

nokogiri_object.to_xml

Edit2: response to comment. Something along these lines. You will need to post more code if you want more specific help.

nokogiri_object = Nokogiri::XML(cache.fetch('xml_doc'))

Edit3: Response to 'Thanks but what is the code for "Store serialized object in cache"? I thought the body of the "$cache.fetch(url) {" would take care of storing and then retrieving things?'

cache.write('url', xml_or_serialized_nokogiri_string)
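
To tie the pieces together, here is a rough sketch of a wrapper that caches the markup string and always hands back a parsed document. The name cached_document and its cache: and parser: keywords are hypothetical, not part of Rails or Nokogiri. It uses to_html and Nokogiri::HTML rather than to_xml and Nokogiri::XML because the error in the question names a Nokogiri::HTML::Document. The block is expected to return a document or nil, as get_content is described as doing.

```ruby
# Hypothetical helper: stores the serialized markup (a String) in the cache
# and returns a freshly parsed document on every call.
#
# cache:  any object with an ActiveSupport-style fetch(key) { ... };
#         defaults to the $cache global from the question.
# parser: turns the cached String back into a document; the default
#         assumes the nokogiri gem is loaded.
def cached_document(url, cache: $cache, parser: ->(markup) { Nokogiri::HTML(markup) })
  markup = cache.fetch(url) do
    doc = yield              # e.g. get_content(url, headers, follow_redirects)
    doc && doc.to_html       # cache a String, or nil if nothing came back
  end
  markup && parser.call(markup)
end
```

With the setup from the question, the call would look something like cached_document(url) { get_content(url, headers, follow_redirects) }. The result is a Nokogiri document, or nil when get_content returned nothing.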
Lipps answered 21/4, 2017 at 23:38 Comment(4)
Hi, Where are you converting this back to a Nokogiri doc? My requirement is I need to invoke a cache method that allows me to store and retrieve Nokogiri docs. If they take some intermediate form in between, that's fine, but the end result must be a Nokogiri doc. I'm still not seeing how to achieve this with what you have provided.Nynorsk
Done! See my code in my question. The get_cached_data method contains the code starting with "$cache.fetch(url) {". The method "get_content(url, headers, follow_redirects)" returns the Nokogiri doc. So where do I take the result of that, convert it to XML, and then convert it back to a String?Nynorsk
can you add the get_content method code? What you should be saving in the cache is a string (xml)Lipps
get_content is a really long method that downloads a file from the internet and converts it to a Nokogiri Document. Know that it returns a Nokogiri document or nil (if nothing was returned).Nynorsk
