How to use Ruby fibers to avoid blocking I/O
I need to upload a bunch of files in a directory to S3. Since more than 90% of the upload time is spent waiting for the HTTP request to finish, I want to execute several requests at once somehow.

Can Fibers help me with this at all? They are described as a way to solve this sort of problem, but I can't think of any way I can do work while an HTTP call blocks.

Any way I can solve this problem without threads?

Shanks answered 3/3, 2010 at 0:15 Comment(1)
So, can anyone comment on fibers themselves? Am I correct in assuming that fibers have no "do stuff in the background" powers? – Shanks

I'm not up on fibers in 1.9, but regular Threads from 1.8.6 can solve this problem. Try using a Queue http://ruby-doc.org/stdlib/libdoc/thread/rdoc/classes/Queue.html

Looking at the example in the documentation, your consumer is the part that does the upload. It 'consumes' a URL and a file, and uploads the data. The producer is the part of your program that keeps working and finds new files to upload.

If you want to upload multiple files at once, simply launch a new Thread for each file:

@all_threads ||= []   # collect the upload threads so we can join them later

t = Thread.new do
  upload_file(param1, param2)
end
@all_threads << t

Then, later on in your 'producer' code (which, remember, doesn't have to be in its own Thread, it could be the main program):

@all_threads.each do |t|
  t.join   # returns immediately if the thread has already finished
end

The Queue can either be a @member_variable or a $global.
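
For instance, here's a minimal sketch of that producer/consumer setup with a fixed pool of worker threads. The upload_file call and the glob pattern are placeholders for your own code:

require 'thread' # Queue lives here on 1.8/1.9

queue = Queue.new

# Consumers: each worker pops a file path and uploads it, stopping
# when it sees the :done marker.
workers = Array.new(4) do
  Thread.new do
    while (file = queue.pop) != :done
      upload_file(file) # placeholder for your S3 upload routine
    end
  end
end

# Producer: the main program finds files and feeds the queue, then
# pushes one :done marker per worker so they all shut down cleanly.
Dir.glob('uploads/*').each { |file| queue << file }
workers.size.times { queue << :done }
workers.each(&:join)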

Bauhaus answered 3/3, 2010 at 9:42 Comment(1)
Although my fiber question seems to have gone unanswered – Shanks

To answer your actual questions:

Can Fibers help me with this at all?

No, they can't. Jörg W Mittag explains why best:

No, you cannot do concurrency with Fibers. Fibers simply aren't a concurrency construct, they are a control-flow construct, like Exceptions. That's the whole point of Fibers: they never run in parallel, they are cooperative and they are deterministic. Fibers are coroutines. (In fact, I never understood why they aren't simply called Coroutines.)

The only concurrency construct in Ruby is Thread.

When he says that the only concurrency construct in Ruby is Thread, remember that there are many different implementations of Ruby and that they vary in their threading implementations. Jörg once again provides a great answer to these differences, and correctly concludes that true parallelism is only available with something like JRuby (which maps its threads to native JVM threads) or by forking your process.
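
To see the point concretely: a Fiber runs only when something explicitly resumes it, so a blocking HTTP call inside one would block everything else anyway. A minimal sketch of the control-flow behavior:

fiber = Fiber.new do
  puts "step 1"
  Fiber.yield # suspend and hand control back to the caller
  puts "step 2"
end

fiber.resume # runs until the yield, printing "step 1"
puts "the caller decides when the fiber continues"
fiber.resume # resumes after the yield, printing "step 2"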

Any way I can solve this problem without threads?

Other than forking your process, I would also suggest that you look at EventMachine and something like em-http-request. It's an event-driven, non-blocking HTTP client built on the reactor pattern; it is asynchronous and does not incur the overhead of threads.
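
For a taste of what that looks like, here is a sketch of two parallel requests using em-http-request's MultiRequest interface; the URLs are placeholders, and the project wiki (linked in the comments) has the canonical pattern:

require 'eventmachine'
require 'em-http-request'

EventMachine.run do
  multi = EventMachine::MultiRequest.new

  # Both requests are issued immediately; the reactor multiplexes the sockets.
  multi.add :first,  EventMachine::HttpRequest.new('http://example.com/a').get
  multi.add :second, EventMachine::HttpRequest.new('http://example.com/b').get

  # Fires once every request has either succeeded or failed.
  multi.callback do
    multi.responses[:callback].each do |name, request|
      puts "#{name}: #{request.response_header.status}"
    end
    EventMachine.stop
  end
end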

Inlet answered 27/2, 2011 at 4:19 Comment(2)
See Synchronizing with Multi interface at github.com/igrigorik/em-http-request/wiki/Parallel-Requests – Priority
Yes, they can. You may need a way to do callbacks, like EventMachine or Celluloid. Here's an implementation: github.com/igrigorik/em-http-request – Holocene

Aaron Patterson (@tenderlove) uses an example almost exactly like yours to explain why you can and should use threads to achieve concurrency in your situation.

Most I/O libraries are now smart enough to release the GVL (Global VM Lock; most people know it as the GIL, or Global Interpreter Lock) when doing I/O. There is a simple function call in C to do this. You don't need to worry about the C code, but for you this means that most I/O libraries worth their salt will release the GVL and allow other threads to execute while the thread doing the I/O waits for the data to return.

If what I just said was confusing, you don't need to worry about it too much. The main thing that you need to know is that if you are using a decent library to do your HTTP requests (or any other I/O operation for that matter... database, interprocess communication, whatever), the Ruby interpreter (MRI) is smart enough to be able to release the lock on the interpreter and allow other threads to execute while one thread awaits IO to return. If the next thread has its own IO to grab, the Ruby interpreter will do the same thing (assuming that the IO library is built to utilize this feature of Ruby, which I believe most are these days).

So, to sum up what I am saying: use threads! You should see a performance benefit. If not, check whether your HTTP library uses the rb_thread_blocking_region() function in C and, if not, find out why. Maybe there is a good reason, or maybe you should consider a better library.
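
As a concrete sketch of that advice using only the standard library (the URLs are placeholders): each thread releases the GVL while its request is in flight, so the requests overlap instead of running back to back.

require 'net/http'
require 'uri'

urls = %w[https://example.com/a https://example.com/b https://example.com/c]

threads = urls.map do |url|
  Thread.new do
    # MRI releases the GVL while this call blocks on network I/O,
    # so the other threads keep running in the meantime.
    Net::HTTP.get_response(URI(url))
  end
end

# Thread#value joins the thread and returns the block's result.
threads.map(&:value).each { |response| puts response.code }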

The link to the Aaron Patterson video is here: http://www.youtube.com/watch?v=kufXhNkm5WU

It is worth a watch, even if just for the laughs, as Aaron Patterson is one of the funniest people on the internet.

Fistulous answered 24/4, 2013 at 20:57 Comment(0)

This is 2023, and things have changed. You can use Fibers to solve this issue!

You need Ruby 3.0 or newer (the release that introduced the non-blocking Fiber scheduler interface) and the fiber_scheduler gem.

Example of usage:

require "fiber_scheduler"
require "open-uri"

FiberScheduler do
  Fiber.schedule do
    URI.open("https://httpbin.org/delay/2")
  end

  Fiber.schedule do
    URI.open("https://httpbin.org/delay/2")
  end
end

That's it! Note that both fibers in the FiberScheduler block run concurrently: the two requests, each delayed by 2 seconds on the server, finish in roughly 2 seconds total rather than 4.

If you need to return a value, you can do this:

def async_request(url)
  response = nil

  FiberScheduler do
    Fiber.schedule do
      response = URI.open(url)
    end

    Fiber.schedule do
      until response
        print "not ready yet, waiting, or doing some stuff"
        sleep 5 # non-blocking under the scheduler; the other fiber keeps running
      end
    end
  end

  response
end

response = async_request("https://httpbin.org/delay/2")
Casein answered 12/10, 2023 at 0:55 Comment(0)

You could use separate processes for this instead of threads:

#!/usr/bin/env ruby

$stderr.sync = true

# Number of children to use for uploading
MAX_CHILDREN = 5

# Hash of PIDs for children that are working along with which file
# they're working on.
@child_pids = {}

# Keep track of uploads that failed
@failed_files = []

# Get the list of files to upload as arguments to the program
@files = ARGV


### Wait for a child to finish, adding the file to the list of those
### that failed if the child indicates there was a problem.
def wait_for_child
    $stderr.puts "    waiting for a child to finish..."
    pid, status = Process.waitpid2( 0 )
    file = @child_pids.delete( pid )
    @failed_files << file unless status.success?
end


### Here's where you'd put the particulars of what gets uploaded and
### how. I'm just sleeping for a duration proportional to the file
### size to simulate the upload, then returning either +true+ or
### +false+ based on a random factor.
def upload( file )
    bytes = File.size( file )
    sleep( bytes * 0.00001 )
    return rand( 100 ) > 5
end


### Start a child uploading the specified +file+.
def start_child( file )
    if pid = Process.fork
        $stderr.puts "%s: uploaded started by child %d" % [ file, pid ]
        @child_pids[ pid ] = file
    else
        if upload( file )
            $stderr.puts "%s: done." % [ file ]
            exit 0 # success
        else
            $stderr.puts "%s: failed." % [ file ]
            exit 255
        end
    end
end


until @files.empty?

    # If there are already the maximum number of children running, wait 
    # for one to finish
    wait_for_child() if @child_pids.length >= MAX_CHILDREN

    # Start a new child working on the next file
    start_child( @files.shift )

end


# Now we're just waiting on the final few uploads to finish
wait_for_child() until @child_pids.empty?

if @failed_files.empty?
    exit 0
else
    $stderr.puts "Some files failed to upload:",
        @failed_files.collect {|file| "  #{file}" }
    exit 255
end
Morra answered 3/3, 2010 at 6:9 Comment(0)
