Concurrent requests with MRI Ruby

I put together a simple example to try to demonstrate concurrent requests in Rails. Note that I am using MRI Ruby 2 and Rails 4.2.

  def api_call
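    # Simulate a slow, I/O-bound request; sleep releases MRI's GIL.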
    sleep(10)
    render :json => "done"
  end

I then go to 4 different tabs in Chrome on my Mac (i7 / 4 cores) and see if they get run in series or in parallel (really, concurrently, which is close but not the same thing), i.e., http://localhost:3000/api_call

I cannot get this to work using Puma, Thin, or Unicorn. The requests each come back in series: the first tab after 10 seconds, the second after 20 (since it had to wait for the first to complete), the third after that...

From what I have read, I believe the following to be true (please correct me if I'm wrong), and these were my results:

  • Unicorn is multiprocess, and my example should have worked (after defining the number of workers in a unicorn.rb config file), but it didn't. I can see 4 workers starting, but everything runs in series. I am using the unicorn-rails gem, starting Rails with unicorn -c config/unicorn.rb, and in my unicorn.rb I have:

-- unicorn.rb

worker_processes 4
preload_app true
timeout 30
listen 3000
after_fork do |server, worker|
  ActiveRecord::Base.establish_connection
end
  • Thin and Puma are multithreaded (although Puma at least has a 'clustered' mode where you can start workers with a -w parameter) and should not work anyway (in multithreaded mode) with MRI Ruby 2.0, because "there is a Global Interpreter Lock (GIL) that ensures only one thread can be run at a time" (but see the sketch just below).
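
As an aside on that last point, here is a minimal standalone sketch (plain MRI, no Rails; not part of my app) showing that MRI releases the GIL while a thread is blocked in sleep or I/O, so sleeping threads do overlap:

# Ten threads, each sleeping 1 second, finish in about 1 second
# on MRI, because sleep releases the GIL.
start = Time.now
threads = 10.times.map { Thread.new { sleep(1) } }
threads.each(&:join)
puts "Elapsed: #{(Time.now - start).round(2)}s"  # ~1.0, not 10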

So,

  • Do I have a valid example (or is using sleep just wrong)?
  • Are my statements above about multiprocess and multithreaded (with respect to MRI Ruby 2) correct?
  • Any ideas on why I can't get it working with Unicorn (or any server for that matter)?

There is a very similar question to mine, but I can't get it working as answered, and it doesn't answer all of my questions about concurrent requests using MRI Ruby.

GitHub project: https://github.com/afrankel/limitedBandwidth (note: the project is looking at more than this question of multi-process/threading on the server)

Halstead answered 29/4, 2015 at 21:55. Comments (8):
I can't reproduce this with unicorn – everything is working as expected. – Queenie
@Queenie – I edited my post with my unicorn config. Do you see anything wrong with what I have listed? – Halstead
Update: Actually I did get it working. I had to increase my sleep to 60 seconds (and my timeout to 180 in the unicorn config), and then tab 1 returned in 1 minute and tabs 2, 3, and 4 returned in 1.3 minutes. So maybe there is some delay in picking up a new worker. I would be interested if anyone could explain my result and also confirm my questions above. – Halstead
More results. I ran with the 60-second sleep using Thin and Puma and got 1, 2, 3, 4 minute results (i.e., not processing concurrently). When I ran Puma in production (i.e., rails s Puma -e production) I did get 1, 1.3, 1.3, 1.3 minutes for the tabs, so that worked as well. Again, I'm looking for explanations of my results. I believe the setting of cache_classes=true in production has something to do with them. Why did Puma work, since I was under the impression that multithreading would not work? – Halstead
If you have an example accessible somewhere (GitHub?) maybe we could check it out ourselves and provide further input. – Above
I created a GitHub project with my example: github.com/afrankel/limitedBandwith Note that my main purpose is something different (to test limited-bandwidth situations from an AngularJS client), but the server piece is what I'm discussing in this post. Thanks for looking. – Halstead
@ArthurFrankel I am having a look. I'm thinking of editing my answer with some useful information on what I have found so far. After your feedback we can improve the answer further, and eventually I can share the modified version that I cloned from your GitHub. – Above
@Elyasin – minor note – I modified the name of my GitHub repo since I realized I had a spelling error :) github.com/afrankel/limitedBandwidth – Halstead

I invite you to read Jesse Storimer's series Nobody Understands the GIL. It might help you better understand some MRI internals.

I have also found Pragmatic Concurrency with Ruby, which is an interesting read. It has some examples of testing concurrently.

EDIT: In addition, I can recommend the article Removing config.threadsafe!. It might not be relevant for Rails 4, but it explains the configuration options, one of which you can use to allow concurrency.


Let's discuss the answer to your question.

You can have several threads (using MRI), even with Puma. The GIL ensures that only one thread is active at a time; that is the constraint developers call restrictive (because there is no real parallel execution). Bear in mind that the GIL does not guarantee thread safety. It also does not mean that the other threads are not running: they are waiting for their turn, and they can interleave (the articles above can help you understand this better).
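
To illustrate with a minimal sketch (plain Ruby, not from the question's repo): CPU-bound work interleaves under the GIL but gains nothing from threads, whereas on JRuby the same threaded run scales with the cores:

require 'benchmark'

# Pure CPU work: no I/O, so MRI has no blocking call during which
# it could hand the GIL to another thread for long.
def cpu_work
  200_000.times { |i| Math.sqrt(i) }
end

serial   = Benchmark.realtime { 4.times { cpu_work } }
threaded = Benchmark.realtime do
  4.times.map { Thread.new { cpu_work } }.each(&:join)
end

puts format('serial: %.2fs  threaded: %.2fs', serial, threaded)
# On MRI both times come out roughly equal; on JRuby the threaded
# run is close to serial divided by the number of cores.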

Let me clear up some terms: worker process and thread. A process runs in a separate memory space and can serve several threads. Threads of the same process run in a shared memory space, which is that of their process. (With threads we mean Ruby threads in this context, not CPU threads.)
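
A tiny sketch of that distinction (assuming MRI on a Unix platform where Kernel#fork is available): a mutation made in a thread is visible to its process, while a mutation made in a forked child stays in the child's copy of memory:

# Threads share their process's memory; a forked child gets a copy.
counter = { value: 0 }

Thread.new { counter[:value] += 1 }.join  # shared memory: visible here
pid = fork { counter[:value] += 100 }     # child's own copy: not visible
Process.wait(pid)

puts counter[:value]  # => 1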

Regarding your question's configuration and the GitHub repo you shared, I think an appropriate configuration (I used Puma) is to set up 4 workers and 1 to 40 threads. The idea is that one worker serves one tab, and each tab sends up to 10 requests.

So let's get started:

I work on Ubuntu in a virtual machine, so I first enabled the 4 cores in my virtual machine's settings (and some other settings that I thought might help). I could verify this on my machine, so I went with that:

Linux command --> lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 69
Stepping:              1
CPU MHz:               2306.141
BogoMIPS:              4612.28
L1d cache:             32K
L1i cache:             32K
L2 cache:              6144K
NUMA node0 CPU(s):     0-3

I used your shared GitHub project and modified it slightly. I created a Puma configuration file named puma.rb (put it in the config directory) with the following content:

workers Integer(ENV['WEB_CONCURRENCY'] || 1)
threads_count = Integer(ENV['MAX_THREADS'] || 1)
threads 1, threads_count

preload_app!

rackup      DefaultRackup
port        ENV['PORT']     || 3000
environment ENV['RACK_ENV'] || 'development'

on_worker_boot do
  # Worker specific setup for Rails 4.1+
  # See: https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server#on-worker-boot
  #ActiveRecord::Base.establish_connection
end

By default Puma is started with 1 worker and 1 thread. You can use environment variables to modify those parameters. I did so:

export MAX_THREADS=40
export WEB_CONCURRENCY=4

To start Puma with this configuration I typed

bundle exec puma -C config/puma.rb

in the Rails app directory.

I opened the browser with four tabs to call the app's URL.

The first request started around 15:45:05 and the last request ended around 15:49:44. That is an elapsed time of 4 minutes and 39 seconds. You can also see the requests' IDs in unsorted order in the log file (see below).

Each API call in the GitHub project sleeps for 15 seconds. We have 4 tabs, each with 10 API calls. That makes a maximum elapsed time of 4 × 10 × 15 = 600 seconds, i.e. 10 minutes (in a strictly serial mode).

The ideal result in theory would be everything in parallel and an elapsed time not far from 15 seconds, but I did not expect that at all. I was not sure what to expect exactly, but I was still positively surprised (considering that I ran on a virtual machine and that MRI is restrained by the GIL and some other factors): the elapsed time of this test was less than half the maximum (strictly serial) elapsed time.

EDIT: I read further about Rack::Lock, which wraps a mutex around each request (third article above). I found the option config.allow_concurrency = true to be a time saver. A small caveat was that the connection pool had to be increased accordingly (even though the requests do not query the database); the maximum number of threads is a good default: 40 in this case. A sketch of where these settings live follows.
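
For reference, a minimal sketch of where this setting can live in a Rails 4.x app (standard file location; your app may differ):

# config/environments/development.rb
Rails.application.configure do
  # Turn off the Rack::Lock middleware's per-request mutex so that
  # several requests can be in flight in one process at the same time.
  config.allow_concurrency = true
end

The connection pool is sized in config/database.yml via the pool: key (for example pool: 40, matching MAX_THREADS).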

I tested the app with JRuby and the actual elapsed time was 2 minutes, with allow_concurrency=true.

I tested the app with MRI and the actual elapsed time was 1 min 47 s, with allow_concurrency=true. This really surprised me, because I expected MRI to be slower than JRuby; it was not. This makes me question the widespread discussion about the speed differences between MRI and JRuby.

The responses on the different tabs arrive in a "more random" order now. It happens that tab 3 or 4 completes before tab 1, which I requested first.

I think that because you don't have race conditions, the test seems to be OK. However, I am not sure about the application-wide consequences of setting config.allow_concurrency = true in a real-world application.

Feel free to check it out and let me know any feedback you readers might have. I still have the clone on my machine. Let me know if you are interested.

To answer your questions in order:

  • I think your example is valid by its result. For concurrency, however, it is better to test with shared resources (as, for example, in the second article).
  • Regarding your statements: as mentioned at the beginning of this answer, MRI is multi-threaded, but restricted by the GIL to one active thread at a time. This raises the question: with MRI, isn't it better to test with more processes and fewer threads? I don't really know; a first guess would be no, or not much of a difference. Maybe someone can shed light on this.
  • Your example is just fine, I think; it just needed some slight modifications.

Appendix

Rails app log files:

**config.allow_concurrency = false (by default)**
-> Ideally 1 worker per core, each worker serving up to 10 threads.

[3045] Puma starting in cluster mode...
[3045] * Version 2.11.2 (ruby 2.1.5-p273), codename: Intrepid Squirrel
[3045] * Min threads: 1, max threads: 40
[3045] * Environment: development
[3045] * Process workers: 4
[3045] * Preloading application
[3045] * Listening on tcp://0.0.0.0:3000
[3045] Use Ctrl-C to stop
[3045] - Worker 0 (pid: 3075) booted, phase: 0
[3045] - Worker 1 (pid: 3080) booted, phase: 0
[3045] - Worker 2 (pid: 3087) booted, phase: 0
[3045] - Worker 3 (pid: 3098) booted, phase: 0
Started GET "/assets/angular-ui-router/release/angular-ui-router.js?body=1" for 127.0.0.1 at 2015-05-11 15:45:05 +0800
...
...
...
Processing by ApplicationController#api_call as JSON
  Parameters: {"t"=>"15?id=9"}
Completed 200 OK in 15002ms (Views: 0.2ms | ActiveRecord: 0.0ms)
[3075] 127.0.0.1 - - [11/May/2015:15:49:44 +0800] "GET /api_call.json?t=15?id=9 HTTP/1.1" 304 - 60.0230

**config.allow_concurrency = true**
-> Ideally 1 worker per core, each worker serving up to 10 threads.

[22802] Puma starting in cluster mode...
[22802] * Version 2.11.2 (ruby 2.2.0-p0), codename: Intrepid Squirrel
[22802] * Min threads: 1, max threads: 40
[22802] * Environment: development
[22802] * Process workers: 4
[22802] * Preloading application
[22802] * Listening on tcp://0.0.0.0:3000
[22802] Use Ctrl-C to stop
[22802] - Worker 0 (pid: 22832) booted, phase: 0
[22802] - Worker 1 (pid: 22835) booted, phase: 0
[22802] - Worker 3 (pid: 22852) booted, phase: 0
[22802] - Worker 2 (pid: 22843) booted, phase: 0
Started GET "/" for 127.0.0.1 at 2015-05-13 17:58:20 +0800
Processing by ApplicationController#index as HTML
  Rendered application/index.html.erb within layouts/application (3.6ms)
Completed 200 OK in 216ms (Views: 200.0ms | ActiveRecord: 0.0ms)
[22832] 127.0.0.1 - - [13/May/2015:17:58:20 +0800] "GET / HTTP/1.1" 200 - 0.8190
...
...
...
Completed 200 OK in 15003ms (Views: 0.1ms | ActiveRecord: 0.0ms)
[22852] 127.0.0.1 - - [13/May/2015:18:00:07 +0800] "GET /api_call.json?t=15?id=10 HTTP/1.1" 304 - 15.0103

**config.allow_concurrency = true (by default)**
-> Ideally each thread serves a request.

Puma starting in single mode...
* Version 2.11.2 (jruby 2.2.2), codename: Intrepid Squirrel
* Min threads: 1, max threads: 40
* Environment: development
NOTE: ActiveRecord 4.2 is not (yet) fully supported by AR-JDBC, please help us finish 4.2 support - check http://bit.ly/jruby-42 for starters
* Listening on tcp://0.0.0.0:3000
Use Ctrl-C to stop
Started GET "/" for 127.0.0.1 at 2015-05-13 18:23:04 +0800
Processing by ApplicationController#index as HTML
  Rendered application/index.html.erb within layouts/application (35.0ms)
...
...
...
Completed 200 OK in 15020ms (Views: 0.7ms | ActiveRecord: 0.0ms)
127.0.0.1 - - [13/May/2015:18:25:19 +0800] "GET /api_call.json?t=15?id=9 HTTP/1.1" 304 - 15.0640
Above answered 7/5, 2015 at 9:56. Comments (4):
Thank you! Let me chew on this and respond soon. – Halstead
On a side note, which is out of the scope of your question: I ported your example to run in a JRuby environment. The test took 2 minutes (and 1 second). – Above
@ArthurFrankel I edited the response with some new insights from setting allow_concurrency=true with JRuby and MRI. – Above
This is great stuff! As a suggestion, maybe you can add a table of your results, altering the number of workers and threads with allow_concurrency on/off. For example: 1 thread / 4 workers (since you have 4 cores); 10 threads / 1 worker; and so forth. Then maybe also change 15 seconds to 60 seconds; I have noticed in my tests that 60 seconds allows workers to start up and kick in. All suggestions. Also, if you ever want to move to DC (Northern Virginia), you have a job! – Halstead

To both @Elyasin and @Arthur Frankel: I created this repo for testing Puma running on MRI and JRuby. In this small project I didn't use sleep to emulate a long-running request, because I found that in MRI the GIL appears to treat sleep differently from regular processing, more like an external I/O request.

Instead, I put a Fibonacci sequence calculation in the controller (sketched below). On my machine, fib(39) took 6.x seconds in JRuby and 11 seconds in MRI, which is enough to show the difference.
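
A minimal sketch of what such a CPU-bound action could look like (controller and method names here are illustrative, not necessarily the repo's actual code):

class BenchController < ApplicationController
  def fib_call
    n = (params[:n] || 39).to_i
    render json: { n: n, fib: fib(n) }
  end

  private

  # Deliberately naive recursion: pure CPU work with no I/O,
  # so MRI's GIL is never released while it runs.
  def fib(n)
    n < 2 ? n : fib(n - 1) + fib(n - 2)
  end
end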

I opened 2 browser windows instead of 2 tabs in the same browser, to avoid the restrictions browsers place on concurrent requests to the same domain. I'm not sure of the details, but 2 different browsers suffice to prevent that from happening.

I tested thin + MRI, and Puma + MRI, then Puma + JRuby. The results are:

  1. Thin + MRI: no surprise here. When I quickly reloaded the 2 browsers, the first one finished after 11 seconds; then the second request started and took another 11 seconds to finish.

  2. Let's talk about Puma + JRuby first. As I quickly reloaded the 2 browsers, they appeared to start at nearly the same second and finished at the same second too. Both took around 6.9 seconds to finish. Puma is a multi-threaded server and JRuby supports multi-threading.

  3. Finally, Puma + MRI: it took 22 seconds to finish in both browsers after I quickly reloaded them. They started at nearly the same second and finished at nearly the same second as well, but it took twice the time for both to finish. That's exactly what the GIL does: it switches between the threads for concurrency, but the lock itself prevents parallelism from happening.

About my setup:

  • Servers were all launched in Rails production mode. In production mode, config.cache_classes is set to true, which implies config.allow_concurrency = true
  • Puma was started with 8 threads min and 8 threads max (see the launch command sketched below).
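
For reference, one plausible way to launch Puma with these settings (using Puma's standard -t min:max CLI flag; my exact invocation may have differed):

RAILS_ENV=production bundle exec puma -t 8:8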
Delgadillo answered 26/12, 2015 at 10:45.
