How many threads does Clojure's pmap function spawn for URL-fetching operations?
G

4

22

The documentation on the pmap function leaves me wondering how efficient it would be for something like fetching a collection of XML feeds over the web. I have no idea how many concurrent fetch operations pmap would spawn and what the maximum would be.

Gallaway answered 16/2, 2011 at 20:33 Comment(0)
H
25

If you check the source you see:

> (use 'clojure.repl)
> (source pmap)
(defn pmap
  "Like map, except f is applied in parallel. Semi-lazy in that the
  parallel computation stays ahead of the consumption, but doesn't
  realize the entire result unless required. Only useful for
  computationally intensive functions where the time of f dominates
  the coordination overhead."
  {:added "1.0"}
  ([f coll]
   (let [n (+ 2 (.. Runtime getRuntime availableProcessors))
         rets (map #(future (f %)) coll)
         step (fn step [[x & xs :as vs] fs]
                (lazy-seq
                 (if-let [s (seq fs)]
                   (cons (deref x) (step xs (rest s)))
                   (map deref vs))))]
     (step rets (drop n rets))))
  ([f coll & colls]
   (let [step (fn step [cs]
                (lazy-seq
                 (let [ss (map seq cs)]
                   (when (every? identity ss)
                     (cons (map first ss) (step (map rest ss)))))))]
     (pmap #(apply f %) (step (cons coll colls))))))

The (+ 2 (.. Runtime getRuntime availableProcessors)) is a big clue there. pmap will grab the first (+ 2 processors) pieces of work and run them asynchronously via future. So if you have 2 cores, it's going to launch 4 pieces of work at a time, trying to keep a bit ahead of you, but the maximum in flight should be the number of processors plus 2.

future ultimately uses the agent I/O thread pool which supports an unbounded number of threads. It will grow as work is thrown at it and shrink if threads are unused.

Hallerson answered 16/2, 2011 at 22:19 Comment(4)
So is the short answer that pmap is perfectly fine for dispatching a lot of web calls and processing the responses? Are there any caveats? - Gallaway
I may be wrong, but the issue will probably be that the n+2 threads will block waiting for web responses. So you won't get enough in-flight requests for maximum throughput; pmap is really intended for CPU-bound workloads. If this is happening to you, then you can just wrap each request call in a future and they will all fly off at once. - Orlena
Well, there's never a short answer with concurrency. :) I'd say that pmap is not actually ideal for this use case. You really want to wait for all of the sources in parallel, and pmap will delay starting the 5th one in the case above. UNLESS you don't necessarily want to get through all your sources, in which case pmap's lazy behavior is good. For your case I would be tempted to map over the sources instead and use future to make each request. - Hallerson
I wonder why an argument wasn't added to pmap to control the number of threads in different ways (for example, for cases where the memory consumed by each worker is a concern, and also in general). - Lazar
O
12

Building on Alex's excellent answer that explains how pmap works, here's my suggestion for your situation:

(doall
  (map
    #(future (my-web-fetch-function %))
    list-of-xml-feeds-to-fetch))

Rationale:

  • You want as many pieces of work in flight as you can, since most will block on network IO.
  • future will fire off an asynchronous piece of work for each request, to be handled in a thread pool. You can let Clojure take care of that intelligently.
  • The doall on the map will force the evaluation of the full sequence (i.e. the launch of all the requests).
  • Your main thread can start dereferencing the futures right away, and can therefore continue making progress as the individual results come back.
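
To make the launch-then-dereference pattern concrete, here is a minimal sketch. fake-fetch and feed-urls are hypothetical stand-ins (not from the original answer) for my-web-fetch-function and list-of-xml-feeds-to-fetch; the sleep only simulates blocking on the network.

```clojure
(defn fake-fetch
  "Hypothetical stand-in for my-web-fetch-function: pretends to block on
  network I/O, then returns a string describing the feed."
  [url]
  (Thread/sleep 200)
  (str "response from " url))

(def feed-urls
  ;; Placeholder URLs, for illustration only.
  ["http://example.com/a.xml"
   "http://example.com/b.xml"
   "http://example.com/c.xml"])

;; mapv is eager, so every future is launched before the first deref.
(def pending (mapv #(future (fake-fetch %)) feed-urls))

;; deref blocks until each result is ready; results keep the input order.
(def responses (mapv deref pending))

(println responses)
```

If a single slow feed should not stall everything, deref also has a timeout arity, e.g. (deref fut 5000 :timed-out), which returns the fallback value when the result is not ready in time.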
Orlena answered 17/2, 2011 at 0:29 Comment(2)
I think futures use an unbounded thread pool, so running this on a large collection of feeds could cause problems. - Moses
Also, you'd probably want to map over that again to deref the futures so you know when everything's finished. - Brackish
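
If the unbounded pool is a concern, one option is to submit the work to a fixed-size pool from java.util.concurrent instead of using future. The following is only a sketch under that assumption; bounded-pmap is a hypothetical helper name, not part of Clojure, and the right pool size depends on the remote servers and on memory per request.

```clojure
(import '(java.util.concurrent Executors ExecutorService Future))

(defn bounded-pmap
  "Hypothetical helper: applies f to every item of coll on a fixed pool of
  n-threads threads and returns a vector of results in input order.
  At most n-threads calls to f run at the same time."
  [n-threads f coll]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n-threads)]
    (try
      ;; Clojure functions implement Callable, so zero-argument closures
      ;; can be handed to invokeAll directly.
      (let [tasks (mapv (fn [x] (fn [] (f x))) coll)]
        ;; invokeAll blocks until every task has completed.
        (mapv (fn [^Future fut] (.get fut)) (.invokeAll pool tasks)))
      (finally
        (.shutdown pool)))))

;; Usage with a stand-in task: at most 2 calls in flight at once.
(println (bounded-pmap 2 (fn [n] (Thread/sleep 100) (* n n)) [1 2 3 4 5]))
```

Unlike pmap, this version is eager: it returns only after every item has been processed, and an exception in any task surfaces as an ExecutionException from .get.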
N
3

No time to write a long response, but there's a clojure.contrib http-agent which creates each get/post request as its own agent. So you can fire off a thousand requests and they'll all run in parallel and complete as the results come in.

Northeastward answered 22/2, 2011 at 11:14 Comment(0)
F
2

Looking at how pmap operates, it seems to run 32 threads at a time no matter how many processors you have. The issue is that map runs ahead of the computation by 32 elements (range returns a chunked sequence, and map realizes a whole chunk at a time), and each future starts running as soon as it is created. Sample:

(defn samplef [n]
  (println "starting " n)
  (Thread/sleep 10000)
  n)

(def result (pmap samplef (range 0 100)))

; You will wait for 10 seconds and see 32 prints; then, when you take the 33rd, another 32
; prints. This means that you are running 32 concurrent threads at a time.
; To me this is not perfect.
; Saludos, Felipe
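
The batches of 32 are consistent with chunked sequences being realized 32 elements at a time. As a rough way to check this, the sketch below measures the peak number of concurrently running tasks for a chunked input and for the same input wrapped so that it is realized one element at a time. unchunk and peak-concurrency are hypothetical helpers written for this illustration; the exact numbers will vary by machine and Clojure version, so none are claimed here.

```clojure
(defn unchunk
  "Hypothetical helper: wraps s in a lazy seq that is realized one element
  at a time instead of in chunks."
  [s]
  (lazy-seq
    (when-let [xs (seq s)]
      (cons (first xs) (unchunk (rest xs))))))

(defn peak-concurrency
  "Runs pmap over coll with a task that sleeps 50 ms and returns the highest
  number of tasks observed running at the same time."
  [coll]
  (let [running (atom 0)
        peak    (atom 0)
        task    (fn [x]
                  ;; Count this task as running and record a new peak if any.
                  (swap! peak max (swap! running inc))
                  (Thread/sleep 50)
                  (swap! running dec)
                  x)]
    (dorun (pmap task coll))
    @peak))

(println "processors:      " (.availableProcessors (Runtime/getRuntime)))
(println "chunked input:   " (peak-concurrency (range 100)))
(println "unchunked input: " (peak-concurrency (unchunk (range 100))))
```

With the unchunked input the peak should stay close to the processors + 2 window from the pmap source, while the chunked input is expected to show larger batches. Either way the window is sized for CPU-bound work, not for calls that mostly wait on the network.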

Fugleman answered 4/4, 2014 at 16:58 Comment(0)
