Julia: use of pmap with Arrays vs SharedArrays

I have been working in Julia for a few months now and I am interested in writing some of my code in parallel. I am working on a problem where I use one model to generate data for several different receivers (the data for each receiver is a vector). The data for each receiver can be computed independently, which leads me to believe I should be able to use the pmap function. My plan is to initialize the data as a 2D SharedArray (each column represents the data for one receiver) and then have pmap loop over each of the columns. However, I am finding that using SharedArrays with pmap is no faster than working in serial using map. I wrote the following dummy code to illustrate this point.

# Fill column icol of data: for each iz, accumulate A[iz,ix,ih]*B[iz,ix,ih]*model[iz,ix,ih] over ix and ih
@everywhere function Dummy(icol,model,data,A,B)
    nx = 250
    nz = 250
    nh = 50
    for ih = 1:nh
        for ix = 1:nx
            for iz = 1:nz
                data[iz,icol] += A[iz,ix,ih]*B[iz,ix,ih]*model[iz,ix,ih]
            end
        end
    end
end


function main()

    nx = 250
    nz = 250
    nh = 50

    nt = 500
    ncol = 100

    model1 = rand(nz,nx,nh)
    model2 = copy(model1)
    model3 = convert(SharedArray,model1)

    data1 = zeros(Float64,nt,ncol)
    data2 = SharedArray(Float64,nt,ncol)
    data3 = SharedArray(Float64,nt,ncol)

    A1 = rand(nz,nx,nh)
    A2 = copy(A1)
    A3 = convert(SharedArray,A1)

    B1 = rand(nz,nx,nh)
    B2 = copy(B1)
    B3 = convert(SharedArray,B1)


    @time map((arg)->Dummy(arg,model1,data1,A1,B1),[icol for icol = 1:ncol])
    @time pmap((arg)->Dummy(arg,model2,data2,A2,B2),[icol for icol = 1:ncol])
    @time pmap((arg)->Dummy(arg,model3,data3,A3,B3),[icol for icol = 1:ncol])

    println(data1==data2)
    println(data1==data3)

end

main() 

I start the Julia session with julia -p 3 and run the script. The times for the three tests are 1.4 s, 4.7 s, and 1.6 s respectively. Using SharedArrays with pmap (1.6 s runtime) hasn't provided any improvement in speed compared with regular Arrays with map (1.4 s). I am also confused as to why the second case (data as a SharedArray, all other inputs as regular Arrays, with pmap) is so slow. What do I need to change in order to benefit from working in parallel?

Doerr asked 20/8, 2016 at 21:31

Preface: yes, there actually is a solution to your issue. See code at bottom for that. But, before I get there, I'll go into some explanation.

I think the root of the problem here is memory access. First off, although I haven't rigorously investigated it, I suspect that a moderate number of improvements could be made to Julia's underlying code to improve how it handles memory access in parallel processing. Nevertheless, in this case, I suspect that any underlying issues with the base code, if they actually exist, aren't so much at fault. Instead, I think it is useful to think carefully about what exactly is going on in your code and what it means vis-a-vis memory access.

  1. A key thing to keep in mind when working in Julia is that it stores Arrays in column-major order. That is, it stores them as stacks of columns on top of each other. This generalizes to dimensions > 2 as well. See this very helpful segment of the Julia performance tips for more info. The implication is that it is fast to access one element after another down a single column, but if you need to be jumping around between columns, then you get into trouble. Yes, accessing RAM might be relatively fast, but accessing cache is much, much faster, so if your code allows a single column or so to be loaded from RAM into cache and then worked on, you'll do much better than if you need to do lots of swapping between RAM and cache. Here in your code, you're switching from column to column between your computations like nobody's business. For instance, in your pmap each process gets a different column of the shared array to work on. Then, each goes down the rows of that column and modifies the values in it. But, since they are trying to work in parallel with one another, and the whole array is too big to fit into your cache, there is lots of swapping between RAM and cache, and that really slows you down. In theory, perhaps a clever enough under-the-hood memory management system could be devised to address this, but I don't really know - that's above my pay grade. The same thing, of course, is happening with your accesses to your other objects. (A short standalone sketch of this column-major effect follows this item.)
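
To make the column-major point concrete, here is a small standalone sketch (not part of the code in the question; the function names and the 5000x5000 size are just for illustration). The only difference between the two functions is which index the inner loop runs over:

function sum_by_columns(X)
    s = 0.0
    for j = 1:size(X,2)
        for i = 1:size(X,1)   # inner loop walks down one column: contiguous memory
            s += X[i,j]
        end
    end
    return s
end

function sum_by_rows(X)
    s = 0.0
    for i = 1:size(X,1)
        for j = 1:size(X,2)   # inner loop jumps between columns: strided access
            s += X[i,j]
        end
    end
    return s
end

X = rand(5000,5000)
sum_by_columns(X); sum_by_rows(X)   # run once to compile
@time sum_by_columns(X)             # cache-friendly
@time sum_by_rows(X)                # typically several times slower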

  2. Another thing to keep in mind in general when parallelizing is your ratio of flops (i.e. computer calculations) to read/write operations. Flops tend to parallelize well: you can have different cores, processes, etc. doing their own little computations on their own bits of data that they hold in their tiny caches. But read/write operations don't parallelize so well. There are some things that can be done to engineer hardware systems to improve on this. But in general, if you have a given computer system with, say, two cores, and you add four more cores to it, your ability to perform flops will increase threefold, but your ability to read/write data to/from RAM won't really improve so much. (Note: this is an oversimplification; a lot depends on your system.) Nevertheless, in general, the higher your ratio of flops to read/writes, the more benefit you can expect from parallelism. In your case, your code involves a decent number of read/writes (all of those accesses to your different arrays) for a relatively small number of flops (a few multiplications and an addition). It's just something to keep in mind.

  3. Fortunately, your case is amenable to some good speedups from parallelism if written correctly. In my experience with Julia, all of my most successful parallelism comes when I can break data up and have workers process chunks of it separately. Your case happens to be amenable to that. Below is an example of some code I wrote that does that. You can see that it gets better than a 2x increase in speed going from one processor to three. The code is a bit crude in places, but it at least demonstrates the general idea of how something like this could be approached. I give a few comments on the code afterwards.

addprocs(3)

nx = 250;
nz = 250;
nh = 50;
nt = 250;   # note: nt == nz here, so the data rows split across workers the same way as the model/A/B rows
@everywhere ncol = 100;

model = rand(nz,nx,nh);

data = SharedArray(Float64,nt,ncol);

A = rand(nz,nx,nh);

B = rand(nz,nx,nh);

# Really a "declare" function: bind the symbol obj_name_on_worker in each
# worker's Main module to that worker's block of rows of X.
function distribute_data(X, obj_name_on_worker::Symbol, dim)
    size_per_worker = floor(Int,size(X,1) / nworkers())
    StartIdx = 1
    EndIdx = size_per_worker
    for (idx, pid) in enumerate(workers())
        if idx == nworkers()
            EndIdx = size(X,1)   # the last worker picks up any leftover rows
        end
        println(StartIdx:EndIdx)
        if dim == 3
            @spawnat(pid, eval(Main, Expr(:(=), obj_name_on_worker, X[StartIdx:EndIdx,:,:])))
        elseif dim == 2
            @spawnat(pid, eval(Main, Expr(:(=), obj_name_on_worker, X[StartIdx:EndIdx,:])))
        end
        StartIdx = EndIdx + 1
        EndIdx = EndIdx + size_per_worker
    end
end

distribute_data(model, :model, 3)
distribute_data(A, :A, 3)
distribute_data(B, :B, 3)
distribute_data(data, :data, 2)

@everywhere function Dummy(icol,model,data,A,B)
    # sizes are read from the arrays themselves, so this works on each worker's chunk
    nx = size(model, 2)
    nz = size(A,1)
    nh = size(model, 3)
    for ih = 1:nh
        for ix = 1:nx
            for iz = 1:nz
                data[iz,icol] += A[iz,ix,ih]*B[iz,ix,ih]*model[iz,ix,ih]
            end
        end
    end
end

regular_test() = map((arg)->Dummy(arg,model,data,A,B),[icol for icol = 1:ncol])

function parallel_test()
    # each worker runs the serial map over its own chunk; the master (pid 1) holds no chunk
    @everywhere begin
        if myid() != 1
            map((arg)->Dummy(arg,model,data,A,B),[icol for icol = 1:ncol])
        end
    end
end

@time regular_test(); # 2.120631 seconds (307 allocations: 11.313 KB)
@time parallel_test(); # 0.918850 seconds (5.70 k allocations: 337.250 KB)

# fetch the object bound to symbol nm in module mod on process p
getfrom(p::Int, nm::Symbol; mod=Main) = fetch(@spawnat(p, getfield(mod, nm)))

# pull each worker's chunk back to the master and stack the chunks by rows
function recombine_data(Data::Symbol)
    Results = cell(nworkers())
    for (idx, pid) in enumerate(workers())
        Results[idx] = getfrom(pid, Data)
    end
    return vcat(Results...)
end

@time P_Data = recombine_data(:data); # 0.003132 seconds

P_Data == data  ## true

Comments

  • The use of the SharedArray is quite superfluous here. I just used it since it lends itself easily to being modified in place, which is how your code is originally written. This let me work more directly from what you wrote without modifying it as much. (A short sketch of a plain-Array variant appears at the end of this list.)

  • I didn't include the step to bring the data back in the time trial, but as you can see, it's quite a trivial amount of time in this case. In other situations, it might be less trivial, but data movement is just one of those issues that you face with parallelism.

  • When doing time trials in general, it is considered best practice to run the function once (in order to compile the code) and then run it again to get the times. That's what I did here.

  • See this SO post for where I got inspiration for some of these functions that I used here.
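
For what it's worth, here is a rough sketch of that plain-Array variant (not something I timed above; local_data is just an illustrative name). Each worker allocates an ordinary Array sized to its own chunk of the output, runs the same per-column loop, and the recombine_data function above gathers the pieces:

@everywhere begin
    if myid() != 1
        # each worker's model chunk has the same number of rows as its slice of data,
        # so a plain Array sized to the local chunk can stand in for the SharedArray
        local_data = zeros(size(model,1), ncol)
        map((arg)->Dummy(arg,model,local_data,A,B),[icol for icol = 1:ncol])
    end
end

# note: Dummy accumulates with +=, so local_data holds however many passes you run
P_Data2 = recombine_data(:local_data)   # gather and stack the per-worker chunks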

Touchhole answered 20/8, 2016 at 23:43
This appears to achieve what I was hoping for. I do have a couple of newbie questions for clarification, as I am quite new to many of these concepts. First, I am somewhat confused by the use of obj_name_on_worker::Symbol. Is it necessary to use a symbol for this purpose, or could the same thing be achieved by using an array on each of the workers? Second, I am confused as to why distribute_data needs to be run for all of the arrays (rather than just the data). Intuitively (and probably naively) I would think that each worker would need access to the full model, A, and B. – Doerr
@user6443519 Sure thing. As to your first question, I'm working on getting around to writing up a fuller explanation in the Julia docs, but haven't quite gotten to it. The gist, though, is that the distribute function is really just a "declare" function - i.e. it isn't so much moving data as assigning a symbol in each worker's space to be associated with data that already exists. See my writeup in the documentation on spawnat for a bit more insight into this, and I'll let you know once I finish my more detailed writeup. – Touchhole
@user6443519 If I'm not mistaken, it looks like you might be coming from a C background? You can thus think of it as roughly analogous to giving each worker a new pointer to the array. But the takeaway is that once it is run, the arrays are accessible on the workers just like any regular array or other object - the syntax in the function is just the means to that end. – Touchhole
@user6443519 As for your second question, the reason I distributed the other objects (A, B, etc.) is (1) because it is feasible, since each worker only needs to access a portion of each object in order to complete all of its tasks, and (2) so that you don't run into the same problem of competing workers switching back and forth between memory accesses to different parts of, e.g., object A in different columns at nearly the same time. Let me know if these make sense and get at your questions. – Touchhole
@user6443519 It's possible that you could get away without distributing the arrays; I didn't specifically try, and it doesn't hurt to experiment a bit. I suspect, though, that the memory access issue would cause problems if you didn't. – Touchhole
@user6443519 As for your first question - more than anything, it just comes down to the specifics of the spawnat macro, and of macros in general, which take symbols and expressions as their arguments rather than objects. See my writeup here for a bit more on the macro thing: #38987264 – Touchhole
Thanks for the thorough explanation. I have a much better understanding of what is going on now. – Doerr
@user6443519 Glad to hear it! – Touchhole
Actually, there is one more issue I am running into. Ultimately I intend to place all of these functions into a module so that I can later load them in and pass any arbitrary model, A, and B as inputs; I posted an edit that shows my attempt to pack these functions into a module. However, I run into an issue with parallel_test where Dummy is not defined on the workers. – Doerr
@user6443519 In that case, I recommend asking and framing a new question about getting multiple workers access to functions in a module. I'll take a look at it when you do so. It's better that way than to tack too many things onto a single question, since smaller questions will make it easier for future people to find what they are looking for. – Touchhole
Thanks for the suggestion. I have started a new question here. – Doerr
