I have a large simulation on a DataFrame df which I am trying to parallelize, saving the results of the simulations in a DataFrame called simulation_results.
The parallelization loop is working just fine. The problem is that if I were to store the results in an array, I would declare it as a SharedArray before the loop. I don't know how to declare simulation_results as a "shared DataFrame" that is available to all processors and can be modified by them.
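For comparison, the array version I have in mind looks roughly like the sketch below. It is only an illustration: the two local workers, the small nsims, and the squared value standing in for one simulation are assumptions made for the example, not part of my real code.

```julia
using Distributed
addprocs(2)                       # assumption: two local workers for the sketch
@everywhere using SharedArrays

nsims = 8                         # kept tiny for illustration

# Allocated once on the master before the loop; local workers map the same memory.
results = SharedArray{Float64}((nsims,))

@sync @distributed for sim in 1:nsims
    results[sim] = sim^2          # hypothetical stand-in for one simulation's result
end

println(sum(results))             # the master sees what the workers wrote
```

This works because a SharedArray holds elements of a plain bits type in shared memory, which is exactly what a DataFrame is not.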
A code snippet of my current DataFrame attempt is as follows:
using Distributed
addprocs(length(Sys.cpu_info()))
@everywhere begin
    using <required packages>
    df = CSV.read("/path/data.csv", DataFrame)
    # I need to declare this as shared and modifiable by all processors
    simulation_results = similar(df, 0)
    nsims = 100000
end
@sync @distributed for sim in 1:nsims
    nsim_result = similar(df, 0)
    <the code which for one simulation stores the results in nsim_result>
    append!(simulation_results, nsim_result)
end
The problem is that since simulation_results is not declared to be shared and modifiable by all processors, it is still basically an empty DataFrame after the loop runs, exactly as it was created by simulation_results = similar(df, 0) inside the @everywhere block.
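The same behaviour can be reproduced without DataFrames. The sketch below is a minimal stand-in for my code, assuming two local workers and using a plain Vector named results (a hypothetical name) in place of simulation_results. As far as I can tell, each worker pushes into its own copy, so the copy on the master process stays empty.

```julia
using Distributed
addprocs(2)                  # assumption: two local workers for the sketch

@everywhere results = Int[]  # each process ends up with its own independent vector

@sync @distributed for i in 1:10
    push!(results, i)        # runs on a worker and mutates only that worker's copy
end

println(length(results))     # length of the master's copy after the loop
```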
Would really appreciate any help on this! Thanks!