Parallelism with @sync @async in Julia

I have some heavy CSV tables that I would like to import in parallel with the @sync and @async macros. I am not very familiar with them, so I tried it this way:

#import files
@sync @async begin
    df1=CSV.File(libname*"df1.csv")|> DataFrame!
    df2=CSV.File(libname*"df2.csv")|> DataFrame!
end

The task completes, but the subsetting I do afterwards seems to be affected:

select!(df1, Not("Var1"))

ArgumentError : Column :Var1 not found in the data frame

PS: without the @sync macro the code works well.

I am probably doing something wrong. Any ideas would be helpful. Thanks.

Tradein answered 22/10, 2020 at 19:30 Comment(0)

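In your code, @sync @async does nothing useful: it wraps the whole begin ... end block in a single task, so the two files are still read one after the other, and the body of that task has its own local scope.

Because of that new scope, the assignments inside the block create local variables named df1 and df2 and never modify the global df1 and df2. After the block you are therefore still seeing their old values.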
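The scoping effect can be reproduced without any CSV files. The following is a minimal sketch, assuming it is run at top level (a script or the REPL); the names `x` and `result` are hypothetical stand-ins for df1 and are not from the question.

```julia
x = 0                        # a global, standing in for df1

@sync @async begin
    x = 1                    # new local inside the task; the global x is not touched
end

println("x after the task: ", x)

# One way to get a value out of a task: write into a container that
# already exists instead of assigning to a variable name.
result = Ref(0)
@sync @async begin
    result[] = 1             # mutates the existing Ref; no new variable is created
end

println("result after the task: ", result[])
```

Writing into an existing container is the same idea as filling `dfs[1]` and `dfs[2]` in a preallocated vector.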

If I/O is the bottleneck in your code, the correct version would be the following, with one task per file:

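using CSV, DataFrames

# Preallocate one slot per file; each task writes only to its own slot.
dfs = Vector{DataFrame}(undef, 2)
@sync begin  # wait until both tasks have finished
    @async dfs[1] = CSV.File(libname * "df1.csv") |> DataFrame!
    @async dfs[2] = CSV.File(libname * "df2.csv") |> DataFrame!
end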
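To see why one @async per file matters, here is a small sketch that uses `sleep` as a stand-in for a slow read. The function `fake_read` and the file names are hypothetical and not part of CSV.jl. The point is only that the two waits can overlap when each call gets its own task.

```julia
# Hypothetical stand-in for an I/O-bound read: the task yields while it waits.
function fake_read(name)
    sleep(0.5)
    return "$name loaded"
end

files = ["df1.csv", "df2.csv"]
results = Vector{String}(undef, length(files))

elapsed = @elapsed @sync for (i, f) in enumerate(files)
    @async results[i] = fake_read(f)   # one task per file
end

println(results)
# Compare with 2 * 0.5 s, which is what two sequential waits would take.
println("elapsed: ", round(elapsed, digits = 2), " s")
```

This overlap only helps when the tasks spend their time waiting. For CPU-heavy work the tasks would still take turns on a single thread.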

However, usually the bottleneck is not the I/O but the CPU. In that case green threads (tasks started with @async) are not much use, because they all run on the same OS thread, and you need regular threads:

dfs = Vector{DataFrame}(undef, 2)
# Each iteration may run on a different thread and fills only its own slot.
Threads.@threads for i in 1:2
    dfs[i] = CSV.File(libname * "df$i.csv") |> DataFrame!
end
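As a self-contained illustration of the same pattern on CPU-bound work, the sketch below replaces the CSV parsing with a hypothetical function `busy_sum`. Each iteration writes only to its own index of a preallocated vector, so the iterations do not race with each other.

```julia
# Hypothetical CPU-bound stand-in for parsing a file.
function busy_sum(n)
    s = 0.0
    for k in 1:n
        s += sqrt(k)
    end
    return s
end

sizes = [1_000_000, 2_000_000]
sums = Vector{Float64}(undef, length(sizes))

Threads.@threads for i in eachindex(sizes)
    sums[i] = busy_sum(sizes[i])   # each iteration owns slot i
end

println(sums)
```

With only one thread available the loop still produces the same values, just without any speedup.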

Note that for this code to use multi-threading you need to set the JULIA_NUM_THREADS environment variable before starting Julia, for example (Windows command prompt; on Linux or macOS use export JULIA_NUM_THREADS=2):

set JULIA_NUM_THREADS=2
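
A quick way to confirm that the setting took effect is to ask Julia how many threads it started with. This is only a sketch; since Julia 1.5 the same number can also be given on the command line as julia --threads 2.

```julia
n = Threads.nthreads()
println("Julia is running with ", n, " thread(s)")

if n == 1
    # With a single thread, Threads.@threads still runs, just without parallelism.
    @warn "Only one thread available; set JULIA_NUM_THREADS before starting Julia"
end
```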
Androgen answered 22/10, 2020 at 19:32 Comment(1)
Thanks, very useful to me. – Tradein
