Time taken to read a large CSV file in Julia
I have a large CSV file: almost 28 million rows and 57 columns, 8.21 GB in total. The data is of mixed types - integers, strings, floats - but nothing unusual.

When I load it in Python/pandas, it takes 161 seconds using the following code.

import pandas as pd

df = pd.read_csv("file.csv", header=0, low_memory=False)

In Julia, it takes a lot longer - over an hour. UPDATE: I am not sure why, but when I ran the code this morning (twice, to check), it took around 702 and 681 seconds. This is much better than an hour, but it is still way slower than Python.

My Julia code is also pretty simple:

using CSV, DataFrames

df = CSV.File("file.csv") |> DataFrame

Am I doing something wrong? Is there something I can do to speed it up? Or is this just the price you pay to play with Julia?

Tetrad asked 20/8, 2020 at 11:42

Comments (7):
This is unexpected - it should be much faster than 1 hour. What version of Julia and CSV.jl are you on, and how many threads are you using? – Overdraw
Julia 1.5.0, CSV v0.7.7, DataFrames v0.21.6 – Tetrad
Can you share the file at all, even if just privately with me? (I'm the primary CSV.jl author.) Happy to take a look and see what might be causing the slowdown. Are you starting Julia with multiple threads? You can do this like julia -t 8, which will result in CSV.jl using 8 threads by default to parse the file. – Hobble
Thanks - increasing the number of threads to 6 (my machine has 6 cores) reduced the read time to 228 seconds. I am sorry, but I cannot share the file. Thanks again for the suggestion, most helpful. – Tetrad
How much RAM do you have? I can load a 13 GB file in about 80 s. – Darmit
I have 64 GB - that should be plenty. – Tetrad
Can you share the file after making all the data random but keeping the same types? – Jemina
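For reference, a minimal sketch of the threaded setup suggested in the comments above. This assumes a reasonably recent CSV.jl, where multithreaded parsing kicks in automatically when Julia itself is started with more than one thread; the file name is a placeholder:

# Start Julia with multiple threads from the shell, e.g.:  julia -t 6
using CSV, DataFrames

@show Threads.nthreads()  # the thread count CSV.jl will pick up by default

# With >1 thread available, CSV.jl parses large files in parallel automatically
df = CSV.read("file.csv", DataFrame)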
From the CSV.jl documentation:

In some cases, sinks may make copies of incoming data for their own safety; by calling CSV.read(file, DataFrame), no copies of the parsed CSV.File will be made, and the DataFrame will take direct ownership of the CSV.File's columns, which is more efficient than doing CSV.File(file) |> DataFrame which will result in an extra copy of each column being made.

so you could try

using CSV, DataFrames

df = CSV.read("file.csv", DataFrame)
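If you want to measure the difference between the two approaches on your own machine, a quick sketch (timings will vary; run each line twice, since the first run includes compilation time):

using CSV, DataFrames

@time df1 = CSV.File("file.csv") |> DataFrame  # makes an extra copy of each column
@time df2 = CSV.read("file.csv", DataFrame)    # DataFrame takes ownership of the parsed columns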
Fetish answered 18/2, 2023 at 17:55
