Time taken to read a large CSV file in Julia
I have a large CSV file: almost 28 million rows and 57 columns, 8.21 GB in total. The data is of mixed types - integers, strings, floats - but nothing unusual.

When I load it in Python/pandas, it takes 161 seconds using the following code.

import pandas as pd

df = pd.read_csv("file.csv", header=0, low_memory=False)

In Julia, it takes a lot longer - over an hour. UPDATE: I am not sure why, but when I ran the code this morning (twice, to check), it took around 702 and 681 seconds. This is much better than an hour, but it is still way slower than Python.

My Julia code is also pretty simple:

using CSV, DataFrames

df = CSV.File("file.csv") |> DataFrame

Am I doing something wrong? Is there something I can do to speed it up? Or is this just the price you pay to play with Julia?

Tetrad asked 20/8, 2020 at 11:42

Comments (7):
This is unexpected - it should be much faster than 1 hour. What version of Julia and CSV.jl are you on, and how many threads are you using? – Overdraw
Julia 1.5.0, CSV v0.7.7, DataFrames v0.21.6 – Tetrad
Can you share the file at all, even if just privately with me? (I'm the primary CSV.jl author.) Happy to take a look and see what might be causing the slowdown. Are you starting Julia with multiple threads? You can do this like julia -t 8, which will result in CSV.jl using 8 threads by default to parse the file. – Hobble
Thanks - increasing the number of threads to 6 (my machine has 6 cores) reduced the read time to 228 seconds. I am sorry, but I cannot share the file. Thanks again for the suggestion, most helpful. – Tetrad
How much RAM do you have? I can load a 13 GB file in about 80 s. – Darmit
I have 64 GB - that should be plenty. – Tetrad
Can you share the file after making all the data random but keeping the same types? – Jemina
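For reference, a minimal sketch of the threaded setup suggested in the comments above. This assumes a reasonably recent CSV.jl, where multithreaded parsing kicks in automatically when Julia itself is started with more than one thread; the file name is a placeholder:

# Start Julia with multiple threads from the shell, e.g.:  julia -t 6
using CSV, DataFrames

@show Threads.nthreads()  # the thread count CSV.jl will pick up by default

# With >1 thread available, CSV.jl parses large files in parallel automatically
df = CSV.read("file.csv", DataFrame)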
From the CSV.jl documentation:

In some cases, sinks may make copies of incoming data for their own safety; by calling CSV.read(file, DataFrame), no copies of the parsed CSV.File will be made, and the DataFrame will take direct ownership of the CSV.File's columns, which is more efficient than doing CSV.File(file) |> DataFrame which will result in an extra copy of each column being made.

so you could try

using CSV, DataFrames

df = CSV.read("file.csv", DataFrame)
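If you want to measure the difference between the two approaches on your own machine, a quick sketch (timings will vary; run each line twice, since the first run includes compilation time):

using CSV, DataFrames

@time df1 = CSV.File("file.csv") |> DataFrame  # makes an extra copy of each column
@time df2 = CSV.read("file.csv", DataFrame)    # DataFrame takes ownership of the parsed columns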
Fetish answered 18/2, 2023 at 17:55
