I have a large CSV file: almost 28 million rows and 57 columns, 8.21 GB on disk. The data is a mix of types (integers, strings, floats), but nothing unusual.
When I load it in Python/pandas with the following code, it takes 161 seconds:

import pandas as pd

df = pd.read_csv("file.csv", header=0, low_memory=False)
In Julia, it takes considerably longer: over an hour. UPDATE: I am not sure why, but when I ran the code this morning (twice, to check), it took around 702 and 681 seconds. That is much better than an hour, but it is still way slower than Python.
My Julia code is also pretty simple:
using CSV, DataFrames

df = CSV.File("file.csv") |> DataFrame
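(Side note: if the intermediate CSV.File object is not needed, CSV.jl also offers a one-step form, CSV.read, which per its docs can avoid an extra copy of the parsed columns when handing them to the DataFrame sink. A minimal sketch:

using CSV, DataFrames

# One-step variant: parse straight into the DataFrame sink.
# Per the CSV.jl docs, CSV.read can avoid copying the parsed columns.
df = CSV.read("file.csv", DataFrame)

This should not change parsing speed by itself, but it saves the intermediate materialization.)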
Am I doing something wrong? Is there something I can do to speed it up? Or is this just the price you pay to play with Julia?
Start Julia with julia -t 8, which will result in CSV.jl using 8 threads by default to parse the file. – Hobble
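For reference, a minimal sketch of the multithreaded route, assuming the session was started with julia -t 8 (ntasks is CSV.jl's keyword for controlling the number of parsing tasks and defaults to the number of available threads):

using CSV, DataFrames

# Verify the -t flag took effect; this should print 8.
println(Threads.nthreads())

# CSV.jl parses large files with multiple tasks by default in a
# multithreaded session; ntasks can also be set explicitly.
df = CSV.read("file.csv", DataFrame; ntasks = Threads.nthreads())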