Is there a faster way than fread() to read big data?
Hi, first of all, I have already searched on Stack Overflow and Google and found posts such as this one: Quickly reading very large tables as dataframes. While those are helpful and well answered, I'm looking for more information.

I am looking for the best way to read/import "big" data that can go up to 50-60 GB. I am currently using the fread() function from data.table, which is the fastest function I know of at the moment. The PC/server I work on has a good CPU (workstation) and 32 GB of RAM, but files over 10 GB, sometimes with close to a billion observations, still take a long time to read.

We already have SQL databases, but for various reasons we have to work with the big data in R. Is there a way to speed up R, or an even better option than fread(), when it comes to huge files like this?

Thank you.

Edit: output of fread("data.txt", verbose = TRUE):

omp_get_max_threads() = 2
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 2 threads (omp_get_max_threads()=2, nth=2)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file C://somefolder/data.txt
  File opened, size = 1.083GB (1163081280 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<ID,Dat,No,MX,NOM_TX>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 5 fields using quote rule 0
  Detected 5 columns on line 1. This line is either column names or first data row. Line starts as: <<ID,Dat,No,MX,NOM_TX>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 5
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (1163081278 bytes from row 1 to eof) / (2 * 5778 jump0size) == 100647
  Type codes (jump 000)    : 5A5AA  Quote rule 0
  Type codes (jump 100)    : 5A5AA  Quote rule 0
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 10054 sample rows
  =====
  Sampled 10054 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 1163081249
  Line length: mean=56.72 sd=20.65 min=25 max=128
  Estimated number of rows: 1163081249 / 56.72 = 20506811
  Initial alloc = 41013622 rows (20506811 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 5A5AA
[10] Allocate memory for the datatable
  Allocating 5 column slots (5 - 0 dropped) with 41013622 rows
[11] Read the data
  jumps=[0..1110), chunk_size=1047820, total_size=1163081249
|--------------------------------------------------|
|==================================================|
Read 20935277 rows x 5 columns from 1.083GB (1163081280 bytes) file in 00:31.484 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : int32     '5'
         3 : string    'A'
=============================
   0.007s (  0%) Memory map 1.083GB file
   0.739s (  2%) sep=',' ncol=5 and header detection
   0.001s (  0%) Column type detection using 10054 sample rows
   1.809s (  6%) Allocation of 41013622 rows x 5 cols (1.222GB) of which 20935277 ( 51%) rows used
  28.928s ( 92%) Reading 1110 chunks (0 swept) of 0.999MB (each chunk 18860 rows) using 2 threads
   +   26.253s ( 83%) Parse to row-major thread buffers (grown 0 times)
   +    2.639s (  8%) Transpose
   +    0.035s (  0%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  31.484s        Total
Somatist answered 31/5, 2019 at 14:15 Comment(7)
Do you really need all the data in R? I recommend transforming, filtering, or creating subsets beforehand using e.g. awk, sed and/or cat in a Unix environment. Another approach would be to read chunks of the data using furrr::future_map to parallelize. – Tradein
...or since you already have your data in an SQL DB, just connect to that and pull in subsamples to work with. – Vallo
If you know the dimensions of your data set in advance, you could preallocate the needed space and write an Rcpp function yourself (for the import); it should be a little bit faster (but do not expect a big improvement). – Mcmath
@Jimbou Thanks, I'll take a look at furrr::future_map. @Vallo It is impractical, but I can't connect directly to the SQL DB; that is why I'm asking this here. @Mcmath Thank you, I already tried that but it didn't make it much faster! – Somatist
You mention that your workstation has a good CPU and 32 GB of memory, but you do not state anything about the storage subsystem, whether it is SSD or HDD. An SSD would be much better than an HDD, of course. Even faster than most SSDs would be Intel Optane memory. Given the size of the dataset you are working with, I would up the system memory to 64 GB. – Currajong
@BruceSchardt The data are stored on an HDD, but indeed an SSD would be better. The CPU is an E5-2699A v4. Also, yes, we asked IT to upgrade to 64 GB of RAM, still waiting for it though. Thank you! – Somatist
vroom is much faster than fread for many files, with the caveat that it doesn't actually read all the data into memory. – Fabiano
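
A minimal sketch of the vroom approach mentioned in the last comment, assuming the comma delimiter and column names shown in the verbose output above (vroom only indexes the file up front; values are materialized lazily when first accessed):

library(vroom)

# Index the file lazily; ID and MX are two of the columns named in the header.
dt <- vroom("data.txt", delim = ",", col_select = c(ID, MX))

head(dt)  # touching the data triggers the actual (lazy) reads
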
You can use the select argument to load only the relevant columns, without saturating your memory. For example:

dt <- fread("./file.csv", select = c("column1", "column2", "column3"))

I used read.delim() to read a file that fread() could not load completely. So you could convert your data into .txt and use read.delim().
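
For completeness, a base-R fallback along those lines (the path is taken from the verbose output above; sep is set explicitly because read.delim defaults to tab-separated input):

# Base-R reader; generally slower than fread but sometimes more forgiving.
df <- read.delim("C://somefolder/data.txt", sep = ",", stringsAsFactors = FALSE)
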

However, why don't you open a connection to the SQL server you're pulling your data from? You can open connections to SQL servers with library(odbc) and write your query as you normally would. You can optimize your memory usage that way.

Check out this short introduction to odbc.
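
For reference, a minimal sketch of such a connection (the driver name, server, credentials, and table name are placeholders; the selected columns come from the file's header):

library(DBI)
library(odbc)

# Connection details below are placeholders for your own server.
con <- dbConnect(odbc::odbc(),
                 Driver   = "SQL Server",
                 Server   = "my-server",
                 Database = "my_database",
                 UID      = "my_user",
                 PWD      = "my_password")

# Let the database do the filtering and return only what you need.
dt <- dbGetQuery(con, "SELECT ID, MX FROM my_table WHERE Dat >= '2019-01-01'")

dbDisconnect(con)
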

Nutritious answered 31/5, 2019 at 14:42 Comment(5)
Yes, I already used odbc. I actually have not used select =; I should have thought of that. Also, like I said in the comments, I can't directly connect to the SQL servers (which is why I am asking this question). I know it is really impractical, but I have to do it in R. I will accept your answer since it seems fread() is still the fastest option inside R if I can't connect to the SQL servers directly. Thanks! – Somatist
Glad I could help. Alternatively, you can try loading it with pandas in Python. Personally, I think data.table is the best package in terms of syntax, but pandas reads files very quickly. It also has a usecols argument. For example: pd.read_csv("./file.csv", usecols = ["column1", "column2"]). – Nutritious
Good idea, I don't know much about pandas, but it's worth trying. I'll try to run a benchmark between the two. Thanks again! – Somatist
I regularly use fread on 50 GB files and it works very well. Why do you think it will not work well at this size? It was designed for such big datasets, and even bigger. Pandas won't help; it cannot load a 50 GB CSV on a 128 GB machine due to pandas' high memory usage. Alternatively, Python's datatable might help there. – Beautify
@Jangorecki I did not know data.table was made for such big datasets. I said so because I had problems loading a file recently; data.table could not load all the rows. – Nutritious
Assuming you want your file fully read into R, using a database or choosing a subset of columns/rows won't be very helpful.

What can be helpful in such a case is to:
- ensure that you are using a recent version of data.table
- ensure that the optimal number of threads is set: use setDTthreads(0L) to use all available threads; by default data.table uses 50% of the available threads (see the sketch after this list)
- check the output of fread(..., verbose=TRUE), and possibly add it to your question here
- put your file on a fast disk or a RAM disk, and read from there
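
A minimal sketch of the thread check and override (assuming a recent data.table; "data.txt" is the file from the question):

library(data.table)

# Report the threads data.table detects and is currently allowed to use.
getDTthreads(verbose = TRUE)

# Use all logical cores available to this session (the default is 50% of them).
setDTthreads(0L)

# Re-read and compare the timings in the verbose output.
dt <- fread("data.txt", verbose = TRUE)
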

If your data has a lot of distinct character values, you might not be able to get great speed, because populating R's internal global character cache is single-threaded; parsing can go fast, but creating the character vector(s) will be the bottleneck.

Beautify answered 1/6, 2019 at 5:36 Comment(8)
Thank you, I will look into this! – Somatist
Quick question: what is more likely to be the bottleneck while reading big data, the CPU or the RAM? Also, data.table was already using all the threads available. – Somatist
@Somatist I would say disk read speed; try measuring your drive's read speed with some external tool and compare it to fread's speed (a rough check is sketched after these comments). The CPU might definitely be an issue if there are many distinct character values, as mentioned in the answer. Also, add the verbose output to your question. – Beautify
I added the verbose output for a 1 GB file which is part of a 50 GB one. It also seems like I cannot use more than 2 threads; it doesn't look like a data.table issue though. – Somatist
@Somatist It looks like the easiest way to speed up is to use more cores. On a workstation machine there should be more than 2 threads. More details on cores can be obtained from getDTthreads(verbose=TRUE). – Beautify
jangorecki, here is the output: omp_get_max_threads() = 2, omp_get_thread_limit() = 2147483647, DTthreads = 0, RestoreAfterFork = true. Also, according to Intel, the CPU should have 22 cores / 44 threads: ark.intel.com/content/www/us/en/ark/products/91317/… . Also, this CPU is on a remote PC which I access from my job's PC. I know that's supposed to add some lag, but it should not change fread() speed. – Somatist
How do you access the machine? SSH? Just check from the command line how many threads there should be. getDTthreads reports only 2. There might be a configuration on the server that allocates at most 2 threads to each user. – Beautify
Yes, SSH. I think you are right about the allocated threads; it seems like my coworker also only has access to 2 threads. I'll try to talk to IT about that. Thank you jangorecki, your help is really appreciated! – Somatist
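
As for checking the drive itself, a rough sketch of measuring raw read throughput from within R (path taken from the verbose output above; note that the OS page cache can inflate the number on a repeated run):

f <- "C://somefolder/data.txt"
# Time one raw pass over the file without any parsing.
elapsed <- system.time(readBin(f, what = "raw", n = file.size(f)))[["elapsed"]]
file.size(f) / elapsed / 1024^2  # approximate read throughput in MB/s
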