Most efficient way of exporting large (3.9 million obs) data.frames to a text file? [duplicate]

I have a fairly large data frame in R that I would like to export to SPSS. This file caused me hours of headaches when importing it into R in the first place, but I eventually succeeded using read.fwf() with the options comment.char = "%" (a character not appearing in the file) and fill = TRUE (it was a fixed-width ASCII file in which some rows lacked all variables, causing error messages).
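
For reference, a minimal sketch of an import call along those lines (the file name and field widths below are placeholders, not values from my actual file):

# Hypothetical file name and widths; the real file has 48 variables.
df <- read.fwf("big_fixed_width.txt",
               widths       = c(10, 8, 8, 5),   # placeholder field widths
               colClasses   = "character",      # keep everything as character
               comment.char = "%",              # a character absent from the file
               fill         = TRUE)             # pad rows missing trailing fields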

Anyway, my data frame currently consists of 3.9 million observations and 48 variables (all character). I can write it to file fairly quickly by splitting it into four sets of 1 million observations with df2 <- df[1:1000000,] followed by write.table(df2), etc., but I can't write the entire file in one sweep without the computer locking up and needing a hard reset to come back up.

After hearing anecdotal stories for years about how R is unsuited for large datasets, this is the first time I have actually encountered a problem of this kind. I wonder whether there are other approaches (low-level "dumping" of the file directly to disk?) or whether there is some package unknown to me that can handle export of large files of this type efficiently?

Talos answered 14/3, 2012 at 13:36 Comment(0)

At a guess, your machine is short on RAM, so R is having to use the swap file, which slows things down. If you are being paid to code, then buying more RAM will probably be cheaper than the time you would spend writing new code.

That said, there are some possibilities. You can export the file to a database and then use that database's facilities for writing to a text file. JD Long's answer to this question tells you how to read in files this way; it shouldn't be too difficult to reverse the process. Alternatively, the bigmemory and ff packages (as mentioned by Davy) could be used for writing such files.
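
For example, a hedged sketch of the database route, using SQLite via DBI/RSQLite as one possible backend (the file and table names are placeholders; any database with a text-export facility would do):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "staging.sqlite")
dbWriteTable(con, "big_table", df, overwrite = TRUE)  # stage the data frame on disk
dbDisconnect(con)
# The text export itself can then be done outside R, e.g. with the sqlite3
# command-line shell: .mode csv / .output big_table.csv / SELECT * FROM big_table;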

Electromagnet answered 14/3, 2012 at 14:57 Comment(1)
Hi Richie, I'm not sure whether 8 GB of RAM qualifies as "short on RAM", even with this dataset. However, I'll look into using sqldf() as suggested by JD Long, since I'm using it a lot in my analyses. Thanks for the pointer! – Talos

1) If your data frame is all character strings, then write.table() saves it much faster if you first convert it to a matrix.

2) Also write it out in chunks of, say, 1,000,000 rows, but always to the same file, using the argument append = TRUE; see the sketch below.
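
A minimal sketch combining both suggestions (the object and file names, separator and chunk size are placeholders, not from the original answer):

m      <- as.matrix(df)                  # all-character data.frame -> character matrix
chunk  <- 1000000
starts <- seq(1, nrow(m), by = chunk)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk - 1, nrow(m))
  write.table(m[rows, , drop = FALSE], "out.txt",
              append    = (i > 1),       # first chunk creates the file, the rest append
              col.names = (i == 1),      # write the header only once
              row.names = FALSE,
              quote     = FALSE,
              sep       = "\t")
}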

Gaunt answered 15/3, 2012 at 6:58 Comment(4)
Really clever solution. Won't work for data.frames where the variables are different types, but definitely a good fix here! – Vadavaden
hehe, I had to do the same thing with data of nearly the same dimensions: you wouldn't happen to be working with US birth or death microdata, would you? – Gaunt
@tim riffe: No, but sort of; these are cow birth and calving data :) – Talos
write.csv ran overnight for me and still didn't finish; converting to a matrix and using write.table took seconds. – Yulandayule

Update

After extensive work by Matt Dowle parallelizing and adding other efficiency improvements, fwrite is now as much as 15x faster than write.csv. See the linked answer for more.


Now data.table has an fwrite function, contributed by Otto Seiskari, which seems to be about twice as fast as write.csv in general. See here for some benchmarks.

library(data.table) 
fwrite(DF, "output.csv")

Note that row names are excluded, since the data.table type makes no use of them.

Chinookan answered 8/4, 2016 at 4:42 Comment(0)

Though I only use it to read very large files (10+ GB), I believe the ff package also has functions for writing extremely large data frames.
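
A hedged sketch of what that could look like with ff's write.csv.ffdf (object and file names are placeholders; note that ff stores character data as factors, so all-character columns may need converting first):

library(ff)
fdf <- as.ffdf(df)                         # disk-backed copy of the data.frame
write.csv.ffdf(fdf, file = "big_out.csv")  # writes out in chunks rather than all at once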

Vadavaden answered 14/3, 2012 at 14:40 Comment(1)
I tried my luck with ff() but was perplexed by the syntax used. Couldn't quite wrap my head around it, and trying it on subsets of the original data set didn't give me much gain time-wise. Thanks anyway. – Talos

Well, as the answer with really large files and R often is, it's best to offload this kind of work to a database. SPSS has ODBC connectivity, and the RODBC package provides an interface from R to SQL.
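
A hedged sketch of that route with RODBC's sqlSave (the DSN and table name are placeholders, not from the original answer):

library(RODBC)
ch <- odbcConnect("my_dsn")              # DSN configured in the system's ODBC manager
sqlSave(ch, df, tablename = "big_table",
        rownames = FALSE, fast = TRUE)   # bulk-insert the data frame as a table
odbcClose(ch)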

I note that, in the process of checking my information, I have been scooped.

Julietjulieta answered 14/3, 2012 at 15:1 Comment(0)
