Write a gzip file from a data frame
I'm trying to write a data frame to a gzip file but am running into problems.

Here's my code example:

df1 <- data.frame(id = seq(1,10,1), var1 = runif(10), var2 = runif(10))

gz1 <- gzfile("df1.gz","w" )
writeLines(df1)

Error in writeLines(df1) : invalid 'text' argument

Any suggestions?

EDIT: an example line of the character vector I'm trying to write is:

0 | var1:1.5 var2:.55 var7:1250

The class label / y-variable is separated from the x-vars by a " | " and variable names are separated from values by " : " and spaces between variables.

EDIT2: I apologize for the wording / format of the question, but here are the results:

Old method:

system.time(write(out1, file="out1.txt"))
#    user  system elapsed 
#   9.772  17.205  86.860 

New Method:

writeGzFile <- function(){
  gz1 = gzfile("df1.gz","w");
  write(out1, gz1);
  close(gz1) 
}

system.time( writeGzFile())
#    user  system elapsed 
#   2.312   0.000   2.478 

Thank you all very much for helping me figure this out.

Hypersonic answered 8/1, 2013 at 23:2 Comment(7)
As is often asked on Rhelp: "What problem are you trying to solve".Catiline
Hint: the answer to @DWin's comment is not "How do I write a data frame to a gzip file?"Drake
The longer question would be "Is it faster to write a .txt file or a .gz file from R?"Hypersonic
That depends on how long your piece of string is. In computer terms, whether your CPU or I/O is the bottleneck. Writing a big file to a fast disk is quicker than computing a compressed form on a slow CPU.Drake
I was hoping to get an answer to the question "what purpose might there be in processing the R data object in a manner other than achieved by save"? Do you need it to be read by a program other than R?Catiline
Yes. Please see comment stream in Spacedman's answer.Hypersonic
The examples in ?readRDS helped me understand the compression and serialization that R does in readRDS and saveRDS.Balky
A
27

writeLines expects a character vector, not a data frame, which is why the call fails. The simplest way to write this to a gzip file would be

df1 <- data.frame(id = seq(1,10,1), var1 = runif(10), var2 = runif(10))
gz1 <- gzfile("df1.gz", "w")
write.csv(df1, gz1)
close(gz1)

This will write it as a gzipped csv. Also see write.table and write.csv2 for alternate ways of writing the file out.
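
As a minimal sketch of the write.table route (the file name df1.tsv.gz, the temp directory, and the tab separator are illustrative choices, not from the question), the same connection pattern applies, and the file can be read back through gzfile to check the round trip:

```r
# Illustrative data, same shape as in the question
df1 <- data.frame(id = 1:10, var1 = runif(10), var2 = runif(10))

# Hypothetical output path; a temp file keeps the sketch self-contained
path <- file.path(tempdir(), "df1.tsv.gz")

# Open a gzip connection for writing, write tab-separated text, then close it
gz1 <- gzfile(path, "w")
write.table(df1, gz1, sep = "\t", row.names = FALSE, quote = FALSE)
close(gz1)

# Read the file back through a gzip connection to confirm the round trip
df2 <- read.table(gzfile(path), header = TRUE, sep = "\t")
```

Closing the connection matters here: until close(gz1) runs, the compressed output may not be fully flushed to disk.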

EDIT: Based on the updates to the post about the desired format, I made the following helper (quickly thrown together, probably admits tons of simplification):

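# Build one "rowIndex | name : value name : value ..." string per row
myser <- function(df) {
    rowCount <- nrow(df)
    dfNames <- names(df)
    dfNamesIndex <- length(dfNames)
    sapply(seq_len(rowCount), function(rowIndex) {
        paste(rowIndex, '|', 
            # name, ':' and value for each column, joined by spaces
            paste(sapply(seq_len(dfNamesIndex), function(element) {
                c(dfNames[element], ':', df[rowIndex, element])
            }), collapse=' ')
        )
    })
}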

So the output looks like

a <- data.frame(x=1:10,y=rnorm(10))
writeLines(myser(a))
# 1 | x : 1 y : -0.231340933021948
# 2 | x : 2 y : 0.896777389870928
# 3 | x : 3 y : -0.434875004781075
# 4 | x : 4 y : -0.0269824962632977
# 5 | x : 5 y : 0.67654540494899
# 6 | x : 6 y : -1.96965253674725
# 7 | x : 7 y : 0.0863177759402661
# 8 | x : 8 y : -0.130116466571162
# 9 | x : 9 y : 0.418337557610229
# 10 | x : 10 y : -1.22890714891874

And all that is necessary is to pass the gzfile in to writeLines to get the desired output.
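
A rough sketch of that last step, written so it runs on its own: to_lines is a hypothetical, vectorized stand-in for the helper above (same output format, but built column by column rather than row by row), and the output path is an assumption for illustration.

```r
# Hypothetical vectorized stand-in for the helper above (same line format)
to_lines <- function(df) {
  # Build "name : value" for every column, then paste the columns row-wise
  fields <- lapply(names(df), function(nm) paste(nm, ":", df[[nm]]))
  paste(seq_len(nrow(df)), "|", do.call(paste, fields))
}

a <- data.frame(x = 1:10, y = rnorm(10))

# Hypothetical output path in the temp directory
path <- file.path(tempdir(), "a.txt.gz")

# Pass the gzip connection to writeLines, then close it to flush the file
gz <- gzfile(path, "w")
writeLines(to_lines(a), gz)
close(gz)

# Read the lines back through a gzip connection to inspect the result
con <- gzfile(path, "r")
lines <- readLines(con)
close(con)
head(lines, 3)
```

Working on whole columns avoids the per-row indexing of the original helper, which is likely to matter with a few hundred thousand rows, though I have not timed it.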

Anthology answered 8/1, 2013 at 23:9 Comment(2)
For people using VW, see also this answer for faster options than writeLines: https://mcmap.net/q/692145/-read-write-data-in-libsvm-formatWilds
for people wanting to write large data to files, fwrite (answer below) is much faster than write.csv.Rittenhouse
D
5

To write something to a gzip file you need to "serialize" it to text. For R objects you can have a stab at that by using dput:

gz1 = gzfile("df1.gz","w")
dput(df1, gz1)
close(gz1)

However you've just written a text representation of the data frame to the file. This will quite probably be less efficient than using save(df1,file="df1.RData") to save it to a native R data file. Ask yourself: why am I saving it as a .gz file?

In a quick test with some random numbers, the gz file was 54k and the .RData file was 34k.
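
A comparison along those lines can be sketched as follows. The row count, seed, and file paths are assumptions for illustration, and the sizes you get will depend on the data, so do not expect the numbers quoted above.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
df1 <- data.frame(id = 1:1000, var1 = runif(1000), var2 = runif(1000))

# Hypothetical output paths in the temp directory
gz_path    <- file.path(tempdir(), "df1.gz")
rdata_path <- file.path(tempdir(), "df1.RData")

# Text representation via dput, compressed through a gzip connection
gz1 <- gzfile(gz_path, "w")
dput(df1, gz1)
close(gz1)

# Native R format (save() compresses by default)
save(df1, file = rdata_path)

# Compare the two sizes in bytes
file.size(gz_path)
file.size(rdata_path)

# dget() reads the dput() text back; values match up to printed precision
con <- gzfile(gz_path, "r")
df2 <- dget(con)
close(con)
```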

Drake answered 8/1, 2013 at 23:10 Comment(5)
Thank you. The reason I'm writing to .gz is that the output is an input file for another program that reads .gz files. In other words it's leaving the R ecosystem. Otherwise I'd use .RData.Hypersonic
So just gzip the .RData file? No, that won't work, because gzip is a compression that tells you nothing about the format of the data in the file when uncompressed. Is it a gzipped CSV file, a gzipped NetCDF file, a gzipped RData file? You haven't told us.Drake
Sorry, I'm using it as an input file for a program called vowpal wabbit. It has some weird delimiting using '|', ':' and ' '.Hypersonic
We're getting closer to the real question. Want to edit yours to say more of what it is you are wanting to do? It seems the other answer (write.csv) could be better. But that's guesswork.Drake
I'm currently using 'write(df1, file = "df1.txt")'. But it's taking a long time to run (It's ~200k rows). I was curious if using .gz would be faster, but couldn't get R to write a .gz file, which is the reason for the question.Hypersonic
K
5

Another very simple way to do it is:

# We create the .csv file
write.csv(df1, "df1.csv")

# We compress it deleting the .csv
system("gzip df1.csv")

Got the idea from: http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html

Kacykaczer answered 3/1, 2017 at 16:59 Comment(0)
V
2

With the tidyverse (readr) writers, adding a compression extension to the file name is enough to compress the output. From https://readr.tidyverse.org/reference/write_delim.html

The write_*() functions will automatically compress outputs if an appropriate extension is given. At present, three extensions are supported, .gz for gzip compression, .bz2 for bzip2 compression and .xz for lzma compression.

library(readr)
# A plain data.frame is used here; data.table() is not part of the tidyverse
df <- data.frame(var1 = 'Compress me', var2 = ', please!')
write_csv(df, "filename.csv.gz")
Valerianaceous answered 22/9, 2020 at 14:8 Comment(0)
D
2

This works out of the box with data.table's fwrite function:

df1 <- data.frame(id = seq(1,10,1), var1 = runif(10), var2 = runif(10))
data.table::fwrite(df1, file = "df1.csv.gz")
Dumuzi answered 25/7, 2022 at 13:18 Comment(1)
fwrite is much faster than write.csv for large data/files.Rittenhouse
R
1

You can use the gzip function in R.utils:

library(R.utils)
library(data.table)

# Write gzip file
df <- data.table(var1='Compress me',var2=', please!')
fwrite(df,'filename.csv',sep=',')
gzip('filename.csv',destname='filename.csv.gz')

# Read gzip file (cmd= marks the string as a shell command; needs gzip on the PATH)
fread(cmd='gzip -dc filename.csv.gz')
#           var1      var2
# 1: Compress me , please!
Roaster answered 23/5, 2018 at 2:31 Comment(0)
