How to load a big CSV file with mixed-type columns using the bigmemory package

Is there a way to combine the use of scan() and read.big.matrix() from the bigmemory package to read in a 200 MB .csv file with mixed-type columns so that the result is a dataframe with integer, character, and numeric columns?

Quiver asked 7/8, 2011 at 4:29 Comment(2)
Does it have to be the bigmemory package? I find ff much more useful for this sort of stuff. – Blasphemous
@Blasphemous is on the right track. Does it even need to be file-backed? For 200 MB, I'd just read it in, work with it, then save it as one or more big.matrix files (or in ff, if you wish). – Berglund

Try the ff package for this.

library(ff)
help(read.table.ffdf)

Function ‘read.table.ffdf’ reads separated flat files into ‘ffdf’ objects, very much like (and using) ‘read.table’. It can also work with any convenience wrappers like ‘read.csv’ and provides its own convenience wrapper (e.g. ‘read.csv.ffdf’) for R's usual wrappers.

For 200 MB it should be as simple as this.

x <- read.csv.ffdf(file = csvfile)

(For much bigger files you will likely need to investigate some of the configuration options, depending on your machine and OS; see the sketch below.)
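As a hedged illustration of those options, here is a sketch; the file name "mydata.csv", the column order, and the chunk sizes are assumptions, not details from the question. Note that ffdf objects store character data as factors.

library(ff)

# Sketch only: file name and column types are assumed for illustration.
x <- read.csv.ffdf(file = "mydata.csv",
                   colClasses = c("integer", "factor", "numeric"),
                   first.rows = 10000,  # rows parsed in the initial chunk
                   next.rows = 50000)   # chunk size for subsequent reads

dim(x)           # inspect dimensions without loading the data into RAM
df <- x[1:100, ] # subscripting an ffdf returns an ordinary data.frame slice

Subscripting like this is how you pull manageable portions into RAM for analysis, instead of coercing the whole object with as.data.frame.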

Blasphemous answered 7/8, 2011 at 12:15 Comment(2)
Thank you, mdsumner. I tried the ff package and was able to read in the almost 300 MB dataset, which I stored in an object and later coerced into a data frame with as.data.frame. However, this ate up so much memory that there was little left for analysis. It was a good start, though, and a helpful suggestion. – Quiver
The entire point is not to load it all in but to use the memory-mapped features of the ff package. There are tools to extract portions from the ff data structures. – Blasphemous

Ah, there are some things that are impossible in this life, and there are some that are misunderstood and lead to unpleasant situations. @Roman is right: a matrix must be of one atomic type. It's not a dataframe.

Since a matrix must be of one type, attempting to snooker bigmemory to handle multiple types is, in itself, a bad thing. Could it be done? I'm not going there. Why? Because everything else will assume that it's getting a matrix, not a dataframe. That will lead to more questions and more sorrow.

Now, what you can do is identify the type of each column and generate a set of distinct bigmemory files, each containing the items of a particular type, e.g. charBM = character big matrix, intBM = integer big matrix, and so on. Then you may be able to develop a wrapper that produces a data frame out of all of this. Still, I don't recommend it: treat the different items as what they are, or coerce homogeneity if you can, rather than try to produce a big data-frame griffin.
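For illustration only, a minimal sketch of that split-by-type idea; the file name, the column-type detection, and the backing-file names are all assumptions, and character columns go in as integer factor codes with their level tables kept on the side:

library(bigmemory)

# Sketch: at 200 MB the file fits in RAM, so read it once, then split by type.
df <- read.csv("mydata.csv", stringsAsFactors = TRUE)

int.cols <- sapply(df, is.integer)
num.cols <- sapply(df, is.double)
fac.cols <- sapply(df, is.factor)

intBM <- as.big.matrix(as.matrix(df[int.cols]), type = "integer",
                       backingfile = "int.bin", descriptorfile = "int.desc")
numBM <- as.big.matrix(as.matrix(df[num.cols]), type = "double",
                       backingfile = "num.bin", descriptorfile = "num.desc")

# A big.matrix cannot hold strings: store factor codes as integers and
# keep the level tables separately to decode them later.
chrBM <- as.big.matrix(sapply(df[fac.cols], as.integer), type = "integer",
                       backingfile = "chr.bin", descriptorfile = "chr.desc")
chr.levels <- lapply(df[fac.cols], levels)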

@mdsumner is correct in suggesting ff. Another storage option is HDF5, which you can access through ncdf4 in R. Unfortunately, these other packages are not as pleasant as bigmemory.

Berglund answered 7/8, 2011 at 12:21 Comment(1)
Thanks, Iterator. You are right: the other packages are not as pleasant as bigmemory. – Quiver

According to the help file, no.

Files must contain only one atomic type (all integer, for example). You, the user, should know whether your file has row and/or column names, and various combinations of options should be helpful in obtaining the desired behavior.

I'm not familiar with this package/function, but in R, matrices can have only one atomic type (unlike e.g. data.frames).
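To make the single-type constraint concrete, here is a hedged sketch of what read.big.matrix does handle; "numeric_only.csv" is a hypothetical all-numeric file, not one from the question:

library(bigmemory)

# Sketch only: every column of the file must share one atomic type.
m <- read.big.matrix("numeric_only.csv", sep = ",", header = TRUE,
                     type = "double",
                     backingfile = "numeric_only.bin",
                     descriptorfile = "numeric_only.desc")
m[1:5, ]  # a slice comes back as an ordinary numeric matrix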

Sluice answered 7/8, 2011 at 6:37 Comment(2)
Thanks for your two cents. On this blog, joshpaulson.wordpress.com/2010/12/20/michael-kane-on-bigmemory, someone suggested that a workaround to the limitation of matrices having only one atomic type (a characteristic inherited by big.matrix) is to use scan(). I was hoping someone could share their experiences with read.big.matrix from the bigmemory package, especially with regard to reading in mixed-type columns and whether they have used scan(). – Quiver
Maybe you can do that in the processing stage, but I would like to be proven wrong (sensu @Iterator). – Melda

The best solution is to read the file line by line (or in chunks) and parse it as you go; that way the reading process keeps memory use low and roughly constant regardless of file size. A sketch follows.
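A minimal sketch of that idea with scan() in fixed-size chunks; the file name, the header line, and the id/label/value column layout are assumptions matching the question's integer/character/numeric mix:

con <- file("mydata.csv", open = "r")
invisible(readLines(con, n = 1))  # skip the header line

repeat {
  # Passing a list of templates to 'what' makes scan() return one vector per column
  chunk <- scan(con, what = list(id = integer(),
                                 label = character(),
                                 value = numeric()),
                sep = ",", nlines = 10000, quiet = TRUE)
  if (length(chunk[[1]]) == 0) break
  df <- as.data.frame(chunk, stringsAsFactors = FALSE)
  # ... process df here; it is discarded before the next chunk is read ...
}
close(con)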

Buddle answered 23/3, 2013 at 14:13 Comment(1)
Welcome to StackOverflow! However, this does not answer the question, which was specifically aimed at the bigmemory package. – Restore
