Is there a way to combine the use of scan() and read.big.matrix() from the bigmemory package to read in a 200 MB .csv file with mixed-type columns so that the result is a dataframe with integer, character, and numeric columns?
Try the ff package for this.
library(ff)
help(read.table.ffdf)
Function ‘read.table.ffdf’ reads separated flat files into ‘ffdf’ objects, very much like (and using) ‘read.table’. It can also work with any convenience wrappers like ‘read.csv’ and provides its own convenience wrapper (e.g. ‘read.csv.ffdf’) for R's usual wrappers.
For 200 MB, it should be as simple as this:
x <- read.csv.ffdf(file=csvfile)
(For much bigger files, you will likely need to investigate some of the configuration options, depending on your machine and OS.)
Ah, there are some things that are impossible in this life, and there are some that are misunderstood and lead to unpleasant situations. @Roman is right: a matrix must be of one atomic type. It's not a dataframe.
Since a matrix must be of one type, attempting to snooker bigmemory to handle multiple types is, in itself, a bad thing. Could it be done? I'm not going there. Why? Because everything else will assume that it's getting a matrix, not a dataframe. That will lead to more questions and more sorrow.
Now, what you can do is identify the type of each column and generate a set of distinct bigmemory files, each containing the items of one particular type. E.g. charBM = character big matrix, intBM = integer big matrix, and so on. Then you may be able to develop a wrapper that produces a data frame out of all of this. Still, I don't recommend that: treat the different items as what they are, or coerce homogeneity if you can, rather than try to produce a big dataframe griffin.
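To make the "one matrix per type" idea concrete, here is a minimal sketch in base R. It uses ordinary in-memory matrices as hypothetical stand-ins for the per-type big matrices, and the data frame `df` and its column names are invented for illustration. It only shows the grouping step, not how the pieces would be stored on disk.

```r
# Small mixed-type data frame standing in for the parsed CSV (illustrative data).
df <- data.frame(id    = 1:3,
                 count = c(10L, 20L, 30L),
                 name  = c("a", "b", "c"),
                 score = c(0.5, 1.5, 2.5),
                 stringsAsFactors = FALSE)

# Group the column names by storage type ("integer", "character", "double").
types   <- sapply(df, typeof)
by_type <- split(names(df), types)

# Build one homogeneous matrix per type. These are plain matrices here;
# in the scheme described above each would be a separate big matrix.
mats <- lapply(by_type, function(cols) as.matrix(df[, cols, drop = FALSE]))
```

Each element of `mats` is then a matrix of a single atomic type, which is what matrix-based tools expect. Reassembling them into one data frame is the wrapper step that the answer advises against.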
@mdsumner is correct in suggesting ff. Another storage option is HDF5, which you can access through ncdf4 in R. Unfortunately, these other packages are not as pleasant as bigmemory.
According to the help file, no.
Files must contain only one atomic type (all integer, for example). You, the user, should know whether your file has row and/or column names, and various combinations of options should be helpful in obtaining the desired behavior.
I'm not familiar with this package/function, but in R, matrices can have only one atomic type (unlike e.g. data.frames).
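A quick base R illustration of that difference, with made-up values: combining integer and character data in a matrix silently coerces everything to character, while a data frame keeps one type per column.

```r
# A matrix holds a single atomic type, so mixing types coerces every cell.
m <- cbind(id = c(1L, 2L), name = c("a", "b"))
m_type <- typeof(m)   # the integers have become character strings

# A data frame stores each column separately, so the types survive.
df <- data.frame(id    = c(1L, 2L),
                 name  = c("a", "b"),
                 score = c(0.5, 1.5),
                 stringsAsFactors = FALSE)
col_types <- sapply(df, class)
```

Here `m_type` is `"character"`, whereas `col_types` reports `integer`, `character`, and `numeric` for the three columns.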
The best solution is to read the file line by line and parse it as you go; that way, the memory used by the reading process grows almost linearly.
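A rough sketch of that approach in base R, reading a few lines at a time from an open connection and parsing each chunk with explicit column types. The file contents, `chunk_size`, and `col_classes` are assumptions chosen for the demo; a real file would use a much larger chunk size, and this simple version assumes no quoted fields containing embedded newlines.

```r
# Write a small mixed-type CSV to a temporary file so the sketch is self-contained.
csvfile <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "1,alpha,0.5",
             "2,beta,1.5",
             "3,gamma,2.5",
             "4,delta,3.5",
             "5,epsilon,4.5"), csvfile)

col_classes <- c("integer", "character", "numeric")  # assumed known in advance
chunk_size  <- 2                                      # rows per chunk; tiny for the demo

con    <- file(csvfile, open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]   # first line holds the column names

chunks <- list()
repeat {
  lines <- readLines(con, n = chunk_size)             # continues where the last read stopped
  if (length(lines) == 0) break                       # end of file
  chunks[[length(chunks) + 1]] <- read.csv(text = lines, header = FALSE,
                                           col.names = header,
                                           colClasses = col_classes,
                                           stringsAsFactors = FALSE)
}
close(con)

# Combine the parsed chunks into one data frame with per-column types.
result <- do.call(rbind, chunks)
```

Collecting every chunk in a list, as done here, still ends with the whole data set in memory. If that is too much, each chunk could instead be summarised or written out before the next one is read.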