Import text file using ff package

Asked 31/1, 2014 at 20:27 Answered 11/2, 2015 at 14:45

I have a textfile of 4.5 million rows and 90 columns to import into R. Using read.table I get the cannot allocate vector of size... error message so am trying to import using the ff package before subsetting the data to extract the observations which interest me (see my previous question for more details: Add selection crteria to read.table).

So, I use the following code to import:

test<-read.csv2.ffdf("FD_INDCVIZC_2010.txt", header=T)

but this returns the following error message :

Error in read.table.ffdf(FUN = "read.csv2", ...) : 
only ffdf objects can be used for appending (and skipping the first.row chunk)

What am I doing wrong?

Here are the first 5 rows of the text file:

    CANTVILLE.NUMMI.AEMMR.AGED.AGER20.AGEREV.AGEREVQ.ANAI.ANEMR.APAF.ARM.ASCEN.BAIN.BATI.CATIRIS.CATL.CATPC.CHAU.CHFL.CHOS.CLIM.CMBL.COUPLE.CS1.CUIS.DEPT.DEROU.DIPL.DNAI.EAU.EGOUL.ELEC.EMPL.ETUD.GARL.HLML.ILETUD.ILT.IMMI.INAI.INATC.INFAM.INPER.INPERF.IPO ...
1             1601;1;8;052;54;051;050;1956;03;1;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;1;Z;16;Z;03;16;Z;Z;Z;21;2;2;2;Z;1;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;1;1;1;4;M;22;32;AZ;AZ;00;04;2;2;0;1;2;4;1;00;Z;54;2;ZZ;1;32;2;10;2;11;111;11;11;1;2;ZZZZZZ;1;2;1;4;41;2;Z
2             1601;1;8;012;14;011;010;1996;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
3             1601;1;8;006;05;005;005;2002;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
4            1601;1;8;047;54;046;045;1961;03;2;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;6;Z;16;Z;14;974;Z;Z;Z;16;2;2;2;Z;2;2;4;1;1;4;4;4,02306147485403;ZZZZZZZZZ;2;2;2;1;M;22;32;MN;GU;14;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;2;32;1;10;2;11;111;11;11;1;4;ZZZZZZ;1;2;1;4;41;2;Z
5             1601;2;9;053;54;052;050;1958;02;1;ZZZZZ;2;Z;Z;Z;1;0;Z;2;Z;Z;2;1;2;Z;16;Z;12;87;Z;Z;Z;22;2;1;2;Z;1;2;3;1;1;2;2;4,21707670353782;ZZZZZZZZZ;1;1;1;2;M;21;40;GZ;GU;00;07;0;0;0;0;0;2;1;00;Z;54;2;ZZ;1;30;2;10;3;11;111;ZZ;ZZ;1;1;ZZZZZZ;2;2;1;4;42;1;Z

Nunn answered 31/1, 2014 at 20:27 Comment(3)

The error shows that your csv file is not well formatted. Maybe you should skip some rows. If you under linux use head or awk to show first lines of your file and add them to the question. – Nee 31/1, 2014 at 20:49

Try read.table.ffdf(yourfile,sep=';',skip=1) – Nee 31/1, 2014 at 21:2

Tried this and get the same error message. – Nunn 31/1, 2014 at 21:8

I encountered a similar problem related to reading csv into ff objects. On using

read.csv2.ffdf(file = "FD_INDCVIZC_2010.txt")

instead of implicit call

read.csv2.ffdf("FD_INDCVIZC_2010.txt")

I got rid of the error. The explicitly passing values to the argument seems specific to ff functions.

Lolland answered 5/5, 2014 at 7:55 Comment(0)

You could try the following code:

read.csv2.ffdf("FD_INDCVIZC_2010.txt", 
           sep = "\t",
           VERBOSE = TRUE,
           first.rows = 100000,
           next.rows = 200000,
           header=T)

I am assuming that since its a txt file, its a tab-delimited file.

Sorry I came across the question just now. Using the VERBOSE option, you can actually see how much time your each block of data is taking to be read. Hope this helps.

Ingeringersoll answered 18/4, 2014 at 9:1 Comment(0)

If possible try to filter the data at the OS level, that is before they are loaded into R. The simplest way to do this in R is to use a combination of pipe and grep command:

textpipe <- pipe('grep XXXX file.name |')
mutable <- read.table(textpipe)

You can use grep, awk, sed and basically all the machinery of unix command tools to add the necessary selection criteria and edit the csv files before they are imported into R. This works very fast and by this procedure you can strip unnecessary data before R begins to read them from pipe.

This works well under Linux and Mac, perhaps you need to install cygwin to make this work under Windows or use some other windows-specific utils.

Ambivert answered 31/1, 2014 at 21:7 Comment(0)

perhaps you could try the following code:

read.table.ffdf(x = NULL, file = 'your/file/path', seq=';' )

Blood answered 11/2, 2015 at 14:45 Comment(0)

Recommended topics

Hot tags