Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?) [duplicate]

I have some very big delimited data files and I want to process only certain columns in R without taking the time and memory to create a data.frame for the whole file.

The only options I know of are read.table which is very wasteful when I only want a couple of columns or scan which seems too low level for what I want.

Is there a better option, either with pure R or perhaps calling out to some other shell script to do the column extraction and then using scan or read.table on its output? (Which leads to the question: how do you call a shell script and capture its output in R?)
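For the parenthetical sub-question, a minimal sketch: system() with intern = TRUE captures a command's standard output as a character vector, and pipe() gives a connection that read.table() and scan() accept directly.

```r
# Capture a shell command's stdout as a character vector.
out <- system("echo 1 2 3", intern = TRUE)

# Or hand read.table() a pipe connection, so it parses the output directly.
df <- read.table(pipe("echo 1 2 3"))
```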

Pegeen answered 3/2, 2010 at 17:2 Comment(2)
A whole set of useful answers here. Any one of them would be helpful for a given context. The accepted one was simply the closest to my actual case and included a code snippet. (I could just as well have picked Dirk, but it looks like he has plenty of reputation already ;-) )Pegeen
Best answer is in new question https://mcmap.net/q/143826/-only-read-selected-columns/168747Tantamount

Sometimes I do something like this when I have the data in a tab-delimited file:

df <- read.table(pipe("cut -f1,5,28 myFile.txt"))

That lets cut do the data selection, which it can do without using much memory at all.

See Only read limited number of columns for a pure R version, using "NULL" in the colClasses argument to read.table.
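A sketch of the same pipe idea with awk, whose default field splitting on runs of whitespace copes with the irregular delimiters that `cut -f` cannot (the small file below is written only for illustration):

```r
# Write a stand-in file with irregular spacing between fields.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a   1  x", "b 2   y"), tmp)

# awk splits on runs of whitespace by default; print only fields 1 and 3.
df <- read.table(pipe(sprintf("awk '{print $1, $3}' %s", tmp)))
```

Building the command with sprintf() also covers the case where the file name lives in an R variable rather than being a fixed string.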

Faulty answered 3/2, 2010 at 17:57 Comment(7)
Your first example is pretty much exactly what I have ended up using (awk instead of cut in my case, because of an irregular delimited file format). Your second example isn't truly equivalent as I understand it. Isn't it going to create the whole data.frame only to throw it away again? When I want 2 of 10 columns from a million-row file, that is a big difference in performance.Pegeen
No, the pure R equivalent would be something like (assuming 28 columns) mycols <- rep(NULL, 28); mycols[c(1,5,28)] <- NA; df <- read.table(file, colClasses=mycols)Sparrow
@DirkEddelbuettel I just chanced upon this. It does appear that NULL needs to be in quotes.Downwind
I like the solution provided by @DirkEddelbuettel, however, as for @RJ, I had to put NULL in quotes: mycols <- rep("NULL", 28)Otiliaotina
What if the name of the file is in an R variable (so that I cannot use a fixed "myFile.txt")?Syman
In the context set by @DirkEddelbuettel's comment you can do the same with read.csv as well. df <- read.csv(file, colClasses=mycols)Striated
I am trying to use read.table to read only the columns whose names are also included in a vector labels <- c("a", "b", "c", ...). My issue is that I am reading more than one .txt file, and each file has only some of the labels in the vector. Would there be a way to use read.table and %in% to read only the labels that match the ones in each .txt file?Pyrrhuloxia

One possibility is to use pipe() in lieu of the filename and have awk or similar filters extract only the columns you want.

See help(connection) for more on pipe and friends.

Edit: read.table() can also do this for you if you are very explicit about colClasses -- a value of "NULL" (the character string, not the NULL object) for a given column skips the column altogether. See help(read.table). So there we have a solution in base R without additional packages or tools.
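A minimal sketch of the colClasses route (the small four-column file is written just for illustration):

```r
# Write a stand-in tab-delimited file with four columns.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1\tfoo\t2.5\tskip", "2\tbar\t3.5\tskip"), tmp)

keep <- rep("NULL", 4)   # drop every column by default (note the string "NULL")
keep[c(1, 3)] <- NA      # NA lets read.table() guess these columns' types
df <- read.table(tmp, sep = "\t", colClasses = keep)
```

Only the two requested columns are ever stored, so memory use scales with the columns you keep, not with the width of the file.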

Sparrow answered 3/2, 2010 at 17:9 Comment(0)

I think Dirk's approach is straightforward as well as fast. An alternative that I've used is to load the data into SQLite, which loads MUCH faster than read.table(), and then pull out only what you want. The sqldf package makes this all quite easy. Here's a link to a previous Stack Overflow answer that gives code examples for sqldf.
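A hedged sketch of that route, assuming the sqldf package is installed: read.csv.sql() loads the file into a temporary SQLite database and returns only the columns named in the SQL, without materialising the full data.frame in R.

```r
library(sqldf)

# Write a stand-in CSV just for illustration.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[1:3], c = 4:6), tmp, row.names = FALSE)

# "file" is sqldf's name for the table built from the input file.
df <- read.csv.sql(tmp, sql = "select a, c from file")
```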

Attending answered 3/2, 2010 at 17:16 Comment(0)

This is probably more than you need, but if you're operating on very large data sets then you might also have a look at the HadoopStreaming package which provides a map-reduce routine using Hadoop.

Tishatishri answered 3/2, 2010 at 17:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.