Only read selected columns
Asked Answered
M

5

167

Can anyone please tell me how to read only the first 6 months (7 columns) for each year of the data below, for example by using read.table()?

Year   Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec   
2009   -41  -27  -25  -31  -31  -39  -25  -15  -30  -27  -21  -25
2010   -41  -27  -25  -31  -31  -39  -25  -15  -30  -27  -21  -25 
2011   -21  -27   -2   -6  -10  -32  -13  -12  -27  -30  -38  -29
Mcguinness answered 26/4, 2011 at 8:58 Comment(3)
It's duplicate of Ways to read only select columns from a file into R?, Dirk mention about NULL as column class in his answer.Koine
stats.stackexchange.com/questions/16796/…Anele
I was not implying a better / worse relationship. Furthermore there are no cross site duplicates, the inconsistent stack exchange network allows them, unless you cross post yourself :-)Anele
Z
165

Say the data are in file data.txt, you can use the colClasses argument of read.table() to skip columns. Here the data in the first 7 columns are "integer" and we set the remaining 6 columns to "NULL" indicating they should be skipped

> read.table("data.txt", colClasses = c(rep("integer", 7), rep("NULL", 6)), 
+            header = TRUE)
  Year Jan Feb Mar Apr May Jun
1 2009 -41 -27 -25 -31 -31 -39
2 2010 -41 -27 -25 -31 -31 -39
3 2011 -21 -27  -2  -6 -10 -32

Change "integer" to one of the accepted types as detailed in ?read.table depending on the real type of data.

data.txt looks like this:

$ cat data.txt 
"Year" "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
2009 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2010 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2011 -21 -27 -2 -6 -10 -32 -13 -12 -27 -30 -38 -29

and was created by using

write.table(dat, file = "data.txt", row.names = FALSE)

where dat is

dat <- structure(list(Year = 2009:2011, Jan = c(-41L, -41L, -21L), Feb = c(-27L, 
-27L, -27L), Mar = c(-25L, -25L, -2L), Apr = c(-31L, -31L, -6L
), May = c(-31L, -31L, -10L), Jun = c(-39L, -39L, -32L), Jul = c(-25L, 
-25L, -13L), Aug = c(-15L, -15L, -12L), Sep = c(-30L, -30L, -27L
), Oct = c(-27L, -27L, -30L), Nov = c(-21L, -21L, -38L), Dec = c(-25L, 
-25L, -29L)), .Names = c("Year", "Jan", "Feb", "Mar", "Apr", 
"May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), class = "data.frame",
row.names = c(NA, -3L))

If the number of columns is not known beforehand, the utility function count.fields will read through the file and count the number of fields in each line.

## returns a vector equal to the number of lines in the file
count.fields("data.txt", sep = "\t")
## returns the maximum to set colClasses
max(count.fields("data.txt", sep = "\t"))
Zapata answered 26/4, 2011 at 9:7 Comment(9)
@Benjamin Read the first couple of lines from the file using argument nrows. Then work out how many columns there are using ncol(), or however else you want to work out the number of columns to read/ignore. Then read the full file using this info.Zapata
argh...thats ugly but I suppose it will have to doDextrosinistral
?? If you don't know the number of columns how else are you going to determine it without reading a bit of it to deduce how many there are?Zapata
@BlueMagister Thanks for the edit and the mentioning of the count.fields() which automates the process I suggested in the comments.Zapata
Can this principle also applied on rows like some rowClasses? Etc to set NULL on blank row.Ackerman
@LéoLéopoldHertz준영 No, and I'm not sure how such a thing would work for row classes as in a data frame, whilst each column may be of a different type, each row is, by definition and as a result, unconstrained. You will need to filter out blank rows etc upon import.Zapata
Is it not possible to limit count.fields() to scan n rows?. Looks like it reads through the whole table, every row.Ineludible
@rmf you can pass count.fields() a text connection, so, read some subset of the rows using txt <- readLines(....), then create a connection to the read in lines con <- textConnection(txt), then do count.fields(txt). Be sure to use skip in count.fields() to skip the header row if there is one; you can't skip rows in the file using readLines().Zapata
I am trying to use read.table to only read columns whose name is also included in a vector labels <- c("a", "b","c"...). My issue is that I am reading more than one .txt file, and each file has some of the labels in the vector. would there be a way to use read.table and %in% to only read the labels that match with the ones in each .txt file?.Ontogeny
L
119

To read a specific set of columns from a dataset you, there are several other options:

1) With freadfrom the data.table-package:

You can specify the desired columns with the select parameter from fread from the data.table package. You can specify the columns with a vector of column names or column numbers.

For the example dataset:

library(data.table)
dat <- fread("data.txt", select = c("Year","Jan","Feb","Mar","Apr","May","Jun"))
dat <- fread("data.txt", select = c(1:7))

Alternatively, you can use the drop parameter to indicate which columns should not be read:

dat <- fread("data.txt", drop = c("Jul","Aug","Sep","Oct","Nov","Dec"))
dat <- fread("data.txt", drop = c(8:13))

All result in:

> data
  Year Jan Feb Mar Apr May Jun
1 2009 -41 -27 -25 -31 -31 -39
2 2010 -41 -27 -25 -31 -31 -39
3 2011 -21 -27  -2  -6 -10 -32

UPDATE: When you don't want fread to return a data.table, use the data.table = FALSE-parameter, e.g.: fread("data.txt", select = c(1:7), data.table = FALSE)

2) With read.csv.sql from the sqldf-package:

Another alternative is the read.csv.sql function from the sqldf package:

library(sqldf)
dat <- read.csv.sql("data.txt",
                    sql = "select Year,Jan,Feb,Mar,Apr,May,Jun from file",
                    sep = "\t")

3) With the read_*-functions from the readr-package:

library(readr)
dat <- read_table("data.txt",
                  col_types = cols_only(Year = 'i', Jan = 'i', Feb = 'i', Mar = 'i',
                                        Apr = 'i', May = 'i', Jun = 'i'))
dat <- read_table("data.txt",
                  col_types = list(Jul = col_skip(), Aug = col_skip(), Sep = col_skip(),
                                   Oct = col_skip(), Nov = col_skip(), Dec = col_skip()))
dat <- read_table("data.txt", col_types = 'iiiiiii______')

From the documentation an explanation for the used characters with col_types:

each character represents one column: c = character, i = integer, n = number, d = double, l = logical, D = date, T = date time, t = time, ? = guess, or _/- to skip the column

Lyly answered 18/10, 2015 at 18:0 Comment(16)
fread does not support compressed files, however. Large files are usually compressed.Arianna
There is a feature request for enabling this in fread. Worth noticing is that fread will highly probably read the uncompressed file considerably faster than read.table will read the compressed file. See here for an example.Lyly
Some uncompressed files are too large. E.g. I'm working with 1000 Genomes files. They can be 60 GB uncompressed.Arianna
As you probably know, R reads the data in memory. Whether you read the zipped file or the unzipped file doesn't make a difference on the size of the resulting data in memory. If you have 60GB in files, read.table won't save you. In that case, you might want to look at the ff-package.Lyly
I am trying only to load a few lines a time. The problem is that read.table() spends all its time searching for the line to start reading, making it useless for reading large files in increments. They need to fix that function. I had to resort to a non-R solution Plink for this problem. Pity.Arianna
FYI: the question is about reading a limited number of columns, not about reading large files. If you have a problem with reading large files, you can post a new question.Lyly
@Deleet You could use fread to read large compressed files like this: fread("gunzip -c data.txt.gz", drop = c(8:13)).Davina
@Procrastinatus Maximus I would like to echo Deleet's comments w.r.t. to genomics data files. These are often stored compressed to save disk space, but also tend to compress well so that reading it off networked storage then decrypting is faster than reading uncompressed off disk.Akkadian
Did readr ever add a feature to say (a) guess everything, but (b) only read in a list of column names? I don't want to say "a" = col_guess(), "b" = col_guess(), ... I just want to define vars <- c("a", "b", ...) and then call it as col_types = cols_only(vars)Vanden
@MarkWhite see the last possibility of readr; based on that you could also use col_types = '???????______' which guess the first 7 columns and exclude the others; if you don't specify anything, readr will guess the column types for you (so yes, readr will guess everything when you don't specify something)Lyly
@Lyly but that requires me to know the positions of the variables in the data, which I do not know until after I read the data in. I found this solution, which works well for me: #43901643Vanden
@MarkWhite I think for that situation fread works much better; see the first part of my answer.Lyly
@Lyly agreed. Yet that function parses NAs strangely when there is conflicts in guessing, so I like that solution I linked to most, imo. It should be a part of the function itself, probablyVanden
@MarkWhite If you think fread parses NAs strangely when there are conflicts in guessing (which I've never encountered), please report it on the Github-page (i.e. post a 'new issue').Lyly
@Lyly I am trying to use read.table to only read columns whose name is also included in a vector labels <- c("a", "b","c"...). My issue is that I am reading more than one .txt file, and each file has some of the labels in the vector. would there be a way to use read.table or fread and %in% to only read the labels that match with the ones in each .txt file?.Ontogeny
@Lyly edit - I just tried it and it seems like there is no need to specify much - the function read all the columns matching the vector, and returned no error for those inexistent. Wonderful!Ontogeny
D
9

You could also use JDBC to achieve this. Let's create a sample csv file.

write.table(x=mtcars, file="mtcars.csv", sep=",", row.names=F, col.names=T) # create example csv file

Download and save the the CSV JDBC driver from this link: http://sourceforge.net/projects/csvjdbc/files/latest/download

> library(RJDBC)

> path.to.jdbc.driver <- "jdbc//csvjdbc-1.0-18.jar"
> drv <- JDBC("org.relique.jdbc.csv.CsvDriver", path.to.jdbc.driver)
> conn <- dbConnect(drv, sprintf("jdbc:relique:csv:%s", getwd()))

> head(dbGetQuery(conn, "select * from mtcars"), 3)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1   21   6  160 110  3.9  2.62 16.46  0  1    4    4
2   21   6  160 110  3.9 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85  2.32 18.61  1  1    4    1

> head(dbGetQuery(conn, "select mpg, gear from mtcars"), 3)
   MPG GEAR
1   21    4
2   21    4
3 22.8    4
Disease answered 15/3, 2014 at 13:31 Comment(0)
F
5

The vroom package provides a 'tidy' method of selecting / dropping columns by name during import. Docs: https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/#column-selection

Column selection (col_select)

The vroom argument 'col_select' makes selecting columns to keep (or omit) more straightforward. The interface for col_select is the same as dplyr::select().

Select columns by name
data <- vroom("flights.tsv", col_select = c(year, flight, tailnum))
#> Observations: 336,776
#> Variables: 3
#> chr [1]: tailnum
#> dbl [2]: year, flight
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
Drop columns by name
data <- vroom("flights.tsv", col_select = c(-dep_time, -air_time:-time_hour))
#> Observations: 336,776
#> Variables: 13
#> chr [4]: carrier, tailnum, origin, dest
#> dbl [9]: year, month, day, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr...
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
Use the selection helpers
data <- vroom("flights.tsv", col_select = ends_with("time"))
#> Observations: 336,776
#> Variables: 5
#> dbl [5]: dep_time, sched_dep_time, arr_time, sched_arr_time, air_time
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
Or rename columns by name
data <- vroom("flights.tsv", col_select = list(plane = tailnum, everything()))
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
data
#> # A tibble: 336,776 x 19
#>    plane  year month   day dep_time sched_dep_time dep_delay arr_time
#>    <chr> <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1 N142…  2013     1     1      517            515         2      830
#>  2 N242…  2013     1     1      533            529         4      850
#>  3 N619…  2013     1     1      542            540         2      923
#>  4 N804…  2013     1     1      544            545        -1     1004
#>  5 N668…  2013     1     1      554            600        -6      812
#>  6 N394…  2013     1     1      554            558        -4      740
#>  7 N516…  2013     1     1      555            600        -5      913
#>  8 N829…  2013     1     1      557            600        -3      709
#>  9 N593…  2013     1     1      557            600        -3      838
#> 10 N3AL…  2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 11 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
Fungicide answered 24/2, 2021 at 4:33 Comment(0)
G
0

You do it like this:

df = read.table("file.txt", nrows=1, header=TRUE, sep="\t", stringsAsFactors=FALSE)
colClasses = as.list(apply(df, 2, class))
needCols = c("Year", "Jan", "Feb", "Mar", "Apr", "May", "Jun")
colClasses[!names(colClasses) %in% needCols] = list(NULL)
df = read.table("file.txt", header=TRUE, colClasses=colClasses, sep="\t", stringsAsFactors=FALSE)
Gilburt answered 11/4, 2020 at 0:28 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.