Only read selected columns

Asked 26/4, 2011 at 8:58 Answered 24/2, 2021 at 4:33

167

Can anyone please tell me how to read only the first 6 months (7 columns) for each year of the data below, for example by using read.table()?

Year   Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec   
2009   -41  -27  -25  -31  -31  -39  -25  -15  -30  -27  -21  -25
2010   -41  -27  -25  -31  -31  -39  -25  -15  -30  -27  -21  -25 
2011   -21  -27   -2   -6  -10  -32  -13  -12  -27  -30  -38  -29

Mcguinness answered 26/4, 2011 at 8:58 Comment(3)

It's duplicate of Ways to read only select columns from a file into R?, Dirk mention about NULL as column class in his answer. – Koine 26/4, 2011 at 10:19

stats.stackexchange.com/questions/16796/… – Anele 13/10, 2015 at 7:24

I was not implying a better / worse relationship. Furthermore there are no cross site duplicates, the inconsistent stack exchange network allows them, unless you cross post yourself :-) – Anele 15/5, 2018 at 20:13

165

Say the data are in file data.txt, you can use the colClasses argument of read.table() to skip columns. Here the data in the first 7 columns are "integer" and we set the remaining 6 columns to "NULL" indicating they should be skipped

> read.table("data.txt", colClasses = c(rep("integer", 7), rep("NULL", 6)), 
+            header = TRUE)
  Year Jan Feb Mar Apr May Jun
1 2009 -41 -27 -25 -31 -31 -39
2 2010 -41 -27 -25 -31 -31 -39
3 2011 -21 -27  -2  -6 -10 -32

Change "integer" to one of the accepted types as detailed in ?read.table depending on the real type of data.

data.txt looks like this:

$ cat data.txt 
"Year" "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
2009 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2010 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2011 -21 -27 -2 -6 -10 -32 -13 -12 -27 -30 -38 -29

and was created by using

write.table(dat, file = "data.txt", row.names = FALSE)

where dat is

dat <- structure(list(Year = 2009:2011, Jan = c(-41L, -41L, -21L), Feb = c(-27L, 
-27L, -27L), Mar = c(-25L, -25L, -2L), Apr = c(-31L, -31L, -6L
), May = c(-31L, -31L, -10L), Jun = c(-39L, -39L, -32L), Jul = c(-25L, 
-25L, -13L), Aug = c(-15L, -15L, -12L), Sep = c(-30L, -30L, -27L
), Oct = c(-27L, -27L, -30L), Nov = c(-21L, -21L, -38L), Dec = c(-25L, 
-25L, -29L)), .Names = c("Year", "Jan", "Feb", "Mar", "Apr", 
"May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), class = "data.frame",
row.names = c(NA, -3L))

If the number of columns is not known beforehand, the utility function count.fields will read through the file and count the number of fields in each line.

## returns a vector equal to the number of lines in the file
count.fields("data.txt", sep = "\t")
## returns the maximum to set colClasses
max(count.fields("data.txt", sep = "\t"))

Zapata answered 26/4, 2011 at 9:7 Comment(9)

@Benjamin Read the first couple of lines from the file using argument nrows. Then work out how many columns there are using ncol(), or however else you want to work out the number of columns to read/ignore. Then read the full file using this info. – Zapata 29/11, 2012 at 0:42

argh...thats ugly but I suppose it will have to do – Dextrosinistral 29/11, 2012 at 0:44

?? If you don't know the number of columns how else are you going to determine it without reading a bit of it to deduce how many there are? – Zapata 29/11, 2012 at 0:45

@BlueMagister Thanks for the edit and the mentioning of the count.fields() which automates the process I suggested in the comments. – Zapata 20/9, 2013 at 19:3

Can this principle also applied on rows like some rowClasses? Etc to set NULL on blank row. – Ackerman 13/5, 2017 at 17:3

@LéoLéopoldHertz준영 No, and I'm not sure how such a thing would work for row classes as in a data frame, whilst each column may be of a different type, each row is, by definition and as a result, unconstrained. You will need to filter out blank rows etc upon import. – Zapata 15/5, 2017 at 18:53

Is it not possible to limit count.fields() to scan n rows?. Looks like it reads through the whole table, every row. – Ineludible 22/1, 2018 at 21:42

@rmf you can pass count.fields() a text connection, so, read some subset of the rows using txt <- readLines(....), then create a connection to the read in lines con <- textConnection(txt), then do count.fields(txt). Be sure to use skip in count.fields() to skip the header row if there is one; you can't skip rows in the file using readLines(). – Zapata 22/1, 2018 at 21:57

I am trying to use read.table to only read columns whose name is also included in a vector labels <- c("a", "b","c"...). My issue is that I am reading more than one .txt file, and each file has some of the labels in the vector. would there be a way to use read.table and %in% to only read the labels that match with the ones in each .txt file?. – Ontogeny 29/4, 2020 at 18:55

119

To read a specific set of columns from a dataset you, there are several other options:

1) With freadfrom the data.table-package:

You can specify the desired columns with the select parameter from fread from the data.table package. You can specify the columns with a vector of column names or column numbers.

For the example dataset:

library(data.table)
dat <- fread("data.txt", select = c("Year","Jan","Feb","Mar","Apr","May","Jun"))
dat <- fread("data.txt", select = c(1:7))

Alternatively, you can use the drop parameter to indicate which columns should not be read:

dat <- fread("data.txt", drop = c("Jul","Aug","Sep","Oct","Nov","Dec"))
dat <- fread("data.txt", drop = c(8:13))

All result in:

> data
  Year Jan Feb Mar Apr May Jun
1 2009 -41 -27 -25 -31 -31 -39
2 2010 -41 -27 -25 -31 -31 -39
3 2011 -21 -27  -2  -6 -10 -32

UPDATE: When you don't want fread to return a data.table, use the data.table = FALSE-parameter, e.g.: fread("data.txt", select = c(1:7), data.table = FALSE)

2) With read.csv.sql from the sqldf-package:

Another alternative is the read.csv.sql function from the sqldf package:

library(sqldf)
dat <- read.csv.sql("data.txt",
                    sql = "select Year,Jan,Feb,Mar,Apr,May,Jun from file",
                    sep = "\t")

3) With the read_*-functions from the readr-package:

library(readr)
dat <- read_table("data.txt",
                  col_types = cols_only(Year = 'i', Jan = 'i', Feb = 'i', Mar = 'i',
                                        Apr = 'i', May = 'i', Jun = 'i'))
dat <- read_table("data.txt",
                  col_types = list(Jul = col_skip(), Aug = col_skip(), Sep = col_skip(),
                                   Oct = col_skip(), Nov = col_skip(), Dec = col_skip()))
dat <- read_table("data.txt", col_types = 'iiiiiii______')

From the documentation an explanation for the used characters with col_types:

each character represents one column: c = character, i = integer, n = number, d = double, l = logical, D = date, T = date time, t = time, ? = guess, or _/- to skip the column

Lyly answered 18/10, 2015 at 18:0 Comment(16)

fread does not support compressed files, however. Large files are usually compressed. – Arianna 28/2, 2016 at 9:3

There is a feature request for enabling this in fread. Worth noticing is that fread will highly probably read the uncompressed file considerably faster than read.table will read the compressed file. See here for an example. – Lyly 28/2, 2016 at 10:44

Some uncompressed files are too large. E.g. I'm working with 1000 Genomes files. They can be 60 GB uncompressed. – Arianna 28/2, 2016 at 10:45

As you probably know, R reads the data in memory. Whether you read the zipped file or the unzipped file doesn't make a difference on the size of the resulting data in memory. If you have 60GB in files, read.table won't save you. In that case, you might want to look at the ff-package. – Lyly 20/4, 2016 at 8:42

I am trying only to load a few lines a time. The problem is that read.table() spends all its time searching for the line to start reading, making it useless for reading large files in increments. They need to fix that function. I had to resort to a non-R solution Plink for this problem. Pity. – Arianna 21/4, 2016 at 5:12

FYI: the question is about reading a limited number of columns, not about reading large files. If you have a problem with reading large files, you can post a new question. – Lyly 21/4, 2016 at 6:26

@Deleet You could use fread to read large compressed files like this: fread("gunzip -c data.txt.gz", drop = c(8:13)). – Davina 23/6, 2016 at 9:41

@Procrastinatus Maximus I would like to echo Deleet's comments w.r.t. to genomics data files. These are often stored compressed to save disk space, but also tend to compress well so that reading it off networked storage then decrypting is faster than reading uncompressed off disk. – Akkadian 28/10, 2016 at 17:40

Did readr ever add a feature to say (a) guess everything, but (b) only read in a list of column names? I don't want to say "a" = col_guess(), "b" = col_guess(), ... I just want to define vars <- c("a", "b", ...) and then call it as col_types = cols_only(vars) – Vanden 30/9, 2017 at 19:8

@MarkWhite see the last possibility of readr; based on that you could also use col_types = '???????______' which guess the first 7 columns and exclude the others; if you don't specify anything, readr will guess the column types for you (so yes, readr will guess everything when you don't specify something) – Lyly 30/9, 2017 at 21:32

@Lyly but that requires me to know the positions of the variables in the data, which I do not know until after I read the data in. I found this solution, which works well for me: #43901643 – Vanden 30/9, 2017 at 21:35

@MarkWhite I think for that situation fread works much better; see the first part of my answer. – Lyly 30/9, 2017 at 21:39

@Lyly agreed. Yet that function parses NAs strangely when there is conflicts in guessing, so I like that solution I linked to most, imo. It should be a part of the function itself, probably – Vanden 30/9, 2017 at 21:45

@MarkWhite If you think fread parses NAs strangely when there are conflicts in guessing (which I've never encountered), please report it on the Github-page (i.e. post a 'new issue'). – Lyly 30/9, 2017 at 21:49

@Lyly I am trying to use read.table to only read columns whose name is also included in a vector labels <- c("a", "b","c"...). My issue is that I am reading more than one .txt file, and each file has some of the labels in the vector. would there be a way to use read.table or fread and %in% to only read the labels that match with the ones in each .txt file?. – Ontogeny 29/4, 2020 at 19:0

@Lyly edit - I just tried it and it seems like there is no need to specify much - the function read all the columns matching the vector, and returned no error for those inexistent. Wonderful! – Ontogeny 29/4, 2020 at 19:8

You could also use JDBC to achieve this. Let's create a sample csv file.

write.table(x=mtcars, file="mtcars.csv", sep=",", row.names=F, col.names=T) # create example csv file

Download and save the the CSV JDBC driver from this link: http://sourceforge.net/projects/csvjdbc/files/latest/download

> library(RJDBC)

> path.to.jdbc.driver <- "jdbc//csvjdbc-1.0-18.jar"
> drv <- JDBC("org.relique.jdbc.csv.CsvDriver", path.to.jdbc.driver)
> conn <- dbConnect(drv, sprintf("jdbc:relique:csv:%s", getwd()))

> head(dbGetQuery(conn, "select * from mtcars"), 3)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1   21   6  160 110  3.9  2.62 16.46  0  1    4    4
2   21   6  160 110  3.9 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85  2.32 18.61  1  1    4    1

> head(dbGetQuery(conn, "select mpg, gear from mtcars"), 3)
   MPG GEAR
1   21    4
2   21    4
3 22.8    4

Disease answered 15/3, 2014 at 13:31 Comment(0)

The vroom package provides a 'tidy' method of selecting / dropping columns by name during import. Docs: https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/#column-selection

Column selection (col_select)

The vroom argument 'col_select' makes selecting columns to keep (or omit) more straightforward. The interface for col_select is the same as dplyr::select().

Select columns by name

data <- vroom("flights.tsv", col_select = c(year, flight, tailnum))
#> Observations: 336,776
#> Variables: 3
#> chr [1]: tailnum
#> dbl [2]: year, flight
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

Drop columns by name

data <- vroom("flights.tsv", col_select = c(-dep_time, -air_time:-time_hour))
#> Observations: 336,776
#> Variables: 13
#> chr [4]: carrier, tailnum, origin, dest
#> dbl [9]: year, month, day, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr...
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
Use the selection helpers
data <- vroom("flights.tsv", col_select = ends_with("time"))
#> Observations: 336,776
#> Variables: 5
#> dbl [5]: dep_time, sched_dep_time, arr_time, sched_arr_time, air_time
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

Or rename columns by name

data <- vroom("flights.tsv", col_select = list(plane = tailnum, everything()))
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
data
#> # A tibble: 336,776 x 19
#>    plane  year month   day dep_time sched_dep_time dep_delay arr_time
#>    <chr> <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1 N142…  2013     1     1      517            515         2      830
#>  2 N242…  2013     1     1      533            529         4      850
#>  3 N619…  2013     1     1      542            540         2      923
#>  4 N804…  2013     1     1      544            545        -1     1004
#>  5 N668…  2013     1     1      554            600        -6      812
#>  6 N394…  2013     1     1      554            558        -4      740
#>  7 N516…  2013     1     1      555            600        -5      913
#>  8 N829…  2013     1     1      557            600        -3      709
#>  9 N593…  2013     1     1      557            600        -3      838
#> 10 N3AL…  2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 11 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

Fungicide answered 24/2, 2021 at 4:33 Comment(0)

You do it like this:

df = read.table("file.txt", nrows=1, header=TRUE, sep="\t", stringsAsFactors=FALSE)
colClasses = as.list(apply(df, 2, class))
needCols = c("Year", "Jan", "Feb", "Mar", "Apr", "May", "Jun")
colClasses[!names(colClasses) %in% needCols] = list(NULL)
df = read.table("file.txt", header=TRUE, colClasses=colClasses, sep="\t", stringsAsFactors=FALSE)

Gilburt answered 11/4, 2020 at 0:28 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Column selection (col_select)

Recommended topics

Hot tags