How to select range of columns in a dataframe based on their name and not their indexes?
Asked Answered
D

5

7

In a pandas dataframe created like this:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(10, size=(6, 6)),
                  columns=['c' + str(i) for i in range(6)],
                  index=["r" + str(i) for i in range(6)])

which could look as follows:

    c0  c1  c2  c3  c4  c5
r0   2   7   3   3   2   8
r1   6   9   6   7   9   1
r2   4   0   9   8   4   2
r3   9   0   4   3   5   4
r4   7   6   8   8   0   8
r5   0   6   1   8   2   2

I can easily select certain rows and/or a range of columns using .loc:

print df.loc[['r1', 'r5'], 'c1':'c4']

That would return:

    c1  c2  c3  c4
r1   9   6   7   9
r5   6   1   8   2

So, particular rows/columns I can select in a list, a range of rows/columns using a colon.

How would one do this in R? Here and here one always has to specify the desired range of columns by their index but one cannot - or at least I did not find it - access those by name. To give an example:

df <- data.frame(c1=1:6, c2=2:7, c3=3:8, c4=4:9, c5=5:10, c6=6:11)
rownames(df) <- c('r1', 'r2', 'r3', 'r4', 'r5', 'r6')

The command

df[c('r1', 'r5'),'c1':'c4']

does not work and throws an error. The only thing that worked for me is

df[c('r1', 'r5'), 1:4]

which returns

   c1 c2 c3 c4
r1  1  2  3  4
r5  5  6  7  8

But how would I select the columns by their name and not by their index (which might be important when I drop certain columns throughout the analysis)? In this particular case I could of course use grep but how about columns that have arbitrary names?

So I don't want to use

df[c('r1', 'r5'),c('c1','c2', 'c3', 'c4')]

but an actual slice.

EDIT:

A follow-up question can be found here.

Doubtful answered 8/6, 2016 at 22:45 Comment(7)
@downvoter: Could you please explain why you downvoted? It is a clear question with a minimal reproducible example, so what is the problem?Doubtful
Feels a little like moving the goalposts to ask a question specifically about columns and then edit it to include rows after an answer is submitted. Might be better to roll back the edits and ask a new question. They seem very related, but column names and row names are treated quite differently in R data frames. (Though not my downvote - and not sure if this is the reason.)Drayton
@Gregor: Ok, I am not that familiar with the R specifics, in Pandas it is straight forward, I can just do: df.loc['r1':'r3', 'c1':'c4'] so rows and columns are not treated that differently. I will of course accept the provided answer (see my first comment below it) if nothing better shows up because it indeed answered the original question. And yes, you are right, I should have put the row selection in the original question, so I understand your downvote; thanks for explaining it!Doubtful
Even from the start, I think it would work better as a separate question (but still no downvotes from me). I can think of three good methods for columns: base::subset as in the answer or dplyr::select or data.table. I don't know of any methods for row names other than hacking one. But it would make a good separate question and maybe garner some specific interest/innovative solutions.Drayton
@Gregor: Ok, thanks. As mentioned, I had no idea that row and column selection are treated that differently, I edit my question again; thanks for making me aware of that.Doubtful
@Cleb, downvotes. likely because you have python and pandas tag, and the question is about "R".Supinator
@Merlin: That was to attract the pandas experts that might know how to do this in R. Wrong tags would be strange reasons for a downvote in my opinion but well,... thanks for editing.Doubtful
I
15

It looks like you can accomplish this with a subset:

> df <- data.frame(c1=1:6, c2=2:7, c3=3:8, c4=4:9, c5=5:10, c6=6:11)
> rownames(df) <- c('r1', 'r2', 'r3', 'r4', 'r5', 'r6')
> subset(df, select=c1:c4)
   c1 c2 c3 c4
r1  1  2  3  4
r2  2  3  4  5
r3  3  4  5  6
r4  4  5  6  7
r5  5  6  7  8
r6  6  7  8  9
> subset(df, select=c1:c2)
   c1 c2
r1  1  2
r2  2  3
r3  3  4
r4  4  5
r5  5  6
r6  6  7

If you want to subset by row name range, this hack would do:

> gRI <- function(df, rName) {which(match(rNames, rName) == 1)}
> df[gRI(df,"r2"):gRI(df,"r4"),]
   c1 c2 c3 c4 c5 c6
r2  2  3  4  5  6  7
r3  3  4  5  6  7  8
r4  4  5  6  7  8  9
Isometric answered 8/6, 2016 at 22:51 Comment(5)
That indeed works. How would you now also select rows at the same time? If you want specific rows then subset(df[c('r1', 'r3'),], select=c1:c4) would work but how about a range of rows (see my edit)? Upvote for now and might accept it later on depending on other answers' quality...Doubtful
I think standard practice would be to not name your rows and then use the standard index ranges to subset the rows. If you need row names you could always add them as an id column.Isometric
That might be a nice work-around. But still feels strange that this shouldn't be possible.Doubtful
See my recent edit for a hack to do row name range subsetting. Most R that I see does df[beginInd:endInd,] type row subsettingIsometric
Yes, subsetting by index seems to be more common but I am still surprised that there is no built-in for that.Doubtful
P
2

An alternative approach to subset if you don't mind to work with data.table would be:

data.table::setDT(df)
df[1:3, c2:c4, with=F]
   c2 c3 c4
1:  2  3  4
2:  3  4  5
3:  4  5  6

This still does not solve the problem of subsetting row range though.

Pisces answered 8/6, 2016 at 23:26 Comment(1)
Ok, thanks for the alternative (upvoted). The row selection was not part of the original question; I did not expect it to be that different from the column selection so that would rather be a bonus ;)Doubtful
B
2

A solution using dplyr package but you need to specify the row you want to select before hand

rowName2Match <- c("r1", "r5")

df1 <- df %>% 
  select(matches("2"):matches("4")) %>% 
  add_rownames() %>% 
  mutate(idRow = match(rowname, rowName2Match)) %>% 
  slice(which(!is.na(idRow))) %>% 
  select(-idRow)
df1

> df1
Source: local data frame [2 x 4]

  rowname    c2    c3    c4
   <chr> <int> <int> <int>
1      r1     2     3     4
2      r5     6     7     8
Bilbao answered 8/6, 2016 at 23:37 Comment(1)
Thanks for the alternative solution but seems far more complicated than @evan058's solution.Doubtful
P
1

Adding onto @evan058's answer:

subset(df[rownames(df) %in% c("r3", "r4", "r5"),], select=c1:c4)

c1 c2 c3 c4
r3  3  4  5  6
r4  4  5  6  7
r5  5  6  7  8

But note, the : operator will probably not work here; you will have to write out the name of each row you want to include explicitly. It might be easier to group by a particular value of one of your other columns or to create an index column as @evan058 mentioned in comments.

Penn answered 8/6, 2016 at 23:11 Comment(1)
Thanks but then subset(df[c('r1', 'r3'),], select=c1:c4) seems more convenient. But I actually want to avoid to specify the row names. Upvoted anyway ;)Doubtful
C
-1

This seems way too easy so perhaps I'm doing something wrong.

df <- data.frame(c1=1:6, c2=2:7, c3=3:8, c4=4:9, c5=5:10, c6=6:11,
                 row.names=c('r1', 'r2', 'r3', 'r4', 'r5', 'r6'))


df[c('r1','r2'),c('c1','c2')]

   c1 c2
r1  1  2
r2  2  3
Capello answered 8/6, 2016 at 23:23 Comment(1)
That requires specifying the rows and columns which I want to avoid. PS: I did not downvoteDoubtful

© 2022 - 2024 — McMap. All rights reserved.