How to select all columns whose names start with X in a pandas DataFrame

I have a DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0],})

I want to select the rows that have a value of 1 in any of the columns starting with foo.. Is there a better way to do it than:

df2 = df[(df['foo.aa'] == 1) |
         (df['foo.fighters'] == 1) |
         (df['foo.bars'] == 1) |
         (df['foo.fox'] == 1) |
         (df['foo.manchu'] == 1)]

Something like:

df2 = df[df.STARTS_WITH_FOO == 1]

The result should be a DataFrame like this:

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

[4 rows x 7 columns]
Philippi answered 3/12, 2014 at 15:15 Comment(0)

Just use a list comprehension to build the list of column names:

In [28]:

filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:

df[filter_col]
Out[29]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

Another method is to create a series from the columns and use the vectorised str method startswith:

In [33]:

df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

To filter out the values that don't meet your == 1 criterion, you can mask the whole frame with the boolean comparison (every cell that doesn't match becomes NaN):

In [36]:

df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      NaN       1       NaN           NaN      NaN        NaN     NaN
1      NaN     NaN       NaN             1      NaN        NaN     NaN
2      NaN     NaN       NaN           NaN        1        NaN     NaN
3      NaN     NaN       NaN           NaN      NaN        NaN     NaN
4      NaN     NaN       NaN           NaN      NaN        NaN     NaN
5      NaN     NaN         1           NaN      NaN        NaN     NaN

EDIT

OK, after seeing what you want, the convoluted answer is this:

In [72]:

df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0
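
If the chained expression is hard to follow, the same logic can be broken into named steps (a sketch of what the one-liner above does, not a different method):

foo_cols = df.columns[pd.Series(df.columns).str.startswith('foo')]
masked = df[df[foo_cols] == 1]                  # non-matching cells (and non-foo columns) become NaN
rows = masked.dropna(how='all', axis=0).index   # rows that still hold at least one value, i.e. a foo column equal to 1
df.loc[rows]                                    # the original rows, with all columns intact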
Feverfew answered 3/12, 2014 at 15:21 Comment(2)
Why does a list comprehension over a dataframe (the col in df part) loop over the names of the columns in the dataframe, rather than over each column (so that col would be a Series)? I ask because in R the equivalent for-loop syntax would loop over the vectors that are the columns. (Note that [col for col in df.columns if col.startswith('foo')] gives the right output too, and makes more sense to me.)Lisle
@RichardDiSalvo the list comprehension iterates over df.columns, not the dataframe as a whole. Think of a dataframe as a dictionary: when iterating over a dictionary you go through the keys; if you want the values, you call .items(). So your example that uses df.columns is the same as the solution provided here.Tobiastobie
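
A quick way to see this (a sketch, using the question's df): iterating over the frame yields the column labels, exactly as iterating over a dict yields its keys, while .items() yields (label, Series) pairs:

for col in df:                     # equivalent to: for col in df.columns
    print(col)                     # prints the column names, not the data

for name, series in df.items():    # dict-like access to the actual columns
    print(name, series.dtype)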

Now that pandas' indexes support string operations, arguably the simplest and best way to select columns beginning with 'foo' is just:

df.loc[:, df.columns.str.startswith('foo')]

Alternatively, you can filter column (or row) labels with df.filter(). To specify a regular expression to match the names beginning with foo.:

>>> df.filter(regex=r'^foo\.', axis=1)
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

To select only the required rows (those containing a 1) and only the foo columns, you can use loc, picking the columns with filter (or any other method) and the rows with any:

>>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo\.', axis=1).columns]
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0
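
Note that (df == 1).any(axis=1) looks for a 1 anywhere in the row, not only in the foo columns. If, as in the question's original condition, the 1 must come from a column starting with foo, you could build the row mask from the filtered sub-frame instead (a sketch; it returns the same rows for this particular data):

foo = df.filter(regex=r'^foo\.', axis=1)
df.loc[(foo == 1).any(axis=1), foo.columns]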
Millionaire answered 3/12, 2014 at 15:28 Comment(1)
This is the answer I came here for, which matches the question title. What the OP actually wanted was more like "Best way to select rows with a filter based only on columns starting with x".Aggrandize

The simplest way is to use the str accessor directly on the column index; there is no need to wrap the columns in pd.Series:

df.loc[:, df.columns.str.startswith("foo")]


Incontrovertible answered 22/10, 2019 at 16:7 Comment(0)

In my case I needed a list of prefixes:

colsToScale = ["production", "test", "development"]
dc[dc.columns[dc.columns.str.startswith(tuple(colsToScale))]]
Pascal answered 18/8, 2020 at 23:42 Comment(0)

You can use the filter method with the like parameter:

df.filter(like='foo')
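
Note that like keeps any label that contains the given string, so with the question's columns this would also pick up nas.foo. To anchor the match at the start of the name, use the regex parameter instead (as in the earlier answers):

df.filter(regex='^foo')   # only columns whose names start with foo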
Playbook answered 17/9, 2021 at 6:27 Comment(0)

You can use a regex here to select the columns starting with "foo":

df.filter(regex='^foo*')

If you only need the string foo to appear somewhere in the column name, then

df.filter(regex='foo*')

would be appropriate.

For the next step, you can use

df[df.filter(regex='^foo*').values==1]

to keep the rows where one of the foo columns has the value 1.

Wirework answered 16/6, 2020 at 3:46 Comment(1)
* at the end of regexes does not make sense — we are not looking for foooooooo. It seems you wanted ^foo.* instead. In fact, one can simply remove it, as df.filter does not require full match (from beginning to end), so ^foo will work as well.Emilie

Based on @EdChum's answer, you can try the following solution:

df[df.columns[pd.Series(df.columns).str.contains("foo")]]

This is really helpful when not all the columns you want to select start with foo: this method selects every column that contains the substring foo, wherever it appears in the column name.

In essence, I replaced .startswith() with .contains().
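
With the question's columns this matters: contains("foo") also matches nas.foo, which startswith("foo") would not. A quick check (a sketch):

df.columns[df.columns.str.contains("foo")]
# includes 'nas.foo' in addition to the five foo.* columns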

Bestride answered 24/9, 2019 at 15:50 Comment(0)

Another option for the selection of the desired entries is to use map:

df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]

which gives you all the columns for rows that contain a 1:

   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

The row selection is done by

(df == 1).any(axis=1)

as in @ajcr's answer which gives you:

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

meaning that rows 3 and 4 do not contain a 1 and won't be selected.

The selection of the columns is done using Boolean indexing like this:

df.columns.map(lambda x: x.startswith('foo'))

In the example above this returns

array([False,  True,  True,  True,  True,  True, False], dtype=bool)

So, if a column does not start with foo, False is returned and the column is therefore not selected.

If you just want to return all rows that contain a 1 - as your desired output suggests - you can simply do

df.loc[(df == 1).any(axis=1)]

which returns

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0
Veach answered 4/2, 2016 at 18:11 Comment(0)

You can even do this for multiple prefixes:

temp = df.loc[:, df.columns.str.startswith(('prefix1', 'prefix2', 'prefix3'))]
Judicative answered 21/7, 2022 at 6:51 Comment(0)

I do not like that the other solutions require us to refer to the DataFrame twice; that is fine if you have only one frame named df, but often that is not the case (and the actual name may be much longer). Let's abuse pandas' indexing capabilities to type less and make the code more readable. Nothing stops us from writing something like this:

df.loc[:, columns.startswith('foo')]

This works because the indexer passed to .loc can be any callable. We can even assign this pseudo-indexer to a variable and use it for multiple frames:

foo_columns = columns.startswith('foo')
df_1.loc[:, foo_columns]
df_2.loc[:, foo_columns]

We can even make it pretty-print:

> foo_columns
<function __main__.PandasIndexer:columns.str.startswith(pat='foo')()>

And we can use any other method of the str accessor, e.g. columns.contains(r'bar\d', regex=True), all while getting useful signatures:

> columns.contains
<function __main__.PandasIndexer:columns.str.contains(pat, case=True, flags=0, na=None, regex=True)>

All with this short magic code:

from pandas import Series
from inspect import signature, Signature


class PandasIndexer:
    def __init__(self, axis_name, accessor='str'):
        """
        Args:
            - axis_name: `columns` or `index`
            - accessor: e.g. `str`, or `dt`
        """
        self._axis_name = axis_name
        self._accessor = accessor
        self._dummy_series = Series(dtype=object)

    def _create_indexer(self, attribute):
        dummy_accessor = getattr(self._dummy_series, self._accessor)
        dummy_attr = getattr(dummy_accessor, attribute)
        name = f'PandasIndexer:{self._axis_name}.{self._accessor}.{attribute}'

        def indexer_factory(*args, **kwargs):
            def indexer(df):
                axis = getattr(df, self._axis_name)
                accessor = getattr(axis, self._accessor)
                method = getattr(accessor, attribute)
                return method(*args, **kwargs)

            bound_arguments = signature(dummy_attr).bind(*args, **kwargs)
            indexer.__qualname__ = (
                name + str(bound_arguments).replace('<BoundArguments ', '')[:-1]
            )
            indexer.__signature__ = Signature()
            return indexer

        indexer_factory.__name__ = name
        indexer_factory.__qualname__ = name
        indexer_factory.__signature__ = signature(dummy_attr)
        return indexer_factory

    def __getattr__(self, attribute):
        return self._create_indexer(attribute)

    def __dir__(self):
        """Make it work with auto-complete in IPython"""
        return dir(getattr(self._dummy_series, self._accessor))


columns = PandasIndexer('columns')
Enesco answered 10/5, 2021 at 13:3 Comment(1)
I actually like to set it as Column = PandasIndexer('columns') so that it is obvious that I am playing with magic behaviour rather than using a global variable, as in df.loc[:, Column.startswith('foo')]; this also makes it reminiscent of SQLAlchemy (and intuitive to those who have used such an ORM).Enesco

My solution. It may be slower performance-wise:

a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a.sort_index()


   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0
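
One caveat, as an aside (it does not affect this particular data): if a row has a 1 in more than one foo column, the concat will include that row once per match. Dropping duplicated index entries handles it:

a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a = a[~a.index.duplicated()]   # keep each matching row only once
a.sort_index()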
Semela answered 3/12, 2014 at 16:21 Comment(0)

One option is with pyjanitor's select function:

# pip install pyjanitor
import janitor 
import pandas as pd

df.select(columns='foo*')
Out[32]: 
   foo.aa  foo.fighters  foo.bars  foo.fox foo.manchu
0     1.0           0.0         0        2         NA
1     2.1           1.0         0        4          0
2     NaN           NaN         0        1          0
3     4.7           0.0         0        0          0
4     5.6           0.0         0        0          0
5     6.8           0.0         1        5          0

To get your expected answer, selecting on both rows and columns:

df.select(rows=df.select(columns='foo*').eq(1).any(axis=1), columns='foo*')
Out[36]: 
   foo.aa  foo.fighters  foo.bars  foo.fox foo.manchu
0     1.0           0.0         0        2         NA
1     2.1           1.0         0        4          0
2     NaN           NaN         0        1          0
5     6.8           0.0         1        5          0

Of course you can just pass the boolean to .loc:

df.loc[df.select(columns='foo*').eq(1).any(axis=1)]
Out[38]: 
   foo.aa  foo.fighters  foo.bars  bar.baz  foo.fox nas.foo foo.manchu
0     1.0           0.0         0      5.0        2      NA         NA
1     2.1           1.0         0      5.0        4       0          0
2     NaN           NaN         0      6.0        1       1          0
5     6.8           0.0         1      6.8        5       0          0
Tobiastobie answered 7/10, 2023 at 12:3 Comment(0)
