Python Pandas: Get index of rows where column matches certain value

8

544

Given a DataFrame with a column "BoolCol", we want to find the indexes of the rows in which the value of "BoolCol" is True.

I currently have the iterating way to do it, which works perfectly:

for i in range(100, 3000):
    if df.iloc[i]['BoolCol'] == True:
        print(i, df.iloc[i]['BoolCol'])

But this is not the correct pandas way to do it. After some research, I am currently using this code:

df[df['BoolCol'] == True].index.tolist()

This gives me a list of indexes, but they don't match when I check them with:

df.iloc[i]['BoolCol']

The result is actually False!!

Which would be the correct pandas way to do this?

Disavow answered 15/2, 2014 at 16:18 Comment(0)
788

df.iloc[i] returns the ith row of df. Here i does not refer to the index label; it is a 0-based row position.

In contrast, the attribute index returns actual index labels, not numeric row-indices:

df.index[df['BoolCol'] == True].tolist()

or equivalently,

df.index[df['BoolCol']].tolist()

You can see the difference quite clearly by playing with a DataFrame whose non-default index does not equal each row's numerical position:

df = pd.DataFrame({'BoolCol': [True, False, False, True, True]},
       index=[10,20,30,40,50])

In [53]: df
Out[53]: 
   BoolCol
10    True
20   False
30   False
40    True
50    True

[5 rows x 1 columns]

In [54]: df.index[df['BoolCol']].tolist()
Out[54]: [10, 40, 50]

If you want to use the index,

In [56]: idx = df.index[df['BoolCol']]

In [57]: idx
Out[57]: Int64Index([10, 40, 50], dtype='int64')

then you can select the rows using loc instead of iloc:

In [58]: df.loc[idx]
Out[58]: 
   BoolCol
10    True
40    True
50    True

[3 rows x 1 columns]

Note that loc can also accept boolean arrays:

In [55]: df.loc[df['BoolCol']]
Out[55]: 
   BoolCol
10    True
40    True
50    True

[3 rows x 1 columns]

If you have a boolean array, mask, and need ordinal index values, you can compute them using np.flatnonzero:

In [110]: np.flatnonzero(df['BoolCol'])
Out[112]: array([0, 3, 4])

Use df.iloc to select rows by ordinal index:

In [113]: df.iloc[np.flatnonzero(df['BoolCol'])]
Out[113]: 
   BoolCol
10    True
40    True
50    True
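
To connect this back to the question, here is a minimal sketch of why the check with iloc can report False. The shuffled integer index below is a made-up example, not the asker's data.

```python
import pandas as pd

# Hypothetical data: integer labels that do not match row positions.
df = pd.DataFrame({'BoolCol': [True, False, False, True, True]},
                  index=[3, 4, 0, 1, 2])

labels = df.index[df['BoolCol']].tolist()   # index labels of the True rows

# loc looks rows up by label, so every check is True.
by_label = [bool(df.loc[i, 'BoolCol']) for i in labels]

# iloc treats the same numbers as 0-based positions, so it reads other rows.
by_position = [bool(df.iloc[i]['BoolCol']) for i in labels]
```

With labels that happen to be integers, iloc raises no error, which is why the mismatch is easy to miss.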
French answered 15/2, 2014 at 16:28 Comment(0)
50

This can be done using NumPy's where() function:

import pandas as pd
import numpy as np

In [716]: df = pd.DataFrame({"gene_name": ['SLC45A1', 'NECAP2', 'CLIC4', 'ADC', 'AGBL4'] , "BoolCol": [False, True, False, True, True] },
       index=list("abcde"))

In [717]: df
Out[717]: 
  BoolCol gene_name
a   False   SLC45A1
b    True    NECAP2
c   False     CLIC4
d    True       ADC
e    True     AGBL4

In [718]: np.where(df["BoolCol"] == True)
Out[718]: (array([1, 3, 4]),)

In [719]: select_indices = list(np.where(df["BoolCol"] == True)[0])

In [720]: df.iloc[select_indices]
Out[720]: 
  BoolCol gene_name
b    True    NECAP2
d    True       ADC
e    True     AGBL4

You don't always need the index for a match, but in case you do:

In [796]: df.iloc[select_indices].index
Out[796]: Index([u'b', u'd', u'e'], dtype='object')

In [797]: df.iloc[select_indices].index.tolist()
Out[797]: ['b', 'd', 'e']
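
If only the labels are needed, the positions from np.where can index df.index directly, without selecting the rows first. A small sketch using the same data as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"gene_name": ['SLC45A1', 'NECAP2', 'CLIC4', 'ADC', 'AGBL4'],
                   "BoolCol": [False, True, False, True, True]},
                  index=list("abcde"))

# np.where on a 1-D Boolean array returns a one-element tuple of positions.
positions = np.where(df["BoolCol"].to_numpy())[0]

# Index objects accept integer position arrays, so no row selection is needed.
labels = df.index[positions].tolist()
```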
Postpone answered 30/12, 2016 at 21:53 Comment(0)
48

If you want to use your dataframe object only once, use:

df['BoolCol'].loc[lambda x: x==True].index
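
The callable form is handy when the DataFrame is built inside a method chain and never bound to a name. A sketch, where the NumCol values and the assign step are invented for illustration:

```python
import pandas as pd

# The intermediate frame is never assigned to a variable; the lambda
# receives the BoolCol Series and returns it as the Boolean mask.
idx = (
    pd.DataFrame({'NumCol': [1, -2, 3, -4]}, index=[10, 20, 30, 40])
    .assign(BoolCol=lambda d: d['NumCol'] > 0)
    ['BoolCol']
    .loc[lambda s: s]
    .index
)
```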
Sometimes answered 8/1, 2020 at 11:30 Comment(0)
5

First, you can use query when the target column is of type bool (for how to use it, see the pandas documentation for query):

df.query('BoolCol')
Out[123]: 
    BoolCol
10     True
40     True
50     True

After filtering the original df by the Boolean column, we can pick out the index:

df=df.query('BoolCol')
df.index
Out[125]: Int64Index([10, 40, 50], dtype='int64')

The underlying NumPy array also has nonzero(); we select the positions of the True rows and use them to slice the DataFrame or its index:

df.index[df.BoolCol.values.nonzero()[0]]
Out[128]: Int64Index([10, 40, 50], dtype='int64')
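
A compact variant, as a sketch: filtering and reading the index in one expression avoids overwriting df. On pandas 2.0 and later the result prints as Index([...], dtype='int64') rather than Int64Index.

```python
import pandas as pd

df = pd.DataFrame({'BoolCol': [True, False, False, True, True]},
                  index=[10, 20, 30, 40, 50])

# Filter and read the index in one expression, leaving df itself untouched.
true_labels = df.query('BoolCol').index.tolist()

# The query string can also negate the column.
false_labels = df.query('not BoolCol').index.tolist()
```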
Dani answered 9/1, 2019 at 22:24 Comment(0)
5

A simple way is to reset the index of the DataFrame prior to filtering:

df_reset = df.reset_index()
df_reset[df_reset['BoolCol']].index.tolist()

Bit hacky, but it's quick!
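
Note that after reset_index the result is a list of row positions, not the original labels. The old labels move into a column, named 'index' by default when the index has no name. A sketch with the example frame used earlier on this page:

```python
import pandas as pd

df = pd.DataFrame({'BoolCol': [True, False, False, True, True]},
                  index=[10, 20, 30, 40, 50])

df_reset = df.reset_index()          # old labels move to a column named 'index'
mask = df_reset['BoolCol']

positions = df_reset[mask].index.tolist()               # 0-based row positions
original_labels = df_reset.loc[mask, 'index'].tolist()  # labels before the reset
```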

Host answered 21/6, 2019 at 4:56 Comment(0)
2

Another method is to use pipe() to index the index of BoolCol by BoolCol itself. In terms of performance, it's as efficient as the canonical indexing using [] (see note 1 below).

df['BoolCol'].pipe(lambda x: x.index[x])

This is especially useful if BoolCol is actually the result of multiple comparisons and you want to use method chaining to put all methods in a pipeline.

For example, suppose you want the row indexes where the NumCol value is greater than 0.5, the BoolCol value is True, and the product of the NumCol and BoolCol values is greater than 0. You can evaluate the expression via eval() and call pipe() on the result to index the index (see note 2 below).

df.eval("NumCol > 0.5 and BoolCol and NumCol * BoolCol >0").pipe(lambda x: x.index[x])
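
As a small runnable sketch of the same pattern, here is a simplified two-condition expression on made-up data, compared against the equivalent bitwise mask:

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({'NumCol': [0.1, 0.9, 0.7, 0.2],
                   'BoolCol': [True, True, False, False]},
                  index=list('abcd'))

# eval() returns a Boolean Series; pipe() then indexes that Series' own index.
via_eval = df.eval("NumCol > 0.5 and BoolCol").pipe(lambda x: x.index[x])

# The same mask written with bitwise operators, for comparison.
via_mask = df.index[(df['NumCol'] > 0.5) & df['BoolCol']]
```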


1: The following benchmark used a dataframe with 20 million rows (about half of which pass the filter on average) and retrieved their indexes. Method chaining via pipe() does very well compared to the other efficient options.

n = 20_000_000
df = pd.DataFrame({'NumCol': np.random.rand(n).astype('float16'), 
                   'BoolCol': np.random.default_rng().choice([True, False], size=n)})

%timeit df.index[df['BoolCol']]
# 181 ms ± 2.47 ms per loop (mean ± std. dev. of 10 runs, 1000 loops each)

%timeit df['BoolCol'].pipe(lambda x: x.index[x])
# 181 ms ± 1.08 ms per loop (mean ± std. dev. of 10 runs, 1000 loops each)

%timeit df['BoolCol'].loc[lambda x: x].index
# 297 ms ± 7.15 ms per loop (mean ± std. dev. of 10 runs, 1000 loops each)

2: For a 20 million row dataframe constructed the same way as in note 1, you will find that the method proposed here is the fastest option. It performs better than bitwise-operator chaining because, by design, eval() performs multiple operations on a large dataframe faster than vectorized Python operations. It is also more memory efficient than query() because, unlike query(), eval().pipe(...) doesn't need to create a copy of the sliced dataframe to get its index.

Ultramicroscopic answered 10/9, 2022 at 16:33 Comment(0)
1

I extended this question to: how do you get the row, column, and value of every matching cell?

Here is a solution:

import pandas as pd
import numpy as np


def search_coordinate(df_data: pd.DataFrame, search_set: set) -> list:
    nda_values = df_data.values
    tuple_index = np.where(np.isin(nda_values, list(search_set)))
    return [(row, col, nda_values[row][col]) for row, col in zip(tuple_index[0], tuple_index[1])]


if __name__ == '__main__':
    test_datas = [['cat', 'dog', ''],
                  ['goldfish', '', 'kitten'],
                  ['Puppy', 'hamster', 'mouse']
                  ]
    df_data = pd.DataFrame(test_datas)
    print(df_data)
    result_list = search_coordinate(df_data, {'dog', 'Puppy'})
    print(f"\n\n{'row':<4} {'col':<4} {'name':>10}")
    for row, col, name in result_list:
        print(f"{row:<4} {col:<4} {name:>10}")

Output:

          0        1       2
0       cat      dog        
1  goldfish           kitten
2     Puppy  hamster   mouse


row  col        name
0    1           dog
2    0         Puppy
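
The same lookup can be sketched without calling NumPy explicitly, by building a Boolean DataFrame with isin. This is a plain-loop illustration rather than a tuned implementation, and it returns row and column labels instead of positions:

```python
import pandas as pd

df_data = pd.DataFrame([['cat', 'dog', ''],
                        ['goldfish', '', 'kitten'],
                        ['Puppy', 'hamster', 'mouse']])

# Boolean DataFrame: True wherever a cell equals one of the search values.
mask = df_data.isin(['dog', 'Puppy'])

# Walk the mask row by row and collect (row label, column label, value).
hits = [(row, col, df_data.at[row, col])
        for row in mask.index
        for col in mask.columns
        if mask.at[row, col]]
```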
Vicentevicepresident answered 12/9, 2019 at 3:35 Comment(1)
Without explicitly using numpy, by using a boolean dataframe: thispointer.com/… (Powder)
1

If you already know which index candidates you are interested in, a faster way that avoids checking the whole column looks like this:

np.array(index_slice)[np.where(df.loc[index_slice]['column_name'] >= threshold)[0]]

Full comparison:

import pandas as pd
import numpy as np

index_slice = list(range(50, 150))  # known index locations of interest
data = np.zeros(10000)
data[(index_slice)] = np.random.random(len(index_slice))

df = pd.DataFrame(
    {'column_name': data},
)

threshold = 0.5

%%timeit
np.array(index_slice)[np.where(df.loc[index_slice]['column_name'] >= threshold)[0]]
# 600 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
[i for i in index_slice if i in df.index[df['column_name'] >= threshold].tolist()]
# 22.5 ms ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The way it works is like this:

# build a Boolean mask for the condition, only over the sliced rows
df.loc[index_slice]['column_name'] >= threshold

# convert the Boolean mask to positions (0-based, relative to the slice)
np.where(...)[0]

# map those positions back to the original index labels
np.array(index_slice)[...]

Note: np.array(index_slice) can't be replaced by df.index, because np.where(...)[0] returns positions that start at 0 and increase by 1 relative to the slice. You could use something like df.index[index_slice] instead. I think this is not worth the hassle if you only do it once with a small number of rows.
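
A possible simplification, offered as a sketch rather than a benchmarked alternative: keeping the sliced column as a Series lets its own index map back to the labels, so no position arithmetic is needed. The data and values below are made up.

```python
import pandas as pd

# Made-up data with a non-default index.
df = pd.DataFrame({'column_name': [0.9, 0.1, 0.8, 0.3, 0.7, 0.6]},
                  index=[100, 101, 102, 103, 104, 105])
index_slice = [102, 103, 104]   # candidate labels we care about
threshold = 0.5

# Select only the candidate rows, then mask the slice's own index.
sliced = df.loc[index_slice, 'column_name']
matches = sliced.index[sliced >= threshold].tolist()
```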

Gunlock answered 22/11, 2021 at 10:55 Comment(0)
