Select non-null rows from a specific column in a DataFrame and take a sub-selection of other columns

I have a DataFrame with several columns, so I chose some of its columns to create a variable like this:

xtrain = df[['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']]

I want to drop from this selection all rows where the Survive column in the main DataFrame is NaN.

Dower answered 27/12, 2016 at 0:6 Comment(1)

You can pass a boolean mask to your df based on notnull() of the 'Survive' column and select the columns of interest:

In [2]:
# make some example data
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 7),
                  columns=['Survive', 'Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title'])
df.loc[2, 'Survive'] = np.nan  # set one value to NaN; .loc avoids chained assignment, np.nan replaces the removed np.NaN
df
Out[2]:
    Survive       Age      Fare  Group_Size      deck    Pclass     Title
0  1.174206 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  0.036843  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
2       NaN -0.132394 -0.236904   -0.324087  0.570660  0.758084 -0.176421
3 -2.145934 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.197144 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482

Now pass the mask to loc to take only the non-NaN rows:

In [3]:
xtrain = df.loc[df['Survive'].notnull(), ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
xtrain

Out[3]:
        Age      Fare  Group_Size      deck    Pclass     Title
0 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
3 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482
Dissatisfied answered 27/12, 2016 at 0:9 Comment(2)
Just wondering why the 'Survive' column is missing entirely from the output? The question asks for dropping all rows that have NaNs, not for dropping entire columns that may have one or more NaNs.Drip
@Drip the original question asked for an output with these columns ['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title'] (the OP explained the "Survive" column was in the original data but was not requested in the output). The "Survive" column is not included in the output because it is not in the column indexer list in the .loc call, i.e. df.loc[row_indexer, column_indexer]. See pandas.pydata.org/pandas-docs/stable/user_guide/… for a complete explanation.Muntjac
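For illustration, a minimal sketch reusing the example df above: adding 'Survive' to the column indexer keeps it in the result.

# include 'Survive' in the column indexer to keep it in the output
df.loc[df['Survive'].notnull(),
       ['Survive', 'Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']]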

Two alternatives because... well, why not?
Both drop the NaN rows prior to column slicing. That's two calls rather than EdChum's one.

one

df.dropna(subset=['Survive'])[
    ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]

two

df.query('Survive == Survive')[
    ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
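The query version relies on the fact that NaN never compares equal to itself, so the condition Survive == Survive is True only for non-missing values. A quick sanity check (just an illustration, not part of the answer's original code):

import numpy as np

# NaN is not equal to itself, so 'Survive == Survive' is False
# exactly on the rows where 'Survive' is NaN
np.nan == np.nan  # False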
Franciskus answered 27/12, 2016 at 0:27 Comment(1)
df.dropna(subset=['Survive'])[['Survive', 'Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']] will retain the 'Survive' column too.Drip

It might be more readable if you assign the subset of columns to a variable and then filter:

notna_msk = df['Survive'].notna()
cols = ['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title', 'Survive']
new_df = df.loc[notna_msk, cols]

Also, if you have already created xtrain from df as in the OP, you can still filter that DataFrame with the mask, even though it doesn't have the Survive column; the shared index is enough:

new_df = xtrain.loc[df['Survive'].notna()]
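This works because boolean indexing with .loc aligns on index labels, and xtrain keeps df's index. A minimal sketch of the idea (the assert is purely illustrative):

mask = df['Survive'].notna()          # boolean Series indexed like df
assert xtrain.index.equals(df.index)  # xtrain kept df's index labels
new_df = xtrain.loc[mask]             # rows matched by index label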
Oleaginous answered 14/2, 2023 at 22:34 Comment(0)
