axis = 0 seems to behave differently in sum() and dropna()

Asked 31/3, 2018 at 11:23 Answered 1/11, 2024 at 13:32

From reading the pandas documentation, and a good question and answer (What does axis in pandas mean?), I had expected axis=0 to always mean with respect to columns. This works for me when I work with sum(), but works the other way around when I use the dropna() call.

When I have a dataframe like this:

import numpy as np
import pandas as pd

raw_data = {'column1': [42, 13, np.nan, np.nan],
            'column2': [4, 12, np.nan, np.nan],
            'column3': [25, 61, np.nan, np.nan]}

Which looks like this:

   column1  column2  column3
0     42.0      4.0     25.0
1     13.0     12.0     61.0
2      NaN      NaN      NaN
3      NaN      NaN      NaN

With axis=0 I can print the sum of each column. This:

df = pd.DataFrame(raw_data)
print(df.sum(axis=0))

Gives the output:

column1    55.0
column2    16.0
column3    86.0

When I try to drop values from the dataframe with axis=0, I expect this to again be with respect to columns*. But when I do:

dfclear = df.dropna(axis=0, how='all')
print(dfclear)

I get the output:

   column1  column2  column3
0     42.0      4.0     25.0
1     13.0     12.0     61.0

Where I had expected the following (which I get with axis=1):

   column1  column2  column3
0     42.0      4.0     25.0
1     13.0     12.0     61.0
2      NaN      NaN      NaN
3      NaN      NaN      NaN

So it seems to me that axis behaves differently between sum() and dropna().

Is there something I'm missing here?

*https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

Sleeper answered 31/3, 2018 at 11:23 Comment(1)

I never got to an understanding of this. And as I read the answers, they don't seem to address why the axis argument behaves differently between the two. It is completely possible that I have just overlooked something. – Sleeper 2/4, 2018 at 10:53

From the docstring:

In [41]: df.dropna?
Signature: df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Docstring:
Return object with labels on given axis omitted where alternately any
or all of the data are missing

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, or tuple/list thereof
    Pass tuple or list to drop on multiple axes
...

If you are not sure what a numeric axis means, pass the axis by name instead:

In [39]: df.dropna(axis='index', how='all')
Out[39]:
   column1  column2  column3
0     42.0      4.0     25.0
1     13.0     12.0     61.0

In [40]: df.dropna(axis='columns', how='all')
Out[40]:
   column1  column2  column3
0     42.0      4.0     25.0
1     13.0     12.0     61.0
2      NaN      NaN      NaN
3      NaN      NaN      NaN
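The named form also works for sum, which makes it easy to confirm that the names and numbers refer to the same axis in both methods. A minimal sketch with the question's data follows; the variable names are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': [42, 13, np.nan, np.nan],
                   'column2': [4, 12, np.nan, np.nan],
                   'column3': [25, 61, np.nan, np.nan]})

# 'index' is an alias for 0 and 'columns' is an alias for 1 in both methods.
same_sum = df.sum(axis='index').equals(df.sum(axis=0))
same_drop = df.dropna(axis='columns', how='all').equals(df.dropna(axis=1, how='all'))

sums = df.sum(axis='index')                  # NaN is skipped by default
kept = df.dropna(axis='index', how='all')    # rows that are entirely NaN are removed

print(same_sum, same_drop)
print(sums)
print(kept)
```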

Luddite answered 31/3, 2018 at 11:31 Comment(3)

In the pandas docs, it says for dropna: axis : {0 or ‘index’, 1 or ‘columns’} and for sum: axis : {index (0), columns (1)}. So it should be the same for both. Though in my example they behave opposite to each other, as far as I can see. – Sleeper 31/3, 2018 at 11:38

@Simon, it looks correct to me: Return object with labels on given axis omitted – Luddite 31/3, 2018 at 11:41

Okay. But did you see the part in my question with sum? That returns results for each column, not for each row, and that's with axis = 0 – Sleeper 31/3, 2018 at 14:23

I think the answer is correct:

print(df)

produces the output below:

   column1  column2  column3
0     42.0      4.0     25.0
1     13.0     12.0     61.0
2      NaN      NaN      NaN
3      NaN      NaN      NaN

dfclear = df.dropna(axis=0, how='all')
print(dfclear)

produces the output below:

   column1  column2  column3
0     42.0      4.0     25.0
1     13.0     12.0     61.0

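From the pandas documentation's sample explanation:

Drop the rows where all of the elements are nan (there is no row to drop, so df stays the same)

That parenthetical describes the documentation's own example. In the dataframe above, rows 2 and 3 are all NaN, so they are dropped.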
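To see exactly which labels each axis value removes, here is a minimal sketch using the question's data. The names dropped_rows and dropped_cols are mine, chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': [42, 13, np.nan, np.nan],
                   'column2': [4, 12, np.nan, np.nan],
                   'column3': [25, 61, np.nan, np.nan]})

# Labels present in df but missing after dropna are the ones that were dropped.
dropped_rows = df.index.difference(df.dropna(axis=0, how='all').index)
dropped_cols = df.columns.difference(df.dropna(axis=1, how='all').columns)

print(list(dropped_rows))  # row labels removed with axis=0
print(list(dropped_cols))  # column labels removed with axis=1
```

With this data, axis=0 removes row labels 2 and 3, and axis=1 removes nothing because no column is entirely NaN.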

Cheesecloth answered 31/3, 2018 at 11:31 Comment(0)

Mind you, pandas shift also has a counterintuitive axis meaning, where 0 means by row and 1 means by column.

I guess they need to address these and other similar points in their documentation somewhere.
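A quick way to check the shift behaviour mentioned above is a two-by-two frame. This is only a sketch; the frame small and its values are made up for illustration.

```python
import pandas as pd

small = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# axis=0: each value moves one position along the index (down one row).
down = small.shift(1, axis=0)

# axis=1: each value moves one position along the columns (right one column).
right = small.shift(1, axis=1)

print(down)
print(right)
```

With axis=0 the first row becomes NaN and the old first row lands in the second row; with axis=1 column a becomes NaN and column b holds the old values of a.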

Gertrudgertruda answered 4/11, 2019 at 6:26 Comment(0)

pandas.DataFrame.sum follows the numpy convention. In numpy.ndarray, axis represents the direction along which the data is squished. It works for any number of dimensions. Refer to this answer.
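The numpy convention can be sketched with the two non-NaN rows from the question. The array arr is my own construction, not something from the original post.

```python
import numpy as np

arr = np.array([[42, 4, 25],
                [13, 12, 61]])   # shape (2, 3)

# axis=0 is squished away: the 2 rows collapse, leaving one value per column.
col_sums = arr.sum(axis=0)       # shape (3,)

# axis=1 is squished away: the 3 columns collapse, leaving one value per row.
row_sums = arr.sum(axis=1)       # shape (2,)

print(col_sums, row_sums)
```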

But pandas.DataFrame can only be 2D. And pandas lost the axis convention when implementing its pandas.DataFrame.dropna function. It is an inconsistency for which there is no explanation. Refer to this answer.
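For anyone who wants to compare the two calls directly, a small check (not an official explanation) is to look at the shapes before and after each one with axis=0. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': [42, 13, np.nan, np.nan],
                   'column2': [4, 12, np.nan, np.nan],
                   'column3': [25, 61, np.nan, np.nan]})

shape_before = df.shape                                  # 4 rows, 3 columns
shape_after_sum = df.sum(axis=0).shape                   # one entry per column
shape_after_drop = df.dropna(axis=0, how='all').shape    # all-NaN rows removed

print(shape_before, shape_after_sum, shape_after_drop)
```

Here sum(axis=0) removes the length-4 dimension entirely, while dropna(axis=0) shortens that same dimension from 4 to 2 and leaves the 3 columns in place.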

Barrybarrymore answered 1/11, 2024 at 13:32 Comment(0)
