Naturally sorting Pandas DataFrame
Asked Answered
T

3

39

I have a pandas DataFrame with indices I want to sort naturally. Natsort doesn't seem to work. Sorting the indices prior to building the DataFrame doesn't seem to help because the manipulations I do to the DataFrame seem to mess up the sorting in the process. Any thoughts on how I can resort the indices naturally?

from natsort import natsorted
import pandas as pd

# An unsorted list of strings
a = ['0hr', '128hr', '72hr', '48hr', '96hr']
# Sorted incorrectly
b = sorted(a)
# Naturally Sorted 
c = natsorted(a)

# Use a as the index for a DataFrame
df = pd.DataFrame(index=a)
# Sorted Incorrectly
df2 = df.sort()
# Natsort doesn't seem to work
df3 = natsorted(df)

print(a)
print(b)
print(c)
print(df.index)
print(df2.index)
print(df3.index)
Thermopylae answered 11/4, 2015 at 17:40 Comment(5)
@sethMMorton I supposed I would expect df3.index to be the same as c while sorting the data to keep it inline with its index valuesThermopylae
It would be nice if pd.sort had a key option, but it does not. This answer provides a workaround that would let you pass a key generated from natsort_keygen.Germicide
I just made an official request to the pandas devs to add key to the sort methods here: github.com/pydata/pandas/issues/9855Germicide
My above issue was a dupe, the active issue is github.com/pydata/pandas/issues/3942Germicide
Now that pandas has a key argument to sort_values, https://mcmap.net/q/399399/-naturally-sorting-pandas-dataframe should now be the accepted answer.Germicide
P
15

If you want to sort the df, just sort the index or the data and assign directly to the index of the df rather than trying to pass the df as an arg as that yields an empty list:

In [7]:

df.index = natsorted(a)
df.index
Out[7]:
Index(['0hr', '48hr', '72hr', '96hr', '128hr'], dtype='object')

Note that df.index = natsorted(df.index) also works

if you pass the df as an arg it yields an empty list, in this case because the df is empty (has no columns), otherwise it will return the columns sorted which is not what you want:

In [10]:

natsorted(df)
Out[10]:
[]

EDIT

If you want to sort the index so that the data is reordered along with the index then use reindex:

In [13]:

df=pd.DataFrame(index=a, data=np.arange(5))
df
Out[13]:
       0
0hr    0
128hr  1
72hr   2
48hr   3
96hr   4
In [14]:

df = df*2
df
Out[14]:
       0
0hr    0
128hr  2
72hr   4
48hr   6
96hr   8
In [15]:

df.reindex(index=natsorted(df.index))
Out[15]:
       0
0hr    0
48hr   6
72hr   4
96hr   8
128hr  2

Note that you have to assign the result of reindex to either a new df or to itself, it does not accept the inplace param.

Praise answered 11/4, 2015 at 17:43 Comment(13)
Hi, natsort developer here. natsort currently does not have any explicit support for handling entire dataframe objects. What would your expected output be for passing a dataframe object?Germicide
I believe this misses the point. I realize I can naturally sort the a and used that as the index, but my actual code messes up the sorting of the data frame index because of the manipulations I perform on the dataframe. I need to resort the index and associated data while it is in the dataframe.Thermopylae
So what is it you're asking here, you want to natsort the index after data manipulations? You can use reindex and call natsorted on the index df.reindex(index=natsorted(df.index))Praise
@Praise Yes, it sounds exactly like that is what they want. I think ultimately that is the correct answer.Germicide
@Praise This appears to work but only if I put it in a few dataframe df2 = df.reindex(index=natsorted(df.index)) performing the operation in place as you show doesn't seem to work.Thermopylae
@Thermopylae Use inplace=TrueGermicide
@Praise BTW, on my machine natsorted(df) returns the columns of the dataframe sorted, which is the same behavior of sorted(df).Germicide
@Germicide sorry reindex is one of the few functions that does not accept param inplace, so yes you have to assign it to itselfPraise
@Praise That seems like an oversight... sort of like omitting key from sort!Germicide
@Germicide ah that output was based on an empty dataframe, the iterable returned from a df is the columns and yes I get the same result when my df has columnsPraise
@Germicide I'm not a pandas dev so I don't know why it's not in therePraise
@Praise it's not sorted(df) I'm trying to replicate but df.sort(). I believe your reindex trick is a good work around to no key value available for df.sort()Thermopylae
@Thermopylae I think Seth was interested as to why I got an empty list, I was just confirming that the behaviour is consistent with sorted(df) so unrelated to your questionPraise
I
57

Using sort_values for pandas >= 1.1.0

With the new key argument in DataFrame.sort_values, since pandas 1.1.0, we can directly sort a column without setting it as an index using natsort.natsort_keygen:

df = pd.DataFrame({
    "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
    "value": [10, 20, 30, 40, 50]
})

    time  value
0    0hr     10
1  128hr     20
2   72hr     30
3   48hr     40
4   96hr     50
from natsort import natsort_keygen

df.sort_values(
    by="time",
    key=natsort_keygen()
)

    time  value
0    0hr     10
3   48hr     40
2   72hr     30
4   96hr     50
1  128hr     20
Indehiscent answered 14/9, 2020 at 19:47 Comment(5)
This proposed solution is a bit of a "maximum effort" solution - would not key=natsort_keygen() be less effort?Germicide
Agree, updated my answer accordingly. Thanks for the heads up and beautiful package you wrote :) @GermicideIndehiscent
If I try to sort 2 columns of different type, like this df.sort_values(['Title', 'Copies'], ascending=[False, True], key=natsort_keygen()), I get this error ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I got the dataframe from pd.read_csv, providing the column names and types. Any idea how to fix it?Tao
@Agostino, natsort_key should work, also PyICU is another option for natural sorting.Bedlamite
@Agostino, a belated response, but with current versions of natsort, it would be from natsort import natsort_key and df.sort_values(['Title', 'Copies'], ascending=[False, True], key=natsort_key())Bedlamite
G
36

Now that pandas has support for key in both sort_values and sort_index you should now refer to this other answer and send all upvotes there as it is now the correct answer.

I will leave my answer here for people stuck on old pandas versions, or as a historical curiosity.


The accepted answer answers the question being asked. I'd like to also add how to use natsort on columns in a DataFrame, since that will be the next question asked.

In [1]: from pandas import DataFrame

In [2]: from natsort import natsorted, index_natsorted, order_by_index

In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

In [4]: df
Out[4]: 
         a   b
0hr     a5  b1
128hr   a1  b1
72hr   a10  b2
48hr    a2  b2
96hr   a12  b1

As the accepted answer shows, sorting by the index is fairly straightforward:

In [5]: df.reindex(index=natsorted(df.index))
Out[5]: 
         a   b
0hr     a5  b1
48hr    a2  b2
72hr   a10  b2
96hr   a12  b1
128hr   a1  b1

If you want to sort on a column in the same manner, you need to sort the index by the order that the desired column was reordered. natsort provides the convenience functions index_natsorted and order_by_index to do just that.

In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a)))
Out[6]: 
         a   b
128hr   a1  b1
48hr    a2  b2
0hr     a5  b1
72hr   a10  b2
96hr   a12  b1

In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b)))
Out[7]: 
         a   b
0hr     a5  b1
128hr   a1  b1
96hr   a12  b1
72hr   a10  b2
48hr    a2  b2

If you want to reorder by an arbitrary number of columns (or a column and the index), you can use zip (or itertools.izip on Python2) to specify sorting on multiple columns. The first column given will be the primary sorting column, then secondary, then tertiary, etc...

In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a))))
Out[8]: 
         a   b
128hr   a1  b1
0hr     a5  b1
96hr   a12  b1
48hr    a2  b2
72hr   a10  b2

In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index))))
Out[9]: 
         a   b
0hr     a5  b1
96hr   a12  b1
128hr   a1  b1
48hr    a2  b2
72hr   a10  b2

Here is an alternate method using Categorical objects that I have been told by the pandas devs is the "proper" way to do this. This requires (as far as I can see) pandas >= 0.16.0. Currently, it only works on columns, but apparently in pandas >= 0.17.0 they will add CategoricalIndex which will allow this method to be used on an index.

In [1]: from pandas import DataFrame

In [2]: from natsort import natsorted

In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

In [4]: df.a = df.a.astype('category')

In [5]: df.a.cat.reorder_categories(natsorted(df.a), inplace=True, ordered=True)

In [6]: df.b = df.b.astype('category')

In [8]: df.b.cat.reorder_categories(natsorted(set(df.b)), inplace=True, ordered=True)

In [9]: df.sort('a')
Out[9]: 
         a   b
128hr   a1  b1
48hr    a2  b2
0hr     a5  b1
72hr   a10  b2
96hr   a12  b1

In [10]: df.sort('b')
Out[10]: 
         a   b
0hr     a5  b1
128hr   a1  b1
96hr   a12  b1
72hr   a10  b2
48hr    a2  b2

In [11]: df.sort(['b', 'a'])
Out[11]: 
         a   b
128hr   a1  b1
0hr     a5  b1
96hr   a12  b1
48hr    a2  b2
72hr   a10  b2

The Categorical object lets you define a sorting order for the DataFrame to use. The elements given when calling reorder_categories must be unique, hence the call to set for column "b".

I leave it to the user to decide if this is better than the reindex method or not, since it requires you to sort the column data independently before sorting within the DataFrame (although I imagine that second sort is rather efficient).


Full disclosure, I am the natsort author.

Germicide answered 11/4, 2015 at 20:28 Comment(0)
P
15

If you want to sort the df, just sort the index or the data and assign directly to the index of the df rather than trying to pass the df as an arg as that yields an empty list:

In [7]:

df.index = natsorted(a)
df.index
Out[7]:
Index(['0hr', '48hr', '72hr', '96hr', '128hr'], dtype='object')

Note that df.index = natsorted(df.index) also works

if you pass the df as an arg it yields an empty list, in this case because the df is empty (has no columns), otherwise it will return the columns sorted which is not what you want:

In [10]:

natsorted(df)
Out[10]:
[]

EDIT

If you want to sort the index so that the data is reordered along with the index then use reindex:

In [13]:

df=pd.DataFrame(index=a, data=np.arange(5))
df
Out[13]:
       0
0hr    0
128hr  1
72hr   2
48hr   3
96hr   4
In [14]:

df = df*2
df
Out[14]:
       0
0hr    0
128hr  2
72hr   4
48hr   6
96hr   8
In [15]:

df.reindex(index=natsorted(df.index))
Out[15]:
       0
0hr    0
48hr   6
72hr   4
96hr   8
128hr  2

Note that you have to assign the result of reindex to either a new df or to itself, it does not accept the inplace param.

Praise answered 11/4, 2015 at 17:43 Comment(13)
Hi, natsort developer here. natsort currently does not have any explicit support for handling entire dataframe objects. What would your expected output be for passing a dataframe object?Germicide
I believe this misses the point. I realize I can naturally sort the a and used that as the index, but my actual code messes up the sorting of the data frame index because of the manipulations I perform on the dataframe. I need to resort the index and associated data while it is in the dataframe.Thermopylae
So what is it you're asking here, you want to natsort the index after data manipulations? You can use reindex and call natsorted on the index df.reindex(index=natsorted(df.index))Praise
@Praise Yes, it sounds exactly like that is what they want. I think ultimately that is the correct answer.Germicide
@Praise This appears to work but only if I put it in a few dataframe df2 = df.reindex(index=natsorted(df.index)) performing the operation in place as you show doesn't seem to work.Thermopylae
@Thermopylae Use inplace=TrueGermicide
@Praise BTW, on my machine natsorted(df) returns the columns of the dataframe sorted, which is the same behavior of sorted(df).Germicide
@Germicide sorry reindex is one of the few functions that does not accept param inplace, so yes you have to assign it to itselfPraise
@Praise That seems like an oversight... sort of like omitting key from sort!Germicide
@Germicide ah that output was based on an empty dataframe, the iterable returned from a df is the columns and yes I get the same result when my df has columnsPraise
@Germicide I'm not a pandas dev so I don't know why it's not in therePraise
@Praise it's not sorted(df) I'm trying to replicate but df.sort(). I believe your reindex trick is a good work around to no key value available for df.sort()Thermopylae
@Thermopylae I think Seth was interested as to why I got an empty list, I was just confirming that the behaviour is consistent with sorted(df) so unrelated to your questionPraise

© 2022 - 2024 — McMap. All rights reserved.