Asked 12/12, 2012 at 11:9 Answered 3/6, 2021 at 22:27

157

I have python pandas dataframe, in which a column contains month name.

How can I do a custom sort using a dictionary, for example:

custom_dict = {'March':0, 'April':1, 'Dec':3}

Makeweight answered 12/12, 2012 at 11:9 Comment(3)

Does a columns contain month name mean that there is a column which contains month names (as my answer), or many columns with column names as month names (as eumiro's)? – Regain 12/12, 2012 at 11:51

The accepted answer is outdated, and is also technically incorrect, as pd.Categorical does not interpret the categories as ordered by default. See this answer. – President 22/1, 2019 at 4:34

@AndyHayden: clearly: "one single column, which contains month names (as string or categorical)" (your interpretation, not eumiro's) – Nivernais 25/3 at 9:39

244

Pandas 0.15 introduced Categorical Series, which allows a much clearer way to do this:

First make the month column a categorical and specify the ordering to use.

In [21]: df['m'] = pd.Categorical(df['m'], ["March", "April", "Dec"])

In [22]: df  # looks the same!
Out[22]:
   a  b      m
0  1  2  March
1  5  6    Dec
2  3  4  April

Now, when you sort the month column it will sort with respect to that list:

In [23]: df.sort_values("m")
Out[23]:
   a  b      m
0  1  2  March
2  3  4  April
1  5  6    Dec

Note: if a value is not in the list it will be converted to NaN.

An older answer for those interested...

You could create an intermediary series, and set_index on that:

df = pd.DataFrame([[1, 2, 'March'],[5, 6, 'Dec'],[3, 4, 'April']], columns=['a','b','m'])
s = df['m'].apply(lambda x: {'March':0, 'April':1, 'Dec':3}[x])
s.sort_values()

In [4]: df.set_index(s.index).sort()
Out[4]: 
   a  b      m
0  1  2  March
1  3  4  April
2  5  6    Dec

As commented, in newer pandas, Series has a replace method to do this more elegantly:

s = df['m'].replace({'March':0, 'April':1, 'Dec':3})

The slight difference is that this won't raise if there is a value outside of the dictionary (it'll just stay the same).

Regain answered 12/12, 2012 at 11:44 Comment(12)

s = df['m'].replace({'March':0, 'April':1, 'Dec':3}) works for line 2 as well -- just for the sake of anyone learning pandas like me – Synergy 1/10, 2014 at 0:11

@Synergy good spot! (been a while since I wrote this!) replace definitely best option, another is to use .apply({'March':0, 'April':1, 'Dec':3}.get) :) In 0.15 we'll have Categorical Series/columns, so the best way will be to use that and then sort will just work. – Regain 1/10, 2014 at 19:9

@AndyHayden I've taken the liberty of replacing the second line with the 'replace' method. I hope that is Ok. – Swindell 1/2, 2015 at 13:55

@AndyHayden edit rejected, but I still think it is a reasonable change. – Swindell 1/2, 2015 at 16:34

@FaheemMitha I missed it, thanks for the edit! I've appended your comment and also updated with the newer categorical method - which I think is a little neater :). – Regain 3/2, 2015 at 4:54

@AndyHayden the categorical method is returning the SettingWithCopyWarning in pandas 0.18.1 is there a new idiom to follow? – Ferrara 17/5, 2016 at 13:39

@toasteez I can't reproduce that in 0.18.1 myself, what are you doing specifically to get that? (There's a warning on sort; new api is sort_values). – Regain 17/5, 2016 at 22:28

@AndyHayden I've logged a bug my code example Bug logged #13204 – Ferrara 17/5, 2016 at 22:36

Just make sure you use df.sort_values("m") in newer pandas (instead of df.sort("m")), otherwise you'll get a AttributeError: 'DataFrame' object has no attribute 'sort' ;) – Filigreed 12/6, 2018 at 0:10

The first solution does not really solve the problem of custom sorting without typing in the explicit order, or am I missing something? For example, how would I sort the m column by length of the month's name using pd.Categorical? – Zebedee 24/6, 2018 at 10:48

'Unknown' categories and None is not supported as categorical series. – Dizon 2/6, 2021 at 19:43

Note that this would break aggregation because of issue <github.com/pandas-dev/pandas/issues/13204>. – Ballman 19/7, 2022 at 8:7

pandas >= 1.1

You will soon be able to use sort_values with key argument:

pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'

custom_dict = {'March': 0, 'April': 1, 'Dec': 3} 
df

   a  b      m
0  1  2  March
1  5  6    Dec
2  3  4  April

df.sort_values(by=['m'], key=lambda x: x.map(custom_dict))

   a  b      m
0  1  2  March
2  3  4  April
1  5  6    Dec

The key argument takes as input a Series and returns a Series. This series is internally argsorted and the sorted indices are used to reorder the input DataFrame. If there are multiple columns to sort on, the key function will be applied to each one in turn. See Sorting with keys.

pandas <= 1.0.X

One simple method is using the output Series.map and Series.argsort to index into df using DataFrame.iloc (since argsort produces sorted integer positions); since you have a dictionary; this becomes easy.

df.iloc[df['m'].map(custom_dict).argsort()]

   a  b      m
0  1  2  March
2  3  4  April
1  5  6    Dec

If you need to sort in descending order, invert the mapping.

df.iloc[(-df['m'].map(custom_dict)).argsort()]

   a  b      m
1  5  6    Dec
2  3  4  April
0  1  2  March

Note that this only works on numeric items. Otherwise, you will need to workaround this using sort_values, and accessing the index:

df.loc[df['m'].map(custom_dict).sort_values(ascending=False).index]

   a  b      m
1  5  6    Dec
2  3  4  April
0  1  2  March

More options are available with astype (this is deprecated now), or pd.Categorical, but you need to specify ordered=True for it to work correctly.

# Older version,
# df['m'].astype('category', 
#                categories=sorted(custom_dict, key=custom_dict.get), 
#                ordered=True)
df['m'] = pd.Categorical(df['m'], 
                         categories=sorted(custom_dict, key=custom_dict.get), 
                         ordered=True)

Now, a simple sort_values call will do the trick:

df.sort_values('m')
 
   a  b      m
0  1  2  March
2  3  4  April
1  5  6    Dec

The categorical ordering will also be honoured when groupby sorts the output.

President answered 22/1, 2019 at 4:19 Comment(2)

You've already emphasized it, but I'd like to reiterate in case someone else skims and misses it: Pandas Categorical sets ordered=None by default. If not set, the ordering will be wrong, or break on V23. Max function in particular gives a TypeError (Categorical is not ordered for operation max). – Fellowman 22/10, 2019 at 23:1

+1 for the sort_values() with key and map. That is much easier than using Categoricals. And it works with Index too. – Flagitious 18/11, 2022 at 17:26

Update

use the selected answer! it's newer than this post and is not only the official way to maintain ordered data in pandas, it's better in every respect, including features/performance, etc. Don't use my hacky method I describe below.

I'm only writing this update because people keep upvoting my answer, but it's definitely worse than the accepted one :)

Original post

A bit late to the game, but here's a way to create a function that sorts pandas Series, DataFrame, and multiindex DataFrame objects using arbitrary functions.

I make use of the df.iloc[index] method, which references a row in a Series/DataFrame by position (compared to df.loc, which references by value). Using this, we just have to have a function that returns a series of positional arguments:

def sort_pd(key=None,reverse=False,cmp=None):
    def sorter(series):
        series_list = list(series)
        return [series_list.index(i) 
           for i in sorted(series_list,key=key,reverse=reverse,cmp=cmp)]
    return sorter

You can use this to create custom sorting functions. This works on the dataframe used in Andy Hayden's answer:

df = pd.DataFrame([
    [1, 2, 'March'],
    [5, 6, 'Dec'],
    [3, 4, 'April']], 
  columns=['a','b','m'])

custom_dict = {'March':0, 'April':1, 'Dec':3}
sort_by_custom_dict = sort_pd(key=custom_dict.get)

In [6]: df.iloc[sort_by_custom_dict(df['m'])]
Out[6]:
   a  b  m
0  1  2  March
2  3  4  April
1  5  6  Dec

This also works on multiindex DataFrames and Series objects:

months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

df = pd.DataFrame([
    ['New York','Mar',12714],
    ['New York','Apr',89238],
    ['Atlanta','Jan',8161],
    ['Atlanta','Sep',5885],
  ],columns=['location','month','sales']).set_index(['location','month'])

sort_by_month = sort_pd(key=months.index)

In [10]: df.iloc[sort_by_month(df.index.get_level_values('month'))]
Out[10]:
                 sales
location  month  
Atlanta   Jan    8161
New York  Mar    12714
          Apr    89238
Atlanta   Sep    5885

sort_by_last_digit = sort_pd(key=lambda x: x%10)

In [12]: pd.Series(list(df['sales'])).iloc[sort_by_last_digit(df['sales'])]
Out[12]:
2    8161
0   12714
3    5885
1   89238

To me this feels clean, but it uses python operations heavily rather than relying on optimized pandas operations. I haven't done any stress testing but I'd imagine this could get slow on very large DataFrames. Not sure how the performance compares to adding, sorting, then deleting a column. Any tips on speeding up the code would be appreciated!

Indiction answered 19/11, 2014 at 5:40 Comment(3)

Would this work for sorting multiple columns/indices? – Criticaster 28/5, 2017 at 8:12

yes, but the selected answer is a far better way to do this. If you have multiple indices, just arrange them according to the sort order you prefer, then use df.sort_index() to sort all index levels. – Indiction 18/8, 2017 at 20:13

Categorical approach has a lot of limitations. – Dizon 2/6, 2021 at 19:45

import pandas as pd
custom_dict = {'March':0,'April':1,'Dec':3}

df = pd.DataFrame(...) # with columns April, March, Dec (probably alphabetically)

df = pd.DataFrame(df, columns=sorted(custom_dict, key=custom_dict.get))

returns a DataFrame with columns March, April, Dec

Vibraharp answered 12/12, 2012 at 11:28 Comment(1)

This sorts the actual columns, rather than sorting rows based on the custom predicate on the column? – President 20/5, 2020 at 21:25

I had the same task but with an addition to sort on multiple columns.

One of the solutions is to make both columns pd.Categorical and pass the expected order as an argument "categories".

But I had some requirements where I cannot coerce unknown\unexpected values and unfortunately that is what pd.Categorical is doing. Also None is not supported as a category and coerced automatically.

So my solution was to use a key to sort on multiple columns with a custom sorting order:

import pandas as pd


df = pd.DataFrame([
    [A2, 2],
    [B1, 1],
    [A1, 2],
    [A2, 1],
    [B1, 2],
    [A1, 1]], 
  columns=['one','two'])


def custom_sorting(col: pd.Series) -> pd.Series:
    """Series is input and ordered series is expected as output"""
    to_ret = col
    # apply custom sorting only to column one:
    if col.name == "one":
        custom_dict = {}
        # for example ensure that A2 is first, pass items in sorted order here:
        def custom_sort(value):
            return (value[0], int(value[1:]))

        ordered_items = list(col.unique())
        ordered_items.sort(key=custom_sort)
        # apply custom order first:
        for index, item in enumerate(ordered_items):
            custom_dict[item] = index
        to_ret = col.map(custom_dict)
    # default text sorting is about to be applied
    return to_ret


# pass two columns to be sorted
df.sort_values(
    by=["two", "one"],
    ascending=True,
    inplace=True,
    key=custom_sorting,
)

print(df)

Output:

Be aware that this solution can be slow.

Dizon answered 3/6, 2021 at 22:27 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

pandas >= 1.1

pandas <= 1.0.X

Update

Original post

Recommended topics

Hot tags