quantile normalization on pandas dataframe
Asked Answered
M

9

19

Simply speaking, how to apply quantile normalization on a large Pandas dataframe (probably 2,000,000 rows) in Python?

PS. I know that there is a package named rpy2 which could run R in subprocess, using quantile normalize in R. But the truth is that R cannot compute the correct result when I use the data set as below:

5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6.881341586355676286e-06
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
2.845193046348194770e-05,1.538587781561563968e-05,2.944785276073619561e-05,4.194542899687506431e-05,6.013409904086112150e-05,1.032201237953351358e-05

Edit:

What I want:

Given the data shown above, how to apply quantile normalization following steps in https://en.wikipedia.org/wiki/Quantile_normalization.

I found a piece of code in Python declaring that it could compute the quantile normalization:

import rpy2.robjects as robjects
import numpy as np
from rpy2.robjects.packages import importr
preprocessCore = importr('preprocessCore')


matrix = [ [1,2,3,4,5], [1,3,5,7,9], [2,4,6,8,10] ]
v = robjects.FloatVector([ element for col in matrix for element in col ])
m = robjects.r['matrix'](v, ncol = len(matrix), byrow=False)
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array( Rnormalized_matrix)

The code works fine with the sample data used in the code, however when I test it with the data given above the result went wrong.

Since ryp2 provides an interface to run R in python subprocess, I test it again in R directly and the result was still wrong. As a result I think the reason is that the method in R is wrong.

Mongol answered 21/6, 2016 at 5:1 Comment(4)
I removed the "R" tag since you (1) aren't using R and (2) don't want R in the answer. But if you say "R cannot compute the correct result", it sounds like you are either disparaging R (to what end?) or want somebody to correct your unposted code. Either way, perhaps I'm misunderstanding what you want: quantile normalization needs a source and target distribution and I'm not certain which you're providing here. Can you clarify, please?Neurasthenia
@Neurasthenia Thanks for your comment and I already edited the question. FYI, the code I googled runs R as subprocess of Python. After run R directly I found that the result was wrong. Besides, I'm not sure about what do you mean by 'target distribution. According to the Wiki, the computation of quantile normalization doesn't involve that term. The question, hopefully I made it clear, is to apply quantile normalization on the data I gave.Mongol
You are right, my term of "target" isn't really good. The wiki references "making two distributions identical", so I was wondering what your two distributions were. Now that you provided additional code (and data, defined as matrix), I'm confused about which is your actual data to be quant-normed. (Perhaps a stupid question, but is it possible that the matrix is transposed compared with what you actually need?)Neurasthenia
@Neurasthenia I'm sorry for the confusion I caused. FYI, the actual data is a (2119055,124) matrix. Data I gave above is the tiny subset of it for testing. And yes, I did consider the question of transpose. As you could see, in the sample code, matrix is (3,5), but the normalized result is (5,3), therefore I summarized that to use this code I need to transpose the matrix first. To be more clear, my data is (4,6) and to use the code I will assign transposed data, i.e. (6,4) to variable matrix, and then continue.Mongol
M
9

Ok I implemented the method myself of relatively high efficiency.

After finishing, this logic seems kind of easy but, anyway, I decided to post it here for any one feels confused like I was when I couldn't googled the available code.

The code is in github: Quantile Normalize

Mongol answered 22/6, 2016 at 2:16 Comment(0)
N
22

Using the example dataset from Wikipedia article:

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

df
Out: 
   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

For each rank, the mean value can be calculated with the following:

rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()

rank_mean
Out: 
1    2.000000
2    3.000000
3    4.666667
4    5.666667
dtype: float64

Then the resulting Series, rank_mean, can be used as a mapping for the ranks to get the normalized results:

df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
Out: 
         C1        C2        C3
A  5.666667  4.666667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  4.666667  4.666667
D  4.666667  3.000000  5.666667
Nutrition answered 10/12, 2016 at 18:21 Comment(5)
elegant use of groupby, map, and stacking/unstacking. are you a pandas developer?Intine
Thanks. No, I am just a regular user.Nutrition
@ayhan Why did you do different ranking method in the first and second processing line, i.e. first vs min?Drambuie
Just pointing out (and self-promoting) that this does not generate the "correct" results according to Wikipedia. I implemented a fast method that does produce the correct results and is installable with conda or pip: https://mcmap.net/q/635196/-quantile-normalization-on-pandas-dataframeArgentum
This looks great! The only thing needed to fix the values with the same rank is to use the average, and interpolate the rank_mean in between. I added the adjustment in my post. https://mcmap.net/q/635196/-quantile-normalization-on-pandas-dataframeKnobkerrie
M
9

Ok I implemented the method myself of relatively high efficiency.

After finishing, this logic seems kind of easy but, anyway, I decided to post it here for any one feels confused like I was when I couldn't googled the available code.

The code is in github: Quantile Normalize

Mongol answered 22/6, 2016 at 2:16 Comment(0)
L
3

One thing worth noticing is that both ayhan and shawn's code use the smaller rank mean for ties, but if you use R package processcore's normalize.quantiles() , it would use the mean of rank means for ties.

Using the above example:

> df

   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

> normalize.quantiles(as.matrix(df))

         C1        C2        C3
A  5.666667  5.166667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  5.166667  4.666667
D  4.666667  3.000000  5.666667
Lefler answered 3/1, 2020 at 17:37 Comment(1)
Just here to say that I made a package/answer called qnorm for Python which does handle ties: https://mcmap.net/q/635196/-quantile-normalization-on-pandas-dataframeArgentum
K
2

This is a minor adjustment but I imagine that many have noticed the subtle 'flaw' in @ayhan 's answer.

I have made a small adjustment to it, which gets the 'correct' answer, while not having to resort to any external libraries for such an exceedingly simple function.

The only adjustment necessary is the [Add interpolated values] section.

import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

def quant_norm(df):
    ranks = (df.rank(method="first")
              .stack())
    rank_mean = (df.stack()
                   .groupby(ranks)
                   .mean())
    # Add interpolated values in between ranks
    finer_ranks = ((rank_mean.index+0.5).to_list() +
                    rank_mean.index.to_list())
    rank_mean = rank_mean.reindex(finer_ranks).sort_index().interpolate()
    return (df.rank(method='average')
              .stack()
              .map(rank_mean)
              .unstack())
quant_norm(df)

Out[122]: 
         C1        C2        C3
A  5.666667  5.166667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  5.166667  4.666667
D  4.666667  3.000000  5.666667
Knobkerrie answered 19/5, 2021 at 5:11 Comment(0)
L
1

The code below gives identical result as preprocessCore::normalize.quantiles.use.target and I find it simpler clearer than the solutions above. Also performance should be good up to huge array lengths.

import numpy as np

def quantile_normalize_using_target(x, target):
    """
    Both `x` and `target` are numpy arrays of equal lengths.
    """

    target_sorted = np.sort(target)

    return target_sorted[x.argsort().argsort()]

Once you have a pandas.DataFrame easy to do:

quantile_normalize_using_target(df[0].as_matrix(),
                                df[1].as_matrix())

(Normalizing the first columnt to the second one as a reference distribution in the example above.)

Lordan answered 2/5, 2017 at 13:32 Comment(0)
E
0

Possibly more robust to use the median on each row rather than mean (based on code from Shawn. L):

def quantileNormalize(df_input):
    df = df_input.copy()
    #compute rank
    dic = {}
    for col in df:
        dic[col] = df[col].sort_values(na_position='first').values
    sorted_df = pd.DataFrame(dic)
    #rank = sorted_df.mean(axis = 1).tolist()
    rank = sorted_df.median(axis = 1).tolist()
    #sort
    for col in df:
        # compute percentile rank [0,1] for each score in column 
        t = df[col].rank( pct=True, method='max' ).values
        # replace percentile values in column with quantile normalized score
        # retrieve q_norm score using calling rank with percentile value
        df[col] = [ np.nanpercentile( rank, i*100 ) if ~np.isnan(i) else np.nan for i in t ]
    return df
Edifice answered 6/4, 2017 at 15:56 Comment(0)
V
0

I am new to pandas and late to the question, but I think answer might also be of use. It builds off of the great answer from @ayhan:

def quantile_normalize(dataframe, cols, pandas=pd):

    # copy dataframe and only use the columns with numerical values
    df = dataframe.copy().filter(items=cols)

    # columns from the original dataframe not specified in cols
    non_numeric = dataframe.filter(items=list(filter(lambda col: col not in cols, list(dataframe))))


    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()  

    norm = df.rank(method='min').stack().astype(int).map(rank_mean).unstack()


    result = pandas.concat([norm, non_numeric], axis=1)
    return result

the main difference here is closer to some real world applications. Often you just have matrices of numerical data in which case the original answer is sufficient.

Sometimes you have text based data in there as well. This lets you specify the columns cols of your numerical data and will run quantile normalization on those columns. At the end it will merge back the non-numeric (or not to be normalized) columns from your original data frame.

e.g. if you added some 'meta-data' (char) to the wiki example:

df = pd.DataFrame({
    'rep1': [5, 2, 3, 4],
    'rep2': [4, 1, 4, 2],
    'rep3': [3, 4, 6, 8],
    'char': ['gene_a', 'gene_b', 'gene_c', 'gene_d']
}, index = ['a', 'b', 'c', 'd'])

you can then call

quantile_normalize(t, ['rep1', 'rep2', 'rep3'])

to get

    rep1        rep2        rep3        char
a   5.666667    4.666667    2.000000    gene_a
b   2.000000    2.000000    3.000000    gene_b
c   3.000000    4.666667    4.666667    gene_c
d   4.666667    3.000000    5.666667    gene_d
Vein answered 14/1, 2018 at 13:20 Comment(0)
A
0

As pointed out by @msg, none of the solutions here take ties into account. I made a python package called qnorm which handles ties, and correctly recreates the Wikipedia quantile normalization example:

import pandas as pd
import qnorm

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

print(qnorm.quantile_normalize(df))
         C1        C2        C3
A  5.666667  5.166667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  5.166667  4.666667
D  4.666667  3.000000  5.666667

Installation can be done with either pip or conda

pip install qnorm

or

conda config --add channels conda-forge
conda install qnorm
Argentum answered 8/7, 2020 at 10:2 Comment(7)
How is this different than specifying method='average' in his df.rank()?Utgardloki
@Utgardloki I can't clearly put dataframes in comments, but why don't you try it? I get different results with method='average', and qnorm. It just resolves ties differently.Argentum
I tried to use your package and it threw an non precise type array error. my input is a dataframe as well, any idea how to resolve it?Candelabrum
@Candelabrum thanks for letting me know, I never had problems with this. What is type of each column? You can check with df.dtypes. If you have a github account, can you make an issue on the github page about it, with a small piece of code that reproduces the error? Then I should be able to fix it quickly: github.com/Maarten-vd-Sande/qnorm/issues/newArgentum
@Candelabrum I did some checking around and it happens when you use "non standard" datatypes (e.g. float16). It will still crash, but now output a message telling you to convert to e.g. float32Argentum
@Maarten-vd-Sande, Hi sorry I missed your first message. But yes, I tried to convert it to floats and it worked!Candelabrum
@Candelabrum Great! The newest release should do the conversion automatically for youArgentum
P
0

Note that scikit-learn offers a module for quantile normalization:

import pandas as pd
import sklearn.preprocessing

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
sklearn.preprocessing.quantile_transform(df)
array([[1.  , 1.  , 0.  ],
       [0.  , 0.  , 0.33],
       [0.33, 1.  , 0.67],
       [0.67, 0.33, 1.  ]])

It is also possible to adjust the data to a normal distribution instead of a uniform distribution:

sklearn.preprocessing.quantile_transform(df, output_distribution="normal")
array([[ 5.2 ,  5.2 , -5.2 ],
       [-5.2 , -5.2 , -0.43],
       [-0.43,  5.2 ,  0.43],
       [ 0.43, -0.43,  5.2 ]])
Pigfish answered 31/8, 2022 at 11:20 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.