Pandas: Apply function over each pair of columns under constraints

As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form:

Code |  14  |  17  |  19  | ...
w1   |  0   |   5  |   3  | ...
w2   |  2   |   5  |   4  | ... 
w3   |  0   |   0  |   5  | ...

The Code corresponds to a specific location in a rectangular grid and the ws are different words. I would like to compute the cosine similarity between each pair of columns, but only (EDITED!) if the sum of the items in one of the columns of the pair is greater than 5.

The desired output would be something like:

     | [14,17]   | [14,19]   | [14,...]   | [17,19]   | ...
Sim  | cs(14,17) | cs(14,19) | cs(14,...) | cs(17,19) | ...

cs is the result of the cosine similarity for each pair of columns. Is there any suitable method to do this?

Any help would be appreciated :-)

Flexure answered 19/7, 2016 at 10:0 Comment(2)
If I'm getting it straight, you wouldn't want cs(14,17) nor cs(14,19) etc. because there's no item in the '14' column that's greater than 5. And did you try anything? Could you please provide some code and examples that failed? – Mencius
Hi, @danielhadar. Actually, so far I've only done a few calculations by hand. I'm asking whether there is any method to apply functions (cosine similarity in this case, but I will apply more functions) to each pair of columns in a vectorized way, i.e. without writing loops over columns. The last df is built only to give a better visualization of the result; it's not important. – Flexure

To apply the cosine metric to each pair from two collections of inputs, you could use scipy.spatial.distance.cdist. This will be much faster than using a double Python loop.

Let one collection be all the columns of df. Let the other collection be only those columns where the sum is greater than 5:

import pandas as pd
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]

Then all the cosine distances (1 minus the cosine similarity) can be computed with one call to cdist:

import scipy.spatial.distance as SSD
values = SSD.cdist(df2.T, df.T, metric='cosine')
# array([[  2.92893219e-01,   1.11022302e-16,   3.00000000e-01],
#        [  4.34314575e-01,   3.00000000e-01,   1.11022302e-16]])
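
Since the question asks for similarities while cdist with metric='cosine' returns distances, here is a minimal sketch of the conversion. It reuses the example data from above; the names distances and similarities are only illustrative.

```python
import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})
df2 = df.loc[:, df.sum(axis=0) > 5]   # keeps columns '17' and '19'

# cdist(..., metric='cosine') gives cosine distances, i.e. 1 - similarity,
# so the similarity is recovered by subtracting from 1.
distances = SSD.cdist(df2.T, df.T, metric='cosine')
similarities = pd.DataFrame(1 - distances,
                            index=df2.columns, columns=df.columns)
print(similarities.round(4))
```

The metric argument of cdist also accepts other metric names (for example 'euclidean' or 'correlation') or a Python callable taking two 1-D arrays, which may help when applying other functions; a callable is evaluated pair by pair in Python, so it will likely be slower than the built-in metrics.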

The values can be wrapped in a new DataFrame and reshaped:

result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()

Putting it all together, and filtering out the pairs that match a column with itself:

import pandas as pd
import scipy.spatial.distance as SSD
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
values = SSD.cdist(df2.T, df.T, metric='cosine')
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
mask = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result.loc[mask]
print(result)

yields the Series of cosine distances, indexed by column pair:

17  14    0.292893
    19    0.300000
19  14    0.434315
    17    0.300000
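
The Series above lists each pair twice when both columns pass the filter (17/19 and 19/17). As a rough sketch of the one-row layout shown in the question, the following keeps each unordered pair once. It assumes that "one of the columns" means at least one column of the pair has a sum above 5, and that no column is all zeros; the names unit, sim, pairs and row are illustrative, and similarities are computed directly with NumPy rather than with cdist.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})

# Scale every column to unit length; the dot product of two unit columns
# is their cosine similarity. (Assumption: no column is all zeros.)
unit = df / np.sqrt((df ** 2).sum(axis=0))
sim = unit.T @ unit   # square DataFrame: column labels x column labels

# Keep each unordered pair once, if at least one column sums to more than 5.
col_sums = df.sum(axis=0)
pairs = [(a, b) for a, b in combinations(df.columns, 2)
         if col_sums[a] > 5 or col_sums[b] > 5]

# One-row frame with a '[a,b]' label per pair, as in the question.
row = pd.DataFrame({f'[{a},{b}]': [sim.loc[a, b]] for a, b in pairs},
                   index=['Sim'])
print(row.round(4))
```

The only Python-level loop here runs over pair labels to build the output; the similarity computation itself is a single matrix product.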
Regress answered 19/7, 2016 at 11:21 Comment(0)
