Pandas: Apply function over each pair of columns under constraints

As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form:

Code |  14  |  17  |  19  | ...
w1   |  0   |   5  |   3  | ...
w2   |  2   |   5  |   4  | ... 
w3   |  0   |   0  |   5  | ...

The Code row identifies a location in a rectangular grid, and the w rows are different words. I would like to compute the cosine similarity between each pair of columns, but only (edited) if the sum of the items in one of the columns of the pair is greater than 5.

The desired output would be something like:

     | [14,17]  |  [14,19]  |  [14,...]  |  [17,19]  | ...
Sim  |cs(14,17) |cs(14,19)  |cs(14,...)  |cs(17,19)  | ...

cs is the result of the cosine similarity for each pair of columns. Is there any suitable method to do this?

Any help would be appreciated :-)

To apply the cosine metric to each pair drawn from two collections of inputs, you could use scipy.spatial.distance.cdist. This will be much faster than using a double Python loop.

Let one collection be all the columns of df. Let the other collection be only those columns where the sum is greater than 5:

import pandas as pd
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
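As a quick check of the filtering step, here is a minimal sketch using the same toy data. The names col_sums and selected are only illustrative and are not used elsewhere in this answer.

```python
import pandas as pd

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})

# Sum down each column: 14 -> 2, 17 -> 10, 19 -> 12
col_sums = df.sum(axis=0)

# Boolean Series indexed by column label, True where the sum exceeds 5
mask = col_sums > 5

# Labels of the columns that pass the filter
selected = df.loc[:, mask].columns.tolist()
print(col_sums.tolist(), selected)
```

With this data only column 14 is filtered out, so every pair still contains at least one column whose sum is greater than 5.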

Then all the pairwise values can be computed with one call to cdist. Note that metric='cosine' returns the cosine distance, which is 1 minus the cosine similarity:

import scipy.spatial.distance as SSD
values = SSD.cdist(df2.T, df.T, metric='cosine')
# array([[  2.92893219e-01,   1.11022302e-16,   3.00000000e-01],
#        [  4.34314575e-01,   3.00000000e-01,   1.11022302e-16]])
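Since the question asks for similarity and cdist reports distance, a conversion step may be needed. The following is a minimal sketch, assuming the same toy data; the names distances, similarities and manual are illustrative only. It also cross-checks one entry against the textbook formula.

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})
df2 = df.loc[:, df.sum(axis=0) > 5]

# cdist(metric='cosine') returns the cosine distance, i.e. 1 - cos(u, v)
distances = SSD.cdist(df2.T, df.T, metric='cosine')
similarities = 1.0 - distances

# Cross-check the entry for columns '17' and '14' by hand:
# dot(u, v) / (|u| * |v|)
u = df['17'].to_numpy(dtype=float)
v = df['14'].to_numpy(dtype=float)
manual = u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(similarities[0, 0], manual)
```

Rows of similarities follow df2.columns and columns follow df.columns, so similarities[0, 0] is the pair (17, 14).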

The values can be wrapped in a new DataFrame and reshaped:

result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()

Putting it all together, and dropping the pairs that compare a column with itself:

import pandas as pd
import scipy.spatial.distance as SSD
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
values = SSD.cdist(df2.T, df.T, metric='cosine')
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
mask = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result.loc[mask]
print(result)

yields the Series of cosine distances (subtract each value from 1 to get the similarity)

17  14    0.292893
    19    0.300000
19  14    0.434315
    17    0.300000
