How generate all pairs of values, from the result of a groupby, in a pandas dataframe
Asked Answered
C

4

12

I have a pandas dataframe df:

ID     words
1      word1
1      word2
1      word3
2      word4
2      word5
3      word6
3      word7
3      word8
3      word9

I want to produce another dataframe that would generate all pairs of words in each group. So the result for the above would be:

ID     wordA    wordB
1      word1    word2
1      word1    word3
1      word2    word3
2      word4    word5
3      word6    word7
3      word6    word8
3      word6    word9
3      word7    word8
3      word7    word9
3      word8    word9

I know that I can used df.groupby['words'] to get the words within each ID.

I also know that I can use

iterable = ['word1','word2','word3']
list(itertools.combinations(iterable, 2))

to get all possible pairwise combinations. However, I'm a little lost as to the best way to generate a resulting dataframe as shown above.

Clein answered 3/12, 2017 at 13:32 Comment(4)
I have edited the title to better reflect the actual problem you are trying to solve.Harmattan
no the new title has nothing to do with what I want... thanks for the effort though...Clein
Possible duplicate: #43800890Harmattan
yeah I got the itertools part right. :) I need to figure out how to do it within the groupby please. I'm doing it for each ID.Clein
B
12

Its simple use itertools combinations inside apply and stack i.e

from itertools import combinations
ndf = df.groupby('ID')['words'].apply(lambda x : list(combinations(x.values,2)))
                          .apply(pd.Series).stack().reset_index(level=0,name='words')

 ID           words
0   1  (word1, word2)
1   1  (word1, word3)
2   1  (word2, word3)
0   2  (word4, word5)
0   3  (word6, word7)
1   3  (word6, word8)
2   3  (word6, word9)
3   3  (word7, word8)
4   3  (word7, word9)
5   3  (word8, word9)

To match you exact output further we have to do

sdf = pd.concat([ndf['ID'],ndf['words'].apply(pd.Series)],1).set_axis(['ID','WordsA','WordsB'],1,inplace=False)

   ID WordsA WordsB
0   1  word1  word2
1   1  word1  word3
2   1  word2  word3
0   2  word4  word5
0   3  word6  word7
1   3  word6  word8
2   3  word6  word9
3   3  word7  word8
4   3  word7  word9
5   3  word8  word9

To convert it to a one line we can do :

combo = df.groupby('ID')['words'].apply(combinations,2)\
                     .apply(list).apply(pd.Series)\
                     .stack().apply(pd.Series)\
                     .set_axis(['WordsA','WordsB'],1,inplace=False)\
                     .reset_index(level=0)
Borscht answered 3/12, 2017 at 14:5 Comment(3)
Strictly though, it looks like the result should be split into two columns in the OP, which is another step? I'm curious whether all this ends up any more efficient than just building a new df? Won't apply run as a python loop here?Rattly
Ah, I was slow writing my first comment re: further step :)Rattly
Hmmm, it seems there is better return DataFrame from apply ;)Overthrust
O
7

You can use groupby with apply and return DataFrame, last add reset_index for remove second level and then for create column from index:

from itertools import combinations

f = lambda x : pd.DataFrame(list(combinations(x.values,2)), 
                            columns=['wordA','wordB'])
df = (df.groupby('ID')['words'].apply(f)
                               .reset_index(level=1, drop=True)
                               .reset_index())
print (df)
   ID  wordA  wordB
0   1  word1  word2
1   1  word1  word3
2   1  word2  word3
3   2  word4  word5
4   3  word6  word7
5   3  word6  word8
6   3  word6  word9
7   3  word7  word8
8   3  word7  word9
9   3  word8  word9
Overthrust answered 3/12, 2017 at 15:0 Comment(0)
M
2

You can define a custom function that is applied to each group. Both input and output are a dataframe:

def combine(group):
    return pd.DataFrame.from_records(itertools.combinations(group.word, 2))

df.groupby('ID').apply(combine)

Result:

          0      1
ID                
1  0  word1  word2
   1  word1  word3
   2  word2  word3
2  0  word4  word5
3  0  word6  word7
   1  word6  word8
   2  word6  word9
   3  word7  word8
   4  word7  word9
   5  word8  word9
Miniature answered 3/12, 2017 at 14:16 Comment(0)
L
0

The easiest way to do this is:

from itertools import combinations
import pandas as pd

df_new = pd.DataFrame(list(combinations(df.words, 2)), columns=['word1', 'word2'])
Lookin answered 18/9, 2019 at 21:54 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.