Find people who bought the same games as someone else
Asked Answered
P

1

7

I'm using Amazon Neptune to create and query a simple graph database. I'm currently running my code in an AWS Jupyter Notebook but will eventually move the code to Python (gremlin_python). As you can probably guess I'm pretty new to Gremlin and graph databases in general.

I have the following data

g.addV('person').property(id, 'john')
 .addV('person').property(id, 'jim')
 .addV('person').property(id, 'pam')
 .addV('game').property(id, 'G1')
 .addV('game').property(id, 'G2')
 .addV('game').property(id, 'G3').iterate() 

g.V('john').as('p').V('G1').addE('bought').from('p').iterate()
g.V('john').as('p').V('G2').addE('bought').from('p').iterate()
g.V('john').as('p').V('G3').addE('bought').from('p').iterate()

g.V('jim').as('p').V('G1').addE('bought').from('p').iterate()
g.V('jim').as('p').V('G2').addE('bought').from('p').iterate()

g.V('pam').as('p').V('G1').addE('bought').from('p').iterate()

3 persons and 3 games in the database. My goal is, given a person, tell me which persons have bought the same games as them and which games are those

After looking at sample code (mostly from https://tinkerpop.apache.org/docs/current/recipes/#recommendation) I have the following code that tries to find games bought by

g.V('john').as('target')   Target person we are interested in comparing against
.out('bought').aggregate('target_games') // Games bought by target
.in('bought').where(P.neq('target')).dedup() // Persons who bought same games as target (excluding target and without duplicates)
.group().by().by(out("bought").where(P.within("target_games")).count()) // Find persons, group by number of co owned games
.unfold().order().by(values, desc).toList() // Unfold to create list, order by greatest number of common games

Which gives me the results:

  • {v[jim]: 2}
  • {v[pam]: 1}

Which tells me that jim has 2 of the same games as john while pam only has 1. But I want my query to return the actual games they have in common like so (still in order of most common games):

  • {v[jim]: ['G1', 'G2']}
  • {v[pam]: ['G1]}

Thanks for your help.

Prefatory answered 20/2, 2021 at 15:10 Comment(3)
I provided a few different ways to do this in the answer below. You might want to change the question title to something like "Find people that own the games I own". The current title will not help other people searching on this same kind of issue. Thanks for providing the sample graph. That really helps when answering questions.Chatwin
Everything has been answered below by @KelvinLawrence except how to order the final results by number of common games.Prefatory
I added a version of the query that sorts by number of games in commonChatwin
C
6

There are a few different ways this query could be written. Here is one way that uses a mid traversal V step having found John's games to find all the other people who are not John, look at their games and see if they intersect with games that John owns.

gremlin> g.V('john').as('j').
......1>   out().
......2>   aggregate('owns').
......3>   V().
......4>   hasLabel('person').
......5>   where(neq('j')).
......6>   group().
......7>     by(id).
......8>     by(out('bought').where(within('owns')).dedup().fold())

==>[pam:[v[G1]],jim:[v[G1],v[G2]]]       

However, the mid traversal V approach is not really needed as you can just look at the incoming vertices from the games that Jown owns

gremlin> g.V('john').as('j').
......1>   out().
......2>   aggregate('owns').
......3>   in('bought').
......4>   where(neq('j')).
......5>   group().
......6>     by(id).
......7>     by(out('bought').where(within('owns')).dedup().fold())

==>[pam:[v[G1]],jim:[v[G1],v[G2]]]         
                   

Finally, here is a third way, where the dedup step is applied sooner. This is likely to be the most efficient of the three.

gremlin> g.V('john').as('j').
......1>   out().
......2>   aggregate('owns').
......3>   in('bought').
......4>   where(neq('j')).
......5>   dedup().
......6>   group().
......7>     by(id).
......8>     by(out('bought').where(within('owns')).fold())

==>[pam:[v[G1]],jim:[v[G1],v[G2]]]    

UPDATED based on comments discussion. I'm not sure that this is a simpler query but you can extract a group from a projection like this:

gremlin> g.V('john').as('j').
......1>   out().as('johnGames').
......2>   in('bought').
......3>   where(neq('j')).as('personPurchasedJohnGames').
......4>   project('johnGames','personPurchasedJohnGames').
......5>     by(select('johnGames')).
......6>     by(select('personPurchasedJohnGames')).
......7>   group().
......8>     by(select('personPurchasedJohnGames')).
......9>     by(select('johnGames').fold())

==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]      

but actually you can further reduce this to

gremlin> g.V('john').as('j').
......1>   out().as('johnGames').
......2>   in('bought').
......3>   where(neq('j')).as('personPurchasedJohnGames').
......4>   group().
......5>     by(select('personPurchasedJohnGames')).
......6>     by(select('johnGames').fold())

==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]        

So now we have many choices to pick from! It will be interesting to measure these and see if any are faster than others. In general I have a tendency to avoid use of as steps as that causes path tracking to be turned on (using up memory) but as we already have an as('j') in the other queries not really a big deal.

EDITED AGAIN to add ordering of results

g.V('john').as('j').
   out().as('johnGames').
   in('bought').
   where(neq('j')).as('personPurchasedJohnGames').
   group().
     by(select('personPurchasedJohnGames')).
     by(select('johnGames').fold()).
   unfold().
   order().
    by(select(values).count(local),desc)

{v[jim]: [v[G1], v[G2]]}
{v[pam]: [v[G1]]}
Chatwin answered 21/2, 2021 at 14:18 Comment(8)
Awesome! this is really helpfull. I have been trying this out myself, couldn't get it working mainly because i was approaching in different way :( . All these three solutions are selecting the games of people how have purchased John games and exclude John games from the list. is it possible to do this without traversing all the games of each individual? we are starting at John , going to his games(collect them), get all other people who purchased these games and stop there, because we have everything we need, and simply group the games and people?Ashley
let say g.V('john').as('j').out().as('johnGames').in().where(neq('j')).as('personPurchasedJohnGames').project('johnGames','personPurchasedJohnGames').by(select('johnGames')).by(select('personPurchasedJohnGames')) , but just not able to group by personAshley
I added an example of how to group the results of the project and also how to remove it.Chatwin
Great!! Last one is what i have ben trying to get to. i didn't use .fold() in .by(select('personPurchasedJohnGames')).by(select('johnGames').fold()) and have been getting just 1 game. still learning only a month into Graph database. Appreciate the answers with multiple examples. I would double upvote if i could :)Ashley
Glad it helped. Please accept the answer so others can see it is answered.Chatwin
Thank you so much for your help. I have tried a few of the queries and they have all worked. The only thing missing for me is how to order the results by number of common games.Prefatory
Just to clarify - you want to order the results basically by the count of common games so the ones with the most come first etc?Chatwin
I added a version of the query that sorts by number of games in commonChatwin

© 2022 - 2024 — McMap. All rights reserved.