Find people who bought the same games as someone else

g.addV('person').property(id, 'john')
  .addV('person').property(id, 'jim')
  .addV('person').property(id, 'pam')
  .addV('game').property(id, 'G1')
  .addV('game').property(id, 'G2')
  .addV('game').property(id, 'G3').iterate()
g.V('john').as('p').V('G1').addE('bought').from('p').iterate()
g.V('john').as('p').V('G2').addE('bought').from('p').iterate()
g.V('john').as('p').V('G3').addE('bought').from('p').iterate()
g.V('jim').as('p').V('G1').addE('bought').from('p').iterate()
g.V('jim').as('p').V('G2').addE('bought').from('p').iterate()
g.V('pam').as('p').V('G1').addE('bought').from('p').iterate()

g.V('john').as('target') // Target person we are interested in comparing against
  .out('bought').aggregate('target_games') // Games bought by target
  .in('bought').where(P.neq('target')).dedup() // Persons who bought same games as target (excluding target and without duplicates)
  .group().by().by(out("bought").where(P.within("target_games")).count()) // Find persons, group by number of co-owned games
  .unfold().order().by(values, desc).toList() // Unfold to create list, order by greatest number of common games

There are a few different ways this query could be written. Here is one way that uses a mid-traversal V step: having found John's games, it visits all the other people (everyone who is not John), looks at their games, and checks whether they intersect with the games that John owns.

gremlin> g.V('john').as('j').
......1>   out().
......2>   aggregate('owns').
......3>   V().
......4>   hasLabel('person').
......5>   where(neq('j')).
......6>   group().
......7>     by(id).
......8>     by(out('bought').where(within('owns')).dedup().fold())

==>[pam:[v[G1]],jim:[v[G1],v[G2]]]

However, the mid-traversal V approach is not really needed, as you can just look at the incoming vertices from the games that John owns:

gremlin> g.V('john').as('j').
......1>   out().
......2>   aggregate('owns').
......3>   in('bought').
......4>   where(neq('j')).
......5>   group().
......6>     by(id).
......7>     by(out('bought').where(within('owns')).dedup().fold())

==>[pam:[v[G1]],jim:[v[G1],v[G2]]]

Finally, here is a third way, where the dedup step is applied sooner. This is likely to be the most efficient of the three.

gremlin> g.V('john').as('j').
......1>   out().
......2>   aggregate('owns').
......3>   in('bought').
......4>   where(neq('j')).
......5>   dedup().
......6>   group().
......7>     by(id).
......8>     by(out('bought').where(within('owns')).fold())

==>[pam:[v[G1]],jim:[v[G1],v[G2]]]
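
If only the number of shared games is needed, as in the query from the question, the fold() in the last modulator can be swapped for count(). This is a minimal sketch of that variation on the third query, assuming the same sample graph as above; with that data jim should map to 2 and pam to 1.

g.V('john').as('j').
   out().
   aggregate('owns').
   in('bought').
   where(neq('j')).
   dedup().
   group().
     by(id).
     by(out('bought').where(within('owns')).count())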

UPDATED based on the discussion in the comments. I'm not sure that this is a simpler query, but you can extract a group from a projection like this:

gremlin> g.V('john').as('j').
......1>   out().as('johnGames').
......2>   in('bought').
......3>   where(neq('j')).as('personPurchasedJohnGames').
......4>   project('johnGames','personPurchasedJohnGames').
......5>     by(select('johnGames')).
......6>     by(select('personPurchasedJohnGames')).
......7>   group().
......8>     by(select('personPurchasedJohnGames')).
......9>     by(select('johnGames').fold())

==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]

but actually you can further reduce this to

gremlin> g.V('john').as('j').
......1>   out().as('johnGames').
......2>   in('bought').
......3>   where(neq('j')).as('personPurchasedJohnGames').
......4>   group().
......5>     by(select('personPurchasedJohnGames')).
......6>     by(select('johnGames').fold())

==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]

So now we have many choices to pick from! It would be interesting to measure these and see whether any are faster than the others. In general, I tend to avoid as() steps because they cause path tracking to be turned on (using up memory), but since the other queries already have an as('j'), it is not really a big deal here.
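
One way to start measuring is to append the standard profile() step to a candidate query. It replaces the normal result with per-step metrics (traverser counts and timings). The sketch below profiles the third query; the same step can be appended to any of the others. The figures reported depend on the graph provider and the size of the data, so no output is shown here, and timings on a six-vertex sample graph are unlikely to be meaningful.

g.V('john').as('j').
   out().
   aggregate('owns').
   in('bought').
   where(neq('j')).
   dedup().
   group().
     by(id).
     by(out('bought').where(within('owns')).fold()).
   profile()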

EDITED AGAIN to add ordering of results

g.V('john').as('j').
   out().as('johnGames').
   in('bought').
   where(neq('j')).as('personPurchasedJohnGames').
   group().
     by(select('personPurchasedJohnGames')).
     by(select('johnGames').fold()).
   unfold().
   order().
     by(select(values).count(local),desc)

{v[jim]: [v[G1], v[G2]]}
{v[pam]: [v[G1]]}
