Gremlin query for groupcount of last value

I have a Titan graph with two sets of vertices, one for UserID and another for Product. The edges between the UserID and Product vertices carry the "Date" on which each product was purchased. I am looking for a Gremlin query that gives me a groupCount of user IDs by the last product each user purchased.

For example, if I had data like the following:

UserID,Date,Product
A,2016-01-01,'Razor'
A,2016-01-02,'Toothpaste'
B,2016-02-01,'Toothpaste'
B,2016-02-02,'Razor'
C,2016-01-04,'Toothpaste'

I am looking for output like this:

Product, Count
'Toothpaste',2
'Razor',1

I would appreciate any help.

Dogberry answered 3/2, 2017 at 4:0 Comment(0)

The following query works in OLTP and OLAP and doesn't touch more vertices than necessary:

g.V().hasLabel("User").
  local(outE("purchased").order().by("date", decr).limit(1)).inV().
  groupCount().by("name")

Furthermore, Titan can optimize this query well once you create a vertex-centric index on date.
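
For readers who want to check the logic outside a graph database, here is a minimal plain-Python sketch of what the traversal computes. It is not Gremlin and not part of the original answer. The `purchases` list and the function name `last_product_counts` are made up for illustration and reuse the sample rows from the question. ISO date strings are assumed, so string comparison matches chronological order.

```python
from collections import Counter

# Sample rows from the question: (user_id, date, product).
purchases = [
    ("A", "2016-01-01", "Razor"),
    ("A", "2016-01-02", "Toothpaste"),
    ("B", "2016-02-01", "Toothpaste"),
    ("B", "2016-02-02", "Razor"),
    ("C", "2016-01-04", "Toothpaste"),
]


def last_product_counts(rows):
    """Count users by the product of their most recent purchase."""
    latest = {}  # user_id -> (date, product) of the newest purchase seen so far
    for user, date, product in rows:
        # Per-user step, like local(outE(...).order().by("date", decr).limit(1)).
        if user not in latest or date > latest[user][0]:
            latest[user] = (date, product)
    # Like groupCount().by("name") over each user's newest purchase.
    return Counter(product for _, product in latest.values())


print(dict(last_product_counts(purchases)))
# prints {'Toothpaste': 2, 'Razor': 1}
```

Each user is handled independently here, which mirrors the way the `local(...)` step keeps the ordering scoped to one user's edges.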

Crusade answered 3/2, 2017 at 22:59 Comment(1)
Thanks! This works perfectly. I just needed to change the property names to be consistent with Jason's code. Also, thanks for the suggestion on the vertex-centric index; I will try it on the larger graph I have. - Dogberry

This solution works for your example data:

g.V().hasLabel('Product').as('p').inE('Purchase').order().by('Date', decr).outV().dedup().select('p').groupCount().by('Name')

This is the algorithm:

  • Start from the products
  • Traverse to the purchase edges
  • Order the edges by date descending
  • Traverse to the users
  • Dedup the users; only the most recent edge per user will remain because of sorting
  • Jump back to the products
  • Group count by product name
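
The steps above can be sketched in plain Python to make the sort-then-dedup idea concrete. This is an illustration only, not Gremlin. The `purchases` list and the function name `count_by_sort_and_dedup` are hypothetical and reuse the sample rows from the question, with ISO date strings assumed so that sorting them as text sorts them chronologically.

```python
from collections import Counter

# Sample rows from the question: (user_id, date, product).
purchases = [
    ("A", "2016-01-01", "Razor"),
    ("A", "2016-01-02", "Toothpaste"),
    ("B", "2016-02-01", "Toothpaste"),
    ("B", "2016-02-02", "Razor"),
    ("C", "2016-01-04", "Toothpaste"),
]


def count_by_sort_and_dedup(rows):
    """Sort all purchases newest first, keep the first row per user, count products."""
    # Order every purchase by date descending, across all users at once.
    newest_first = sorted(rows, key=lambda row: row[1], reverse=True)
    seen_users = set()
    counts = Counter()
    for user, _date, product in newest_first:
        if user in seen_users:
            continue  # dedup: an older purchase of a user already counted
        seen_users.add(user)
        counts[product] += 1  # group count by product name
    return counts


print(dict(count_by_sort_and_dedup(purchases)))
# prints {'Razor': 1, 'Toothpaste': 2}
```

Note that the sort runs over every purchase at once rather than per user, so the whole edge set has to be ordered before the dedup step can run.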

Here's a Gremlin Console dump showing it in action:

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> a = graph.addVertex(T.label, 'User', 'UserID', 'A')
==>v[0]
gremlin> b = graph.addVertex(T.label, 'User', 'UserID', 'B')
==>v[2]
gremlin> c = graph.addVertex(T.label, 'User', 'UserID', 'C')
==>v[4]
gremlin> r = graph.addVertex(T.label, 'Product', 'Name', 'Razor')
==>v[6]
gremlin> t = graph.addVertex(T.label, 'Product', 'Name', 'Toothpaste')
==>v[8]
gremlin> a.addEdge('Purchase', r, 'Date', new Date(2016, 0, 1))
==>e[10][0-Purchase->6]
gremlin> a.addEdge('Purchase', t, 'Date', new Date(2016, 0, 2))
==>e[11][0-Purchase->8]
gremlin> b.addEdge('Purchase', t, 'Date', new Date(2016, 1, 1))
==>e[12][2-Purchase->8]
gremlin> b.addEdge('Purchase', r, 'Date', new Date(2016, 1, 2))
==>e[13][2-Purchase->6]
gremlin> c.addEdge('Purchase', t, 'Date', new Date(2016, 0, 4))
==>e[14][4-Purchase->8]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:5 edges:5], standard]
gremlin> g.V().hasLabel('Product').as('p').inE('Purchase').order().by('Date', decr).outV().dedup().select('p').groupCount().by('Name')
==>[Toothpaste:2,Razor:1]

Sterling answered 3/2, 2017 at 6:26 Comment(1)
Thanks a lot. This works fine for me in OLTP mode, as in your example. In OLAP mode, however, i.e. with g = graph.traversal(computer()), I get the error "Global traversals on GraphComputer may not contain mid-traversal barriers: OrderGlobalStep([decr(Date)])". I need to run something similar on a very large graph, so any pointers would be great. I am running the titan-1.0.0-hadoop1 version. - Dogberry
