Graph Database or Relational Database Common Table Expressions: Comparing acyclic graph query performance

Are graph databases more performant than relational databases for highly connected acyclic graph data?

I need to significantly speed up my query results and hope that graph databases will be the answer. I saw a significant improvement in my relational database queries when I used Common Table Expressions, bringing a recursive search of my sample data down from 16 hours to 30 minutes. Still, 30 minutes is far too long for a web application, and trying to work around that kind of response time with caching gets ridiculous pretty quickly.

My Gremlin query looks something like:

g.withSack(100D).
  V(<vertex id>).
  repeat(out('edge_label').
         sack(div).by(constant(2D))).
  emit().
  group().by('node_property').by(sack().sum()).
  unfold().
  order().by(values, decr).
  fold()

a Cypher equivalent (thank you, cyberSam) looks something like:

MATCH p=(f:Foo)-[:edge_label*]->(g)
WHERE f.id = 123
RETURN g, SUM(100*0.5^(LENGTH(p)-1)) AS weight
ORDER BY weight DESC

and my SQL looks roughly like:

WITH PctCTE(id, pt, tipe, ct)
AS
    (SELECT id, CONVERT(DECIMAL(28,25),100.0) AS pt, kynd, 1
        FROM db.reckrd parent
        WHERE parent.id = @id
    UNION ALL
        SELECT child.id, CONVERT(DECIMAL(28,25),parent.pt/2.0), child.kynd, parent.ct+1
        FROM db.reckrd AS child
        INNER JOIN PctCTE AS parent
        ON (parent.tipe = 'M' AND
        (child.emm = parent.id))
        OR
        (NOT parent.tipe = 'M' AND
        (child.not_emm = parent.id))
    ),
    mergeCTE(dups, h, p)
    AS
        (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY ct) 'dups', id, SUM(pt) OVER (PARTITION BY id)
        FROM PctCTE
        )
-- outer query (assumed; not shown in the original snippet)
SELECT h, p
FROM mergeCTE
WHERE dups = 1
ORDER BY p DESC

which should return a result set with 500,000+ edges in my test instance.

Even if I filtered to reduce the size of the output, the query would still have to traverse all of those edges first before I could get to the interesting results I want to analyse.

I can foresee some queries on real data getting closer to having to traverse 3,000,000+ edges ...

If graph databases aren't the answer, is a CTE as good as it gets?

Kentiga answered 27/7, 2020 at 9:13 Comment(2)
Since you tagged neo4j: can you also express your query in Cypher (or at least explain what it is trying to do?).Dagan
Fair point : ) The traversal begins with a selected node and follows each relationship that connects it to another "child" node. The starting node is given a value of 100. For each layer further away, a node's assigned value is reduced by half, so a "child" node is given a value of 50, a "grandchild" 25. After all relationship paths are followed to their ends, nodes that were visited more than once have their values summed together. Finally the list is ordered by these assigned values in descending order. I haven't figured out how to get Gremlin to include more data from visited nodes.Kentiga

I tried JanusGraph 0.5.2 with BerkeleyDB Java Edition. My sample data set has 580,832 vertices and 2,325,896 edges, loaded from a roughly 1 GB GraphML file. The network's average degree is 4, diameter 30, average path length 1124, modularity 0.7, average clustering coefficient 0.013, and eigenvector centrality (100 iterations) 4.5.

No doubt I am writing my query rather amateurishly, but after waiting 10 hours only to receive a Java out-of-memory error, it is clear that my CTE performance is at least 20 times faster!!!

My conf/janusgraph-berkeleyje.properties file included the following settings:

gremlin.graph = org.janusgraph.core.JanusGraphFactory
storage.backend = berkeleyje
storage.directory = ../db/berkeley
cache.db-cache = true
cache.db-cache-size = 0.5
cache.db-cache-time = 0
cache.tx-cache-size = 20000
cache.db-cache-clean-wait = 0
storage.transactions = false
storage.berkeleyje.cache-percentage = 65
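
For reference, here is a minimal Gremlin console sketch of how a graph opened with this configuration can be loaded from a GraphML file; the file path is hypothetical, and this assumes JanusGraph 0.5.x with TinkerPop's GraphML reader:

// open the graph using the properties file above and get a traversal source
graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje.properties')
// read the ~1 GB GraphML sample data set (hypothetical path)
graph.io(graphml()).readGraph('../data/sample.graphml')
graph.tx().commit()
g = graph.traversal()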

At this stage in my investigation, it would appear that CTEs are at least an order of magnitude more performant than graph databases on heavily recursive queries. I would love to be wrong...

Kentiga answered 2/8, 2020 at 6:54 Comment(0)

[UPDATED]

When using neo4j, here is a roughly equivalent Cypher query, which uses a variable-length relationship pattern:

MATCH p=(f:Foo)-[:edge_label*]->(g)
WHERE f.id = 123
RETURN g, SUM(100*0.5^(LENGTH(p)-1)) AS weight
ORDER BY weight DESC

A variable-length relationship pattern has exponential complexity. If the average out-degree is D and the maximum depth is N, then you should expect a complexity of about O(D^N). In your use case, that is on the order of 4^30 (roughly 10^18) operations.

However, since in your use case a node's contribution to the total weight decreases exponentially with its depth in a given path, you could get a close approximation of the actual result by simply ignoring nodes beyond a threshold depth.

For example, a node at a depth of 8 would only add about 0.78 to the total weight (under 1% of the starting value of 100). And the complexity at that depth would only be about 4^8 (or 65K) operations, which should be reasonably fast. The Cypher query for a max depth of 8 would be only slightly different:

MATCH p=(f:Foo)-[:edge_label*..8]->(g)
WHERE f.id = 123
RETURN g, SUM(100*0.5^(LENGTH(p)-1)) AS weight
ORDER BY weight DESC
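
For completeness, the same depth cap can presumably be applied on the Gremlin side by adding a times() modulator to the repeat() step. This is a sketch based on the traversal in the question (the vertex id placeholder and labels are unchanged):

g.withSack(100D).
  V(<vertex id>).
  repeat(out('edge_label').
         sack(div).by(constant(2D))).
  emit().times(8).
  group().by('node_property').by(sack().sum()).
  unfold().
  order().by(values, decr).
  fold()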
Dagan answered 27/7, 2020 at 19:50 Comment(5)
Thank you for taking an interest in my situation! And thank you for the Cypher equivalency. -- The problem with my data is that the interesting things are discovered in the deep reaches of that exponential complexity. I am looking at effects where a "parent" 15 or 25 "generations" back can make up as much as 25% of a "descendant" when those small effects are summed together. It is the accumulated composition of these distant ingredients that is of interest.Kentiga
The Cypher equivalency looks logically correct as does the variable-length relationship pattern designation.Kentiga
I've updated my answer to correct the complexity formula, which unfortunately shows that your use case is even harder than I originally stated. You could try using a max depth of 15, but that would still mean over a billion operations.Dagan
Appreciated, both your diligence and the operational implications. There are various potential datasets of interest to me and my users. The 3,000,000+ group has a shallower depth, but a 1,500,000 group of great interest has the greatest depth of all, with much of interest in the 30+ depth range where the "weight" measure will be losing accuracy! But aren't we living in the age of BIG data?Kentiga
Any chance we can get the data or a data generator? Objectivity/DB has been used for exabyte scale object and graph databases where queries commonly go 30, 40, or 50 degrees in near-real-time. I'd be interested in seeing the actual problem and data.Nappy
