"Supernodes" in Titan

I'm developing an application that could work well with a graph database (Titan), except it's having problems with vertices with many edges, i.e. supernodes.

The supernodes link above points to a blog post from the authors of Titan, explaining a way to resolve the problem. The solution seems to be reducing the number of vertices by filtering on edges.

Unfortunately I want to groupCount attributes of edges or vertices. For example I have 1 million users and each user belongs to a country. How can I do a fast groupCount to work out the number of users in each country?

What I've tried so far can be shown in this elaborate groovy script:

g = TitanFactory.open('titan.properties')  // Cassandra
r = new Random(100)
people = 1e6

def newKey(g, name, type) {
    return g
        .makeType()
        .name(name)
        .simple()
        .functional()
        .indexed()
        .dataType(type)
        .makePropertyKey()
}

def newLabel(g, name, key) {
    return g
        .makeType()
        .name(name)
        .primaryKey(key)
        .makeEdgeLabel()
}

country = newKey(g, 'country', String.class)
newLabel(g, 'lives', country)

g.stopTransaction(SUCCESS)

root = g.addVertex()
countries = ['AU', 'US', 'CN', 'NZ', 'UK', 'PL', 'RU', 'NL', 'FR', 'SP', 'IT']

(1..people).each {
    country = countries[(r.nextFloat() * countries.size()).toInteger()]
    g.startTransaction()
    person = g.addVertex([name: 'John the #' + it])
    g.addEdge(g.getVertex(root.id), person, 'lives', [country: country])
    g.stopTransaction(SUCCESS)
}

t0 = new Date().time

m = [:]    
root = g.getVertex(root.id)
root.outE('lives').country.groupCount(m).iterate()

t1 = new Date().time

println "groupCount seconds: " + ((t1 - t0) / 1000)

Basically one root node (for the sake of Titan not having an "all" nodes lookup), linked to many person via edges that have the country property. When I run the groupCount() on 1 million vertices, it takes over a minute.

I realise Titan is probably iterating over each edge and collecting counts, but is there a way to make this run faster in Titan, or any other graph database? Can the index itself be counted so it doesn't have to traverse? Are my indexes correct?

Recommended topics

Hot tags