Why is rdflib so slow?

I have a large rdf file:

  • size: 470MB
  • number of lines: almost 6 million
  • unique triple subjects: about 650,000
  • triple amount: about 4,200,000

I loaded the RDF data into the Berkeley DB backend of rdflib via:

import rdflib

# "Sleepycat" is the name of rdflib's Berkeley DB store plugin
graph = rdflib.Graph("Sleepycat")
graph.open("store", create=True)
graph.parse("authorities-geografikum_lds.rdf")

It took many hours to complete on my notebook. The computer isn't really powerful (Intel B980 CPU, 4GB of RAM, no SSD) and the definition is large - but still, many hours for this task seems rather long. Maybe it is partly due to indexing / optimizing the data structures?

What is really irritating is the time it takes for the following queries to complete:

SELECT (COUNT(DISTINCT ?s) as ?c)
WHERE {
    ?s ?p ?o
}

(Result: 667,445)

took over 20 minutes and

SELECT (COUNT(?s) as ?c)
WHERE {
    ?s ?p ?o
}

(Result: 4,197,399)

took over 25 minutes.

In my experience, a relational DBMS filled with comparable data would finish a corresponding query in a small fraction of the time, given appropriate indexing.

So my questions are:

Why is rdflib so slow (especially for queries)?

Can I tune / optimize the database, like I can with indexes in an RDBMS?

Is another (free and "compact") triple store better suited for data of this size, performance-wise?

Bawl answered 12/6, 2019 at 15:6 Comment(8)
The question would be: why use rdflib on top of Berkeley DB instead of a "proper" triple store? There are some open-source ones, e.g. Apache Jena Fuseki, Virtuoso, etc. - Reconstruction
Regarding your question, I doubt any index is used when the query takes 20 minutes to complete. But that's something the devs can answer better. - Reconstruction
I looked into the implementation, and I think your query is a worst case for it. It's not a store that does SPARQL-to-SQL rewriting; it implements an iterator model plus some indices in the DB. So it has to fetch all triples and then do the count in memory. But sure, it still looks a bit slow. - Reconstruction
Here is a related issue: github.com/RDFLib/rdflib/issues/787 - Reconstruction
Thank you for your answers. My resulting question is: why use rdflib with Berkeley DB at all, if a main use case of rdflib is storing and querying triples, and rdflib with Berkeley DB is obviously not suited for it? - Bawl
In the early days of RDF there were no native RDF stores yet. The first RDF stores were built on top of existing storage engines, such as SQL databases and BDB. The rdflib implementation goes back to these early days. This is now an obsolete approach, as native stores offer much better performance and full SPARQL compliance. (Virtuoso is an interesting outlier here; AIUI its RDF store today is still a highly tuned relational engine, and actually has great performance.) - Broadminded
So, it makes sense to always use an external RDF store, right? - Bawl
And would you still recommend rdflib to fill and query an external store (when using Python)? Or are there better alternatives (with or without Python)? - Bawl

I experienced similarly slow behavior with rdflib. For me, a solution was to switch the underlying graph store to oxrdflib, which drastically improved the speed of SPARQL queries.

See: https://pypi.org/project/oxrdflib/

Logarithmic answered 6/2, 2022 at 14:53 Comment(0)
