I have a large RDF file:
- size: 470 MB
- number of lines: almost 6 million
- unique triple subjects: about 650,000
- number of triples: about 4,200,000
I loaded the RDF data into the Berkeley DB ("Sleepycat") backend of rdflib via:
import rdflib

graph = rdflib.Graph("Sleepycat")
graph.open("store", create=True)
graph.parse("authorities-geografikum_lds.rdf")
It took many hours to complete on my notebook. The machine isn't exactly powerful (Intel B980 CPU, 4 GB of RAM, no SSD) and the file is large, but even so, many hours for this task seems rather long. Maybe it is partly due to indexing / optimizing the data structures?
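To see where those hours go, one option is to split the job into a parse phase and a store phase. This is only a diagnostic sketch, and it assumes the parsed graph fits in memory, which may be tight with 4 GB of RAM for 4.2 million triples; if the second phase dominates, the time is going into the store's index maintenance rather than into parsing:

import time
import rdflib

# Phase 1: parse into the default in-memory store to measure pure parsing cost.
mem = rdflib.Graph()
t0 = time.time()
mem.parse("authorities-geografikum_lds.rdf")
print("parse: %.1fs, %d triples" % (time.time() - t0, len(mem)))

# Phase 2: push the parsed triples into the persistent store to measure pure
# storage cost. Sleepycat maintains several B-tree indexes over the triples,
# so this phase is write-heavy.
db = rdflib.Graph("Sleepycat")
db.open("store", create=True)
t0 = time.time()
db.addN((s, p, o, db) for s, p, o in mem)
print("store: %.1fs" % (time.time() - t0))
db.close()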
What is really irritating is the time it takes for the following queries to complete:
SELECT (COUNT(DISTINCT ?s) as ?c)
WHERE {
?s ?p ?o
}
(Result: 667,445)
took over 20 minutes and
SELECT (COUNT(?s) as ?c)
WHERE {
?s ?p ?o
}
(Result: 4,197,399)
took over 25 minutes.
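For comparison, both counts can be obtained through rdflib's graph API without going through the SPARQL engine at all; if the calls below finish much faster, most of those minutes are query-evaluation overhead rather than store iteration. A small sketch against the already-created store directory:

import rdflib

graph = rdflib.Graph("Sleepycat")
graph.open("store", create=False)

# Equivalent of COUNT(?s): len() delegates to the store. It may still scan
# an index internally, but it skips SPARQL parsing and evaluation entirely.
print(len(graph))

# Equivalent of COUNT(DISTINCT ?s): iterate all subjects and deduplicate.
print(len(set(graph.subjects())))

graph.close()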
In my experience, a relational DBMS filled with comparable data would finish a corresponding query in a small fraction of that time, given appropriate indexing.
So my questions are:
Why is rdflib so slow (especially for queries)?
Can I tune / optimize the database, the way I can with indexes in an RDBMS?
Is another (free and "compact") triple store better suited for data of this size, performance-wise?
rdflib on top of a relational database instead of a "proper" triple store? There are some open source ones, e.g. Apache Jena Fuseki, Virtuoso, etc. – Reconstruction
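If a dedicated store such as the ones named in the comment turns out to be the answer, the Python side can stay almost unchanged: rdflib ships a SPARQLStore that sends queries to a remote SPARQL endpoint, so the counting happens server-side where that store's native indexes apply. A minimal sketch; the endpoint URL and the dataset name "geo" are placeholders for a locally running Fuseki instance:

import rdflib
from rdflib.plugins.stores.sparqlstore import SPARQLStore

# Placeholder endpoint: a local Apache Jena Fuseki server with a dataset "geo".
store = SPARQLStore("http://localhost:3030/geo/sparql")
graph = rdflib.Graph(store)

result = graph.query("SELECT (COUNT(DISTINCT ?s) AS ?c) WHERE { ?s ?p ?o }")
for row in result:
    print(row.c)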