I have a large RDF file:
- size: 470 MB
- number of lines: almost 6 million
- unique triple subjects: about 650,000
- number of triples: about 4,200,000
I loaded the RDF data into the Berkeley DB ("Sleepycat") backend of rdflib via:
import rdflib

graph = rdflib.Graph("Sleepycat")
graph.open("store", create=True)
graph.parse("authorities-geografikum_lds.rdf")
It took many hours to complete on my notebook. The machine isn't exactly powerful (Intel B980 CPU, 4 GB of RAM, no SSD) and the file is large, but even so, many hours for this task seems rather long. Maybe it is partly due to indexing / optimizing the data structures?
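To see where those hours go, one option is to split the job into a parse phase and a store phase. This is only a diagnostic sketch, and it assumes the parsed graph fits in memory, which may be tight with 4 GB of RAM for 4.2 million triples; if the second phase dominates, the time is going into the store's index maintenance rather than into parsing:

import time
import rdflib

# Phase 1: parse into the default in-memory store to measure pure parsing cost.
mem = rdflib.Graph()
t0 = time.time()
mem.parse("authorities-geografikum_lds.rdf")
print("parse: %.1fs, %d triples" % (time.time() - t0, len(mem)))

# Phase 2: push the parsed triples into the persistent store to measure pure
# storage cost. Sleepycat maintains several B-tree indexes over the triples,
# so this phase is write-heavy.
db = rdflib.Graph("Sleepycat")
db.open("store", create=True)
t0 = time.time()
db.addN((s, p, o, db) for s, p, o in mem)
print("store: %.1fs" % (time.time() - t0))
db.close()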
What is really irritating is the time it takes for the following queries to complete:
SELECT (COUNT(DISTINCT ?s) as ?c)
WHERE {
?s ?p ?o
}
(Result: 667,445)
took over 20 minutes and
SELECT (COUNT(?s) as ?c)
WHERE {
?s ?p ?o
}
(Result: 4,197,399)
took over 25 minutes.
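For comparison, both counts can be obtained through rdflib's graph API without going through the SPARQL engine at all; if the calls below finish much faster, most of those minutes are query-evaluation overhead rather than store iteration. A small sketch against the already-created store directory:

import rdflib

graph = rdflib.Graph("Sleepycat")
graph.open("store", create=False)

# Equivalent of COUNT(?s): len() delegates to the store. It may still scan
# an index internally, but it skips SPARQL parsing and evaluation entirely.
print(len(graph))

# Equivalent of COUNT(DISTINCT ?s): iterate all subjects and deduplicate.
print(len(set(graph.subjects())))

graph.close()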
In my experience, a relational DBMS filled with comparable data would finish a corresponding query in a small fraction of that time, given appropriate indexing.
So my questions are:
Why is rdflib so slow (especially for queries)?
Can I tune / optimize the database, the way I can with indexes in an RDBMS?
Is another (free and "compact") triple store better suited for data of this size, performance-wise?
rdflib on top of a relational database instead of a "proper" triple store? There are some open source ones, e.g. Apache Jena Fuseki, Virtuoso, etc. – Reconstruction
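If a dedicated store such as the ones named in the comment turns out to be the answer, the Python side can stay almost unchanged: rdflib ships a SPARQLStore that sends queries to a remote SPARQL endpoint, so the counting happens server-side where that store's native indexes apply. A minimal sketch; the endpoint URL and the dataset name "geo" are placeholders for a locally running Fuseki instance:

import rdflib
from rdflib.plugins.stores.sparqlstore import SPARQLStore

# Placeholder endpoint: a local Apache Jena Fuseki server with a dataset "geo".
store = SPARQLStore("http://localhost:3030/geo/sparql")
graph = rdflib.Graph(store)

result = graph.query("SELECT (COUNT(DISTINCT ?s) AS ?c) WHERE { ?s ?p ?o }")
for row in result:
    print(row.c)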