How to select random DBPedia nodes from SPARQL?
A

7

How can I select a random sample from DBpedia using the SPARQL endpoint?

This query

SELECT ?s WHERE { ?s ?p ?o . FILTER ( 1 > bif:rnd (10, ?s, ?p, ?o) ) } LIMIT 10

(found here) seems to work OK on most SPARQL endpoints, but on http://dbpedia.org/sparql the result gets cached, so it always returns the same 10 nodes.

If I try it from Jena, I get the following exception:

Unresolved prefixed name: bif:rnd

And I can't find out what the 'bif' namespace is.

Any idea on how to solve this?

Mulone

Agglutination answered 15/4, 2011 at 13:15 Comment(0)
B
7

bif:rnd is not part of the SPARQL standard and is therefore not portable across SPARQL endpoints. You can use LIMIT, ORDER BY and OFFSET to simulate a random sample with a standard query. Something like:

SELECT * WHERE { ?s ?p ?o } 
ORDER BY ?s OFFSET $some_random_number$ LIMIT 10

Here some_random_number is a number generated by your application. This should avoid the caching problem, but the query is still quite expensive, and I don't know whether public endpoints will support it.

Try to avoid completely open patterns like ?s ?p ?o and your query will be much more efficient.
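As a rough illustration of the application side of this idea, here is a minimal Python sketch that fills in the OFFSET with a random number. The function name, the `max_offset` parameter and the template constant are made up for this example; `max_offset` is assumed to be chosen by the caller so that it stays below the number of matching rows (otherwise the endpoint returns nothing).

```python
import random

# The query from the answer, with placeholders for OFFSET and LIMIT.
QUERY_TEMPLATE = """SELECT * WHERE {{ ?s ?p ?o }}
ORDER BY ?s OFFSET {offset} LIMIT {limit}"""


def build_random_offset_query(max_offset, limit=10, rng=None):
    """Return the query text with a randomly chosen OFFSET in [0, max_offset]."""
    if max_offset < 0 or limit <= 0:
        raise ValueError("max_offset must be >= 0 and limit must be > 0")
    if rng is None:
        rng = random.Random()  # unseeded, so each call picks a new offset
    offset = rng.randint(0, max_offset)  # inclusive on both ends
    return QUERY_TEMPLATE.format(offset=offset, limit=limit)
```

Calling `build_random_offset_query(100000)` gives a query string that can be sent to the endpoint with whatever SPARQL client you already use. Note that the 10 rows are still consecutive in the ORDER BY ?s order; only the starting point is random.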

Bedpost answered 15/4, 2011 at 14:30 Comment(2)
I guess the problem with your solution is that the selection order doesn't change, so it's not really random. Maybe I could improve its "randomness" by putting together samples from different queries, something like (query with offset x1) UNION (query with offset x2) UNION (query with offset xn). - Agglutination
My selection order is not random, but an OFFSET over a random number will give you a random sample. It's the OFFSET that needs to be generated randomly. - Bedpost
I
9

In SPARQL 1.1 you can do:

SELECT ?s
WHERE {
  ?s ?p ?o
}
ORDER BY RAND()
LIMIT 10

I don't know offhand how many stores optimise, or even implement, this yet.

[see comment below, this doesn't quite work]

An alternative is:

SELECT (SAMPLE(?s) AS ?ss)
WHERE { ?s ?p ?o }
GROUP BY ?s

But I'd think that's even less likely to be optimised.

Immitigable answered 15/4, 2011 at 19:37 Comment(3)
Actually the 2nd one won't work: SAMPLE() returns an arbitrary value, but not a random one. - Immitigable
I would only add 'DISTINCT': SELECT DISTINCT ?s - Ethylethylate
I believe RAND() is only called once, so it will be the same value for all the results. If so, try using "(SHA512(CONCAT(str(?s),str(RAND()))) as ?random)" and "ORDER BY ?random". - Tigon
D
1

bif:rnd is a Virtuoso-specific extension and will thus only work against Virtuoso SPARQL endpoints.

bif is the prefix for Virtuoso Built In Functions which enable any Virtuoso function to be called in SPARQL, with rnd being a Virtuoso function for returning random numbers.

Denys answered 18/4, 2011 at 0:0 Comment(0)
T
1

I encountered the same problem and none of the solutions here addressed my issue. Here is my solution; it was non-trivial and quite a hack. This works for DBPedia as of now, and may work for other SPARQL endpoints, but it is not guaranteed to work for future releases.

DBPedia uses Virtuoso, which supports an undocumented argument to the RAND function; the argument effectively specifies the range to use for the PRNG. The game is to trick Virtuoso into believing that the input argument cannot be statically-evaluated before each result row is computed, forcing the program to evaluate RAND() for every binding:

select * {
    ?s dbo:isPartOf ?o .  # Whatever your pattern is
    bind(rand(1 + strlen(str(?s))*0) as ?rid)
} order by ?rid

The magic happens in rand(1 + strlen(str(?s))*0), which is the equivalent of rand() but is forced to run on every match, because the engine cannot predict the value of an expression that involves a variable (in this case, we just compute the length of the IRI as a string). The actual expression is not important: we multiply it by 0 to cancel it out, then add 1 so that rand executes normally.

This only works because the developers did not go this far in their static-code-evaluation of expressions. They could have easily written a branch for "multiply by zero", but alas they did not :)

Thumb answered 2/2, 2016 at 2:11 Comment(0)
G
1

None of the above methods works with Jena/Fuseki, so I've done it in another way:

SELECT DISTINCT ?s ?p ?o
{
  ?s ?p ?o.
  # MD5 is defined on strings, so convert the IRI with STR() first
  BIND ( MD5 ( STR ( ?s ) ) AS ?rnd)
}
ORDER BY ?rnd ?p
LIMIT 100

Obviously this doesn't select random triples, but the set of the first k MD5-ordered subjects should have relevant features of a statistically significant sample (i.e. the sample is representative of the entire population, there is no particular selection bias).
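To make the "not random" part concrete, here is a small local Python sketch (no endpoint involved) that mimics ordering subjects by a hash of their IRI. It contrasts a plain MD5 ordering, which returns the same first k subjects on every run, with an ordering where a fresh random value is mixed into each row's hash. The IRIs and function names are hypothetical and only serve the illustration.

```python
import hashlib
import random


def first_k_by_md5(subjects, k):
    """Mimic ordering subjects by the MD5 of their IRI string, then LIMIT k."""
    def key(s):
        return hashlib.md5(s.encode("utf-8")).hexdigest()
    return sorted(subjects, key=key)[:k]


def first_k_by_salted_hash(subjects, k, rng):
    """Same idea, but each row's hash input is prefixed with a fresh random value."""
    def key(s):
        salt = str(rng.random())  # new random value for every row
        return hashlib.sha512((salt + s).encode("utf-8")).hexdigest()
    return sorted(subjects, key=key)[:k]


# Hypothetical IRIs, made up for this example.
subjects = [f"http://example.org/resource/{i}" for i in range(1000)]
```

`first_k_by_md5(subjects, 5)` depends only on the IRIs, so repeating the query can never produce a different sample. `first_k_by_salted_hash(subjects, 5, random.Random())` changes from run to run.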

Guthrun answered 16/11, 2016 at 21:52 Comment(2)
I have used this method several times myself, only to find out that there is unfortunately a selection bias. Comparing the distribution of a large sample with its original dataset, the selection was skewed towards certain characteristics. I suggest using RAND() or bif:rnd when available. - Garay
Thanks @mommi84, it would be interesting to know which bias there is, and whether some other function could be combined with MD5 hashing to make the returned results less biased. - Guthrun
F
1

After much experimentation I have ended up with the following solution, a combination of using a hash to avoid RAND() being statically-evaluated and RAND() to avoid the selection biases caused by only using a hash.

SELECT ?s WHERE {
  ?s ?p ?o .
  BIND(SHA512(CONCAT(STR(RAND()), STR(?s))) AS ?random) .
} ORDER BY ?random
LIMIT 1

Here used to select a random valley glacier from Wikidata:

SELECT ?item ?itemLabel ?random WHERE {
  ?item wdt:P31 wd:Q11762356 .
  BIND(SHA512(CONCAT(STR(RAND()), STR(?item))) AS ?random) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en" . }
} ORDER BY ?random
LIMIT 1

Try it (the service caches responses; you can bypass this by adding a new comment to the query before running it)
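If you run the query from code rather than from the web form, the same cache trick can be automated. The following is a minimal Python sketch, assuming the service keys its cache on the raw query text (which is what the remark above suggests, but is not something guaranteed here). The function name is made up for this example.

```python
import uuid


def with_cache_busting_comment(query):
    """Prepend a unique SPARQL comment line so the query text differs on each call."""
    nonce = uuid.uuid4().hex  # random 32-character hex string
    # In SPARQL, '#' starts a comment that runs to the end of the line.
    return f"# nonce: {nonce}\n{query}"
```

The comment does not change what the query means, only its text, so two calls with the same query produce two different strings to send.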

Forepleasure answered 15/9, 2020 at 20:2 Comment(0)
K
0
SELECT ?s WHERE { 
    ?s ?p ?o . 
    bind(<SHORT_OR_LONG::bif:rnd> (10, ?s, ?p, ?o) as ?rid)
}
ORDER BY ?rid
LIMIT 10

How about this one?

<SHORT_OR_LONG::bif:rnd> may be better than <bif:rnd>. (http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksGuideRandomSampleAllTriples)

You simply bind a random id (?rid) to each row of bindings (?s ?p ?o), then order the results by that random id.

Kickback answered 23/9, 2016 at 12:12 Comment(0)
