SPARQL query including a subquery on Wikidata gives unexpected results

I know the following SPARQL query against the Wikidata SPARQL endpoint makes little sense. A similar query is generated automatically by my application. Please disregard its conceptual soundness and let's dig into the strange (to me, at least) thing that is happening.

SELECT ?year1 ?year_labelTemp
    WHERE
      { 
        ?year1  <http://www.w3.org/2000/01/rdf-schema#label>  ?year_labelTemp .
        { SELECT distinct ?year1
          WHERE
            { ?film  <http://www.wikidata.org/prop/direct/P577>  ?date ;
                     <http://www.wikidata.org/prop/direct/P31>  <http://www.wikidata.org/entity/Q11424>
              BIND(year(?date) AS ?year1)
            }
        }   
      }
    limit 10

According to query evaluation in SPARQL, the subquery is evaluated first, and its results are then projected out to the containing query. Consequently, this subquery will be evaluated first.

SELECT distinct ?year1
      WHERE
        { ?film  <http://www.wikidata.org/prop/direct/P577>  ?date ;
                 <http://www.wikidata.org/prop/direct/P31>  <http://www.wikidata.org/entity/Q11424>
          BIND(year(?date) AS ?year1)
        }

The subquery gives exactly the results expected (130 different years). Then, the results of this subquery (?year1 variable) will be projected out and joined with the triple pattern in the outer select.

?year1  <http://www.w3.org/2000/01/rdf-schema#label>  ?year_labelTemp .

However, as the outer select shouldn't have any data (no labels for ?year1), the join will give no results.
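That expectation holds only if every row of the subquery actually binds ?year1. In SPARQL, when the expression in a BIND raises an error, the variable is left unbound for that solution and evaluation continues. A minimal sketch of a check, assuming the same film pattern as above (the name ?filmsWithoutYear is made up for illustration, and the query may be slow on the public endpoint):

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Count films whose publication date does not yield a year.
SELECT (COUNT(?film) AS ?filmsWithoutYear)
WHERE {
  ?film wdt:P577 ?date ;
        wdt:P31 wd:Q11424 .
  # year() errors on anything that is not a date literal,
  # which leaves ?year1 unbound for that row.
  BIND(year(?date) AS ?year1)
  FILTER(!bound(?year1))
}
```

A non-zero count would mean the DISTINCT subquery also returns one row in which ?year1 is unbound.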

Surprisingly (at least for me), executing the whole query (stated first) gives results, and the results are weird.

 wd:Q43576  Mië
 wd:Q221    Masèdonia
 wd:Q221    Республикэу Македоние
 wd:Q221    Republiek van Masedonië
 wd:Q212    Украина
 wd:Q212    Ukraina
 wd:Q212    Украинэ
 wd:Q212    Oekraïne
 wd:Q207    George W. Bush
 wd:Q207    George W. Bush

What am I missing?

Tillfourd answered 28/4, 2018 at 13:23 Comment(2)
That's what people call a bug in the Blazegraph backend. - Sweettalk
The same problem happens with a local extraction deployed into a GraphDB instance! - Tillfourd

The problem is that sometimes BIND does not project variables correctly.

You can check this with the following query:

SELECT ?year1 ?year_labelTemp ?projected
    WHERE
      { 
        ?year1  rdfs:label  ?year_labelTemp .
        hint:Prior hint:runLast true .
        { SELECT DISTINCT ?year1
          WHERE
            { ?film  wdt:P577  ?date ;
                     wdt:P31 wd:Q11424
              BIND(year(?date) AS ?year1)
              hint:SubQuery hint:runOnce true 
            }
         } 
        BIND(bound(?year1) AS ?projected)
      }
    LIMIT 10


Fortunately, the following trick helps:

SELECT ?year1 ?year_labelTemp
    WHERE
      { 
        ?year1  rdfs:label  ?year_labelTemp  .
        hint:Prior hint:runLast true .
        { SELECT DISTINCT ?year1
          WHERE
            { ?film  wdt:P577  ?date ;
                     wdt:P31 wd:Q11424
              BIND(year(?date) AS ?year1)
              FILTER (?year1 > 0)
            }
         } 
      }
    LIMIT 10



The bug can be reproduced without nested subqueries and with hint:Query hint:optimizer "None", so it should not be a query optimizer bug. Interestingly, though, the bug disappears after replacing wd:Q11424 with wd:Q24862.

BLZG-963 seems to be the most related issue (as you can see, built-in functions are involved too).

Fawcette answered 28/4, 2018 at 23:58 Comment(6)
Thank you for the answer and the information. However, I don't think it solves the problem. I think your last query returns no results for some reason other than correct execution. As another check, let's try this query: it selects a single value from the subquery, which is 1, projects it out to the outer query, and then adds a single triple pattern to get the type of the projected variable. - Tillfourd
SELECT ?year1 ?x ?y WHERE { { SELECT ?year1 WHERE { BIND (2 as ?year1) } } optional {?x ?y ?year1 .} } LIMIT 1 - Tillfourd
One funny thing: since I am using GraphDB as my backend, changing the order of the statements changes the results. If the subquery comes first in the outer query, followed by the triple pattern, then with your suggested trick (FILTER (?year1 > 0)) it works just as expected; the other order takes us back to square one. - Tillfourd
I will do some more tests in the meantime. - Tillfourd
Now I am not sure it's a BIND problem. The following query still fails: SELECT ?year1 ?year_labelTemp WHERE { ?year1 rdfs:label ?year_labelTemp . hint:Prior hint:runLast true . { SELECT DISTINCT (year(?date) as ?year1) WHERE { ?film wdt:P577 ?date ; } } } LIMIT 10 - Tillfourd
@MedianHilal, semantically, this is more or less the same. It seems that the problem may appear when the results of XPath functions are projected, as in BLZG-963. The problem is rather complex, but nested subqueries are not necessary to reproduce it. - Fawcette

You wrote that the subquery gave exactly the expected result, but I think you missed one value! There are films with an unknown value as their publication date, for example Q18844655 (at least at the time of writing). It was this empty value that caused the seemingly random objects to be found.
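Why an empty value has this effect can be sketched without touching Wikidata at all. Under standard SPARQL join semantics, a solution in which ?year1 is unbound is compatible with every solution on the other side of the join. In the sketch below, the ex: IRIs, the labels and the year 1999 are made up for illustration; the two VALUES blocks stand in for the label pattern and for the DISTINCT subquery.

```sparql
PREFIX ex: <http://example.org/>

SELECT ?year1 ?label
WHERE {
  # Stand-in for "?year1 rdfs:label ?year_labelTemp".
  VALUES (?year1 ?label) {
    (ex:a "label A")
    (ex:b "label B")
  }
  # Stand-in for the subquery: one real year, plus one row
  # where the BIND failed and ?year1 stayed unbound (UNDEF).
  {
    SELECT ?year1
    WHERE { VALUES ?year1 { 1999 UNDEF } }
  }
}
```

On a standards-conforming engine this should return both ex: rows: 1999 matches neither of them, while the unbound row joins with each. That is the same shape as the labels of arbitrary items showing up in the original result.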

If you add a filter such as FILTER(datatype(?date) = xsd:dateTime) to your inner SELECT, you will only get actual dates and therefore only actual years, which means one value fewer than without the filter.
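For reference, the inner SELECT with that filter might look like the sketch below. Placing the FILTER just before the BIND is an assumption; a FILTER applies to its whole group, so its exact position inside the braces should not matter.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?year1
WHERE {
  ?film wdt:P577 ?date ;
        wdt:P31 wd:Q11424 .
  # Keep only real date literals; unknown values have no
  # xsd:dateTime datatype and are dropped here.
  FILTER(datatype(?date) = xsd:dateTime)
  BIND(year(?date) AS ?year1)
}
```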

(When this corrected inner SELECT is used, the whole query then times out. The labelling really doesn't like odd values like these, it seems.)

Glavin answered 5/7, 2019 at 15:12 Comment(0)
