how to remove duplicates in sparql query

Asked 1/4, 2016 at 4:55 Answered 1/4, 2016 at 11:28

I wrote this query and return list of couples and particular condition. ( in http://live.dbpedia.org/sparql)

SELECT DISTINCT ?actor ?person2 ?cnt
WHERE
{
{
    select DISTINCT ?actor ?person2 (count (?film) as ?cnt) 
    where { 
        ?film    dbo:starring ?actor .
        ?actor dbo:spouse ?person2. 
        ?film    dbo:starring ?person2.
    }
    order by ?actor
}
FILTER (?cnt >9)
}

Problem is that some rows is duplicate. example:

http://dbpedia.org/resource/George_Burns http://dbpedia.org/resource/Gracie_Allen 12

http://dbpedia.org/resource/Gracie_Allen http://dbpedia.org/resource/George_Burns 12

how to remove these duplications? I added gender to ?actor but it damage current result.

Fullmouthed answered 1/4, 2016 at 4:55 Comment(0)

Natan Cox's answer shows the typical way to exclude these kind of pseudo-duplicates. The results aren't actually duplicates, because in one, e.g., George Burns is the ?actor, and in the other he is the ?person2. In many cases, you can add a filter to require that the two things are ordered, and that will remove the duplicate cases. E.g., when you have data like:

:a :likes :b .
:a :likes :c .

and you search for

select ?x ?y where { 
  :a :likes ?x, ?y .
}

you can add filter(?x < ?y) to enforce an ordering between the between ?x and ?y which will remove these pseudo-duplicates. However, in this case, it's a bit trickier, since ?actor and ?person2 aren't found using the same critera. If DBpedia contains

:PersonB dbo:spouse :PersonA

but not

:PersonA dbo:spouse :PersonB

then the simple filter won't work, because you'll never find the triple where the subject PersonA is less than the object PersonB. So in this case, you also need to modify your query a bit to make the criteria symmetric:

select distinct ?actor ?spouse (count(?film) as ?count) {
  ?film dbo:starring ?actor, ?spouse .
  ?actor dbo:spouse|^dbo:spouse ?spouse .
  filter(?actor < ?spouse)
}
group by ?actor ?spouse
having (count(?film) > 9)
order by ?actor

(This query also shows that you don't need a subquery here, you can use having to "filter" on aggregate values.) But the important part is using the property path dbo:spouse|^dbo:spouse to find a value for ?spouse such that either ?actor dbo:spouse ?spouse or ?spouse dbo:spouse ?actor. This makes the relationship symmetric, so that you're guaranteed to get all the pairs, even if the relationship is only declared in one direction.

Marquez answered 1/4, 2016 at 11:28 Comment(1)

Nice. You can even generalize the solution to any inverse property relationship. – Swellfish 5/4, 2016 at 18:49

It is not actual duplicates of course since you can look at it from both ways. The way to fix it if you want to is to add a filter. It is a bit of a dirty hack but it only takes one of the 2 rows that are the "same".

SELECT DISTINCT ?actor ?person2 ?cnt
WHERE
{
{
    select DISTINCT ?actor ?person2 (count (?film) as ?cnt) 
    where { 
        ?film    dbo:starring ?actor .
        ?actor dbo:spouse ?person2. 
        ?film    dbo:starring ?person2.
FILTER (?actor < ?person2)


    }
    order by ?actor
}
FILTER (?cnt >9)
}

Swellfish answered 1/4, 2016 at 10:36 Comment(3)

I thought of this too, but this isn't quite enough, since it might be that that the triples aren't all present in both directions. This might miss solutions. – Marquez 1/4, 2016 at 11:19

Ok. True. This only works if you assume complete data or inferencing. – Swellfish 1/4, 2016 at 11:26

I've added an answer that shows how you can do this without inferencing (and also in a way that doesn't require a subquery). – Marquez 1/4, 2016 at 11:29

Recommended topics

Hot tags