Query for best match to a string with SPARQL?

I have a list of movie titles and want to look them up in DBpedia for meta information such as "director". However, I have trouble identifying the correct movie with SPARQL, because the titles sometimes don't match exactly.

How can I get the best match for a movie title from DBpedia using SPARQL?

Some problematic examples:

  • My List: "Die Hard: with a Vengeance" vs. DBpedia: "Die Hard with a Vengeance"
  • My List: "Hachi" vs. DBpedia: "Hachi: A Dog's Tale"

My current approach is to query the DBpedia endpoint for all movies, filter by checking for the individual tokens of the title (without punctuation), order by title, and return the first result. E.g.:

SELECT ?resource ?title ?director WHERE {
   ?resource foaf:name ?title .
   ?resource rdf:type schema:Movie .
   ?resource dbo:director ?director .
   FILTER (
      contains(lcase(str(?title)), "die") && 
      contains(lcase(str(?title)),"hard")
   )
}
ORDER BY (?title)
LIMIT 1

This approach is very slow and also sometimes fails, e.g.:

SELECT ?resource ?title ?director WHERE {
   ?resource foaf:name ?title .
   ?resource rdf:type schema:Movie .
   ?resource dbo:director ?director .
   FILTER (
      contains(lcase(str(?title)), "hachi") 
   )
}
ORDER BY (?title)
LIMIT 10

where the correct result is in second place:

  resource                                          title                        director
  http://dbpedia.org/resource/Chachi_420            "Chachi 420"@en              http://dbpedia.org/resource/Kamal_Haasan
  http://dbpedia.org/resource/Hachi:_A_Dog's_Tale   "Hachi: A Dog's Tale"@en     http://dbpedia.org/resource/Lasse_Hallström    
  http://dbpedia.org/resource/Hachiko_Monogatari    "Hachikō Monogatari"@en      http://dbpedia.org/resource/Seijirō_Kōyama
  http://dbpedia.org/resource/Thachiledathu_Chundan "Thachiledathu Chundan"@en   http://dbpedia.org/resource/Shajoon_Kariyal

Any ideas how to solve this problem? Or even better: How to query for best matches to a string with SPARQL in general?

Thanks!

Chadburn answered 30/7, 2016 at 7:8 Comment(7)
SPARQL endpoints are not text search engines, so there is only limited support for string matching in the SPARQL standards. Some triple stores do have extended support, depending on the underlying implementation. E.g. some triple stores use Lucene for text search, while others like Virtuoso have some built-in functions. - Lymn
The DBpedia endpoint uses Virtuoso, so you could have a look at docs.openlinksw.com/virtuoso/rdfsparqlrulefulltext . E.g. bif:contains is much faster on indexed literals than a regular REGEX. An example from the docs is ?s foaf:Name ?name . ?name bif:contains "'rich*'". which would match all subjects whose foaf:Name contains a word starting with Rich, such as Richard, Richie etc. - Lymn
@AKSW Thanks for the hint with bif:contains. I will take a look at that. - Chadburn
Have a look at #24557520. As mentioned, SPARQL isn't really meant for string processing, though it can do a lot, even if it won't be super performant. That link shows how you can compute some edit distances with SPARQL. - Asiaasian
@JoshuaTaylor Thanks for the link! I tried that approach and came up with a pretty good working solution (see my answer). - Chadburn
@AKSW Can Lucene be added to Fuseki? I found this, however it seems this is related to Fuseki's Jena API. Of course, it won't be very useful on the standalone server. But it's good for testing purposes. - Culprit
@RFNO jena.apache.org/documentation/query/… - Lymn
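
To make the bif:contains suggestion from the comments a bit more concrete, here is a minimal sketch in Python that turns a title from the list into a query string for the DBpedia (Virtuoso) endpoint. The helper names `tokenize` and `bif_contains_query`, the `min_len` cutoff, and the choice to join the words with AND are assumptions for illustration and not taken from the thread; the query relies on the prefixes that the DBpedia endpoint predefines, and very common words may be treated differently by the full-text index.

```python
import re

def tokenize(title):
    """Lowercase a title and split it into alphanumeric tokens.

    Punctuation and underscores act as separators, so "Die_Hard" and
    "Die Hard:" both yield ["die", "hard"].
    """
    return re.findall(r"[^\W_]+", title.lower())

def bif_contains_query(title, min_len=3, limit=10):
    """Build a SPARQL query that uses Virtuoso's bif:contains (hypothetical helper)."""
    # Assumption: drop very short tokens such as "a" or a stray "s".
    tokens = [t for t in tokenize(title) if len(t) >= min_len]
    if not tokens:
        raise ValueError("no usable tokens in title")
    # Full-text expression: each word in single quotes, joined with AND.
    expr = " AND ".join("'" + t + "'" for t in tokens)
    return (
        "SELECT ?resource ?title ?director WHERE {\n"
        "   ?resource rdf:type schema:Movie .\n"
        "   ?resource foaf:name ?title .\n"
        "   ?resource dbo:director ?director .\n"
        f'   ?title bif:contains "{expr}" .\n'
        "}\n"
        f"LIMIT {limit}"
    )

if __name__ == "__main__":
    print(bif_contains_query("Die Hard: with a Vengeance"))
```

Because the tokens contain only letters and digits, they can be placed inside the quoted full-text expression without further escaping.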

I adapted the regex approach mentioned in the comments and came up with a solution that works pretty well, better than anything I could get with bif:contains:

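   # Project the string lengths; standard SPARQL requires parentheses around "expr AS ?var"
   SELECT ?resource ?title ?match (strlen(str(?title)) AS ?lenTitle) (strlen(str(?match)) AS ?lenMatch)

   WHERE {
      ?resource foaf:name ?title .
      ?resource rdf:type schema:Movie .
      ?resource dbo:director ?director .
      # Prefix the lowercased title with 'x', then keep only the tokens that matched
      bind( replace(LCASE(CONCAT('x',?title)), "^x(die)*(?:.*?(hard))*(?:.*?(with))*.*$", "$1$2$3") as ?match )
   }

   # Longest match first; among equal matches, prefer the shortest title
   ORDER BY DESC(?lenMatch) ASC(?lenTitle)

   LIMIT 5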

It's not perfect, so I'm still open to suggestions.
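
To show what each part of the regex does, here is a rough local emulation of the scoring in Python. It is only a sketch: the function names `build_pattern`, `match_string` and `rank` are made up for this illustration, and Python's `re` module stands in for the endpoint's regex engine, which may differ in details. The pattern keeps the first token only when the title starts with it, keeps each later token when it occurs somewhere after the previous one, and throws everything else away; the length of what remains is the score.

```python
import re

def build_pattern(tokens):
    """Build the regex used in the query for an arbitrary token list."""
    first, *rest = [re.escape(t.lower()) for t in tokens]
    # The first token must sit directly after the 'x' prefix, i.e. at the title start.
    pattern = "^x(" + first + ")*"
    # Every later token may appear anywhere after the previous match.
    pattern += "".join("(?:.*?(" + t + "))*" for t in rest)
    # Swallow the rest of the title so that only the captured groups survive.
    return pattern + ".*$"

def match_string(title, tokens):
    """Return the concatenation of the tokens that were found in the title."""
    groups = "".join(r"\g<%d>" % i for i in range(1, len(tokens) + 1))
    return re.sub(build_pattern(tokens), groups, "x" + title.lower())

def rank(titles, tokens):
    """Sort titles like ORDER BY DESC(?lenMatch) ASC(?lenTitle)."""
    return sorted(titles, key=lambda t: (-len(match_string(t, tokens)), len(t)))

if __name__ == "__main__":
    print(match_string("Die Hard with a Vengeance", ["die", "hard", "with"]))
    print(rank(["Chachi 420", "Hachi: A Dog's Tale", "Thachiledathu Chundan"], ["hachi"]))
```

With these three example titles, "Hachi: A Dog's Tale" comes first, because it is the only one that starts with "hachi". A title such as "Hachikō Monogatari" produces the same match, so in that case only the title length breaks the tie, which is one reason the ranking is not perfect.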

Chadburn answered 31/7, 2016 at 14:18 Comment(1)
Can you explain what each part is doing? I want to be able to search for "Die_Hard" while ignoring the _ (underscore) and making it case-insensitive. I searched with your code and it gave me too many hits! - Culprit
