How to find similar content using SPARQL

I'm playing with the idea of using SPARQL to identify conceptual overlap between things.

Take movies, for example (LinkedMDB data). If I have a movie, "The Matrix", and my goal is to list movies that are similar to it, I would probably start by doing the following:

  • The Matrix
    • get genre
    • get actors
    • get director
    • get location
    • etc

And then, using the things I identified for The Matrix, I would query for things with those properties (pseudo-query):

SELECT movie, genre, director, location, actors
WHERE {
  genre is action or sci-fi .

  director are the Wachowski brothers .

  location is set in a big city .

  OPTIONAL( actors were in the matrix . )
}

Is there something in SPARQL that allows me to check for overlap of properties between different nodes? Or must this be done manually like I've proposed?

Matchless answered 22/1, 2014 at 17:46 Comment(0)

Matching some specific properties

It sounds like you're asking for something along the lines of

select ?similarMovie ?genre ?director ?location ?actor where { 
  values ?movie { <http://.../TheMatrix> }
  ?genre    ^:hasGenre    ?movie, ?similarMovie .
  ?director ^:hasDirector ?movie, ?similarMovie .
  ?location ^:hasLocation ?movie, ?similarMovie .
  optional { ?actor ^:hasActor ?movie, ?similarMovie . }
}

That uses the backwards path notation ^ and object lists to make it much shorter than:

select ?similarMovie ?genre ?director ?location ?actor where { 
  values ?movie { <http://.../TheMatrix> }
  ?movie        :hasGenre    ?genre .
  ?movie        :hasDirector ?director .
  ?movie        :hasLocation ?location .
  ?similarMovie :hasGenre    ?genre .
  ?similarMovie :hasDirector ?director .
  ?similarMovie :hasLocation ?location .
  optional { 
    ?movie        :hasActor ?actor .
    ?similarMovie :hasActor ?actor .
  }
}

For instance, using DBpedia, we can get other films that have the same distributor and cinematographer as The Matrix:

select ?similar ?cinematographer ?distributor where {
  values ?movie { dbpedia:The_Matrix }
  ?cinematographer ^dbpprop:cinematography ?movie, ?similar .
  ?distributor ^dbpprop:distributor ?movie, ?similar .
}
limit 10

The results are all within that same franchise; you get: The Matrix, The Matrix Reloaded, The Matrix Revolutions, The Matrix (franchise), and The Ultimate Matrix Collection.

Matching at least some number of properties

It's also possible to ask for things that have at least some number of properties in common. How many properties two things need to share before they count as similar is subjective, depends on the particular data, and takes some experimentation. For instance, we can ask for films on DBpedia that share more than 35 property-value pairs with The Matrix with a query like this:

select ?similar where { 
  values ?movie { dbpedia:The_Matrix }
  ?similar ?p ?o ; a dbpedia-owl:Film .
  ?movie   ?p ?o .
}
group by ?similar ?movie
having (count(?p) > 35)

This gives 13 movies (including The Matrix and the other movies in the franchise):

  • V for Vendetta (film)
  • The Matrix
  • The Postman (film)
  • Executive Decision
  • The Invasion (film)
  • Demolition Man (film)
  • The Matrix (franchise)
  • The Matrix Reloaded
  • Freejack
  • Exit Wounds
  • The Matrix Revolutions
  • Outbreak (film)
  • Speed Racer (film)

Using this kind of approach, you could even use the number of common properties as a measure of similarity. For instance:

select ?similar (count(?p) as ?similarity) where { 
  values ?movie { dbpedia:The_Matrix }
  ?similar ?p ?o ; a dbpedia-owl:Film .
  ?movie   ?p ?o .
}
group by ?similar ?movie
having (count(?p) > 35)
order by desc(?similarity)

SPARQL results

The Matrix             206
The Matrix Revolutions  63
The Matrix Reloaded     60
The Matrix (franchise)  55
Demolition Man (film)   41
Speed Racer (film)      40
V for Vendetta (film)   38
The Invasion (film)     38
The Postman (film)      36
Executive Decision      36
Freejack                36
Exit Wounds             36
Outbreak (film)         36
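
One possible refinement of the ranking above, offered as a sketch rather than as part of the original answer. It assumes the endpoint predefines the dbpedia: and dbpedia-owl: prefixes, as the queries above do; ?sharedPairs and ?sharedProperties are names chosen here for illustration. count(?p) counts one row per shared property-value pair, so a property with several shared values is counted several times, while count(distinct ?p) counts each shared property once. The filter drops The Matrix itself from its own results.

```sparql
select ?similar
       (count(?p) as ?sharedPairs)
       (count(distinct ?p) as ?sharedProperties)
where {
  values ?movie { dbpedia:The_Matrix }
  ?similar ?p ?o ; a dbpedia-owl:Film .
  ?movie   ?p ?o .
  # Exclude the seed film, which trivially shares everything with itself
  filter(?similar != ?movie)
}
group by ?similar ?movie
having (count(?p) > 35)
order by desc(?sharedPairs)
```

Which of the two counts makes the better similarity score probably depends on the data; comparing both columns is a cheap way to find out.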
Winnifredwinning answered 22/1, 2014 at 17:58 Comment(7)
That's great. While I'm still learning, are there any other ways (using SPARQL) you could go about solving that same problem? - Matchless
@Matchless The original question was a little bit vague, but I think that if you've got one movie already and you're looking for other movies that "agree" on some given set of properties, this is probably going to be the way to do it. If you wanted to be a bit more general, e.g., "find me other films that have at least 5 properties in common, but I don't care what the properties are", you could probably do that too. - Winnifredwinning
@Matchless I updated my answer with an example of that approach. - Winnifredwinning
Thanks Joshua, you're the man. BTW, that second answer is so cool. - Matchless
Sorry, one more question. In the latter examples, how would I select some additional property and show it in the result? Let's say I wanted to list the movie name, the property similarity count, and the date; would I be using ?similar as the subject? - Matchless
@Matchless Yes; you'd be looking for information about ?similar, so you'd ask for it like any other information about ?similar. Just as we asked for ?similar a dbpedia-owl:Film, you could add rdfs:label ?label (in which case you might also want filter langMatches(lang(?label),"en"), replacing "en" with some other language as appropriate). - Winnifredwinning
These queries don't work anymore. I'm new to SPARQL so I'm not sure what is wrong. The error is: Virtuoso 37000 Error SP030: SPARQL compiler, line 12: syntax error at '(' before '?p' SPARQL query: #output-format:text/html define sql:signal-void-variables 1 define input:default-graph-uri <dbpedia.org> PREFIX dbpedia: <dbpedia.org/resource> PREFIX dbpedia-owl: <ttps://www.w3.org/2002/07/owl#> select ?similar where { values ?movie { dbpedia:The_Matrix } ?similar ?p ?o ; a dbpedia-owl:Film . ?movie ?p ?o . } group by ?similar ?movie having count(?p) > 35 - Hibiscus

With DBpedia's newer prefixes (dbr: and dbo:), and with the having condition wrapped in parentheses, Joshua Taylor's query would be:

select ?similar (count(?p) as ?similarity) where { 
  values ?movie { dbr:The_Matrix }
  ?similar ?p ?o ; a dbo:Film .
  ?movie   ?p ?o .
}
group by ?similar ?movie
having (count(?p) > 35)
order by desc(?similarity)
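
For endpoints that do not predefine these prefixes, a self-contained variant might declare them explicitly. The sketch below also fetches an English label for each film, along the lines suggested in the comments on the earlier answer. The namespace IRIs are the usual DBpedia and RDFS ones; adding ?label to group by is one way to make it selectable (an assumption about how you want the rows grouped), and a film with more than one English label would appear once per label. It has not been checked against the live endpoint, so the counts may differ from those shown earlier.

```sparql
prefix dbr:  <http://dbpedia.org/resource/>
prefix dbo:  <http://dbpedia.org/ontology/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?similar ?label (count(?p) as ?similarity) where {
  values ?movie { dbr:The_Matrix }
  ?similar ?p ?o ; a dbo:Film ; rdfs:label ?label .
  ?movie   ?p ?o .
  # Keep only English labels
  filter langMatches(lang(?label), "en")
}
group by ?similar ?movie ?label
having (count(?p) > 35)
order by desc(?similarity)
```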

Eustacia answered 27/10, 2019 at 13:24 Comment(0)
