Searching for items in a many-to-many relationship
Asked Answered
S

3

7

I'm currently writing an application that allows one to store images, and then tag these images. I'm using Python and the Peewee ORM (http://charlesleifer.com/docs/peewee/), which is very similar to Django's ORM.

My data model looks like this (simplified):

class Image(BaseModel):
    key = CharField()

class Tag(BaseModel):
    tag = CharField()

class TagRelationship(BaseModel):
    relImage = ForeignKeyField(Image)
    relTag   = ForeignKeyField(Tag)

Now, I understand conceptually how to query for all Images that have a given set of tags:

SELECT Image.key
  FROM Image
INNER JOIN TagRelationship
    ON Image.ID = TagRelationship.ImageID
INNER JOIN Tag
    ON TagRelationship.TagID = Tag.ID
 WHERE Tag.tag
       IN ( 'A' , 'B' )     -- list of multiple tags
GROUP BY Image.key
HAVING COUNT(*) = 2         -- where 2 == the number of tags specified, above

However, I also want to be able to do more complex searches. Specifically, I'd like to be able to specify a list of "all tags" - i.e. an image must have all of the specified tags to be returned, along with a list of "any" and a list of "none".

EDIT: I'd like to clarify this a bit. Specifically, the above query is an "all tags"-style query. It returns Images that have all the given tags. I want to be able to specify something like: "Give me all images that have the tags (green, mountain), any one of the tags (background, landscape) but not the tags (digital, drawing)".

Now, ideally, I'd like this to be one SQL query, because pagination then becomes very easy with LIMIT and OFFSET. I've actually got an implementation working whereby I just load everything into Python sets and then use the various intersection operators. What I'm wondering is if there's a method of doing this all at once?

Also, for those interested, I've emailed the author of Peewee about how to represent the above query using Peewee, and he responded with the following solution:

Image.select(['key']).group_by('key').join(TagRelationship).join(Tag).where(tag__in=['tag1', 'tag2']).having('count(*) = 2')

Or, alternatively, a shorter version:

Image.filter(tagrelationship_set__relTag__tag__in=['tag1', 'tag2']).group_by(Image).having('count(*) = 2')

Thanks in advance for your time.

Sparklesparkler answered 16/1, 2012 at 6:41 Comment(8)
If I understand correctly, you have the answer you need in peewee code, but you want to know how to do the same thing in straight sql?Doggett
can you please explain the part "However, I also want to be able to do more complex searches. Specifically, I'd like to be able to specify a list of "all tags" - i.e. an image must have all of the specified tags to be returned, along with a list of "any" and a list of "none"."Guyenne
@Guyenne Specifically, the above query is an "all tags"-style query. It returns Images that have all the given tags. I want to be able to specify something like: "Give me all images that have the tags (green, mountain), any one of the tags (background, landscape) but not the tags (digital, drawing)". Please let me know if that's not clear.Sparklesparkler
Oh, that clarifies a lot. I thought the question above was a little ambiguous. That is possible though. Would it be okay if it had green, mountain and landscape?Roeder
@Roeder - Yeah, that'd be alright. Any of "green, mountain, landscape" or "green, mountain, background", with any other tags that aren't "digital" or "drawing" would work.Sparklesparkler
Okay so you want ((A and B) or (C and D)) but not (E and F)?Roeder
@Roeder - Using Boolean logic: (a AND b) AND (c OR d) AND (NOT (e OR f))Sparklesparkler
@Roeder can you check my updated answer?Guyenne
T
5
SELECT Image.key
  FROM Image
  JOIN TagRelationship
    ON Image.ID = TagRelationship.ImageID
  JOIN Tag
    ON TagRelationship.TagID = Tag.ID
 GROUP BY Image.key
HAVING SUM(Tag.tag IN (mandatory tags )) = N  /*the number of mandatory tags*/
   AND SUM(Tag.tag IN (optional tags  )) > 0
   AND SUM(Tag.tag IN (prohibited tags)) = 0

UPDATE

A more universally accepted version of the above query (converts the boolean results of the IN predicates into integers using CASE expressions):

SELECT Image.key
  FROM Image
  JOIN TagRelationship
    ON Image.ID = TagRelationship.ImageID
  JOIN Tag
    ON TagRelationship.TagID = Tag.ID
 GROUP BY Image.key
HAVING SUM(CASE WHEN Tag.tag IN (mandatory tags ) THEN 1 ELSE 0 END) = N  /*the number of mandatory tags*/
   AND SUM(CASE WHEN Tag.tag IN (optional tags  ) THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN Tag.tag IN (prohibited tags) THEN 1 ELSE 0 END) = 0

or with COUNTs instead of SUMs:

SELECT Image.key
  FROM Image
  JOIN TagRelationship
    ON Image.ID = TagRelationship.ImageID
  JOIN Tag
    ON TagRelationship.TagID = Tag.ID
 GROUP BY Image.key
HAVING COUNT(CASE WHEN Tag.tag IN (mandatory tags ) THEN 1 END) = N  /*the number of mandatory tags*/
   AND COUNT(CASE WHEN Tag.tag IN (optional tags  ) THEN 1 END) > 0
   AND COUNT(CASE WHEN Tag.tag IN (prohibited tags) THEN 1 END) = 0
Tradein answered 16/1, 2012 at 7:58 Comment(5)
Will the having statement work correctly? Since you do a GROUP BY, you will only have 1 value for tag. Since it's inside SUM, will it iterate over all tags and preform the IN check? This is much more elegant than mine if it works.Roeder
I thought after grouping by any attribute, you were unable to preform analysis on individual rows within groups such as Tag.tag IN (mandatory tags )) = N. I might have to set up some tables and test out this query for myself. I know aggregate operations such as SUM will work, but I don't know if Tag.tag IN (mandatory tags )) = N will work. Not because I'm doubting you, but because I have not seen or done this. It's new to me.Roeder
@JustinDanielson: Yes, you can have expressions referencing non-GROUP BY columns inside aggregate functions, if that's what you are asking about. The results of the expressions get aggregated accordingly. You know that you can SUM(Value), but in the same way you can SUM(Value * Qty) or SUM(Value > 10). In MySQL flavour of SQL, which what my query is, IN can be viewed as just another operator, like * or +, only it returns a boolean (which is implicitly converted to an integer), like < or = does too.Tradein
Thank you for explaining that with details. Glad I learned that. Will save me a lot of time in the event I have to write another query like this one. Check out mine, I did it the hard way. :[Roeder
Hi there - thanks for this! It works really well. Just one small change - the ">" in the second condition should be ">=", to cover the case that all the tags match. Other than that, thanks again. I've marked this as an answer :)Sparklesparkler
R
2

The top half gets the words that match the mandatory tags. The bottom half does the tags where at least 1 must be present. The bottom query doesn't have a GROUP BY because I want to know if an image appears twice. If it does, it has both background and landscape. The ORDER BY count(*) will make pictures with BOTH background and landscape tags to appear at the top. So green, mountain, background landscape will be the most relevant. Then green, mountain, background OR landscape pictures.

SELECT Image.key, count(*) AS 'relevance' 
FROM
     (SELECT Image.key
      FROM
        --good image candidates
        (SELECT Image.key
         FROM Image
         WHERE Image.key NOT IN 
            --Bad Images
            (SELECT DISTINCT(Image.key)   --Will reduce size of set, remove duplicates
             FROM Image
             INNER JOIN TagRelationship
                ON Image.ID = TagRelationship.ImageID
             INNER JOIN Tag
                ON TagRelationship.TagID = Tag.ID
              WHERE Tag.tag
                   IN ('digital', 'drawing' )))
    INNER JOIN TagRelationship
        ON Image.ID = TagRelationship.ImageID
    INNER JOIN Tag
        ON TagRelationship.TagID = Tag.ID
    WHERE Tag.tag
           IN ('green', 'mountain')
    GROUP BY Image.key
    HAVING COUNT(*) = count('green', 'mountain')
    --we need green AND mountain

    UNION ALL

    --Get all images with one of the following 2 tags
    SELECT * 
    FROM
        (SELECT Image.key
         FROM Image
         INNER JOIN TagRelationship
             ON Image.ID = TagRelationship.ImageID
         INNER JOIN Tag
             ON TagRelationship.TagID = Tag.ID
          WHERE Tag.tag
             IN ( 'background' , 'landscape' ))
)
GROUP BY Image.key
ORDER BY relevance DESC
Roeder answered 16/1, 2012 at 7:8 Comment(0)
G
0

Following query must return all images which are tagged with both ('A' and 'B') and ('C' OR 'D') but not 'E' and 'F'

SELECT Image.key
FROM Image
INNER JOIN TagRelationship
    ON Image.ID = TagRelationship.ImageID
INNER JOIN Tag tag1
    ON TagRelationship.TagID = tag1.ID
INNER JOIN Tag tag2
    ON TagRelationship.TagID = tag2.ID
WHERE tag1.tag
    IN ( 'A' , 'B' )
AND tag2.tag NOT IN ('E', 'F')

GROUP BY Image.key
HAVING COUNT(*) = 2 

UNION

SELECT Image.key
FROM Image
INNER JOIN TagRelationship
    ON Image.ID = TagRelationship.ImageID
INNER JOIN Tag tag1
    ON TagRelationship.TagID = tag1.ID
INNER JOIN Tag tag2
    ON TagRelationship.TagID = tag2.ID
WHERE tag1.tag
   IN ( 'C' , 'D' )
AND tag2.tag NOT IN ('E', 'F')
Guyenne answered 16/1, 2012 at 7:5 Comment(4)
If Tag.tag is in ('A', 'B'), it can never be found in ('E', 'F'). Both NOT IN conditions look redundant in your particular query.Tradein
If he wants more than 2 tags, you'll have to do another join for tag3. This will quickly get out of hand and very slow if he wants many tags.Roeder
I think this query won't return any results. Tag1 == Tag2 is possible because Tag1 != Tag2 is not being enforced. So you could have a situation where You get Image.key value 1 with tags A,B,C,D and not E,F. But you'll get tag1 = A, tag2 = A, as well as A,B A,C A,D B,A and all other permutations. So the count will end up being 8Roeder
If an image has tags A,E,F and ID 1. You will get rows of 1AA, 1AA, 1AE, 1EA, 1AF, 1EF, 1EE, 1FF, 1FE and the count(*) will be 2 because 1AA and 1AA will be accepted.Roeder

© 2022 - 2024 — McMap. All rights reserved.