Join between mapping (junction) table with specific cardinality

B

5

6

I have a simple question about the most efficient way to perform a particular join.

Take these three tables, real names have been changed to protect the innocent:

Table: animal

animal_id   name   ...
======================
1           bunny
2           bear
3           cat
4           mouse

Table: tags

tag_id     tag
==================
1          fluffy
2          brown
3          cute
4          small

Mapping Table: animal_tag

animal_id   tag_id
==================
1           1
1           2
1           3
2           2
3           4
4           2

I want to find all animals that are tagged as 'fluffy', 'brown', and 'cute'. That is to say that the animal must be tagged with all three. In reality, the number of required tags can vary, but should be irrelevant for this discussion. This is the query I came up with:

SELECT * FROM animal
JOIN (
      SELECT at.animal_id FROM animal_tag at
      WHERE at.tag_id IN (
                          SELECT tg.tag_id FROM tag tg
                          WHERE tg.tag='fluffy' OR tg.tag='brown' OR tg.tag='cute'
                          )
      GROUP BY at.animal_id HAVING COUNT(at.tag_id)=3
      ) AS jt
ON animal.animal_id=jt.animal_id

On a table with thousands 'animals' and and hundreds of 'tags', this query performs respectably ... 10s of milliseconds. However, when i look at the query plan (Apache Derby is the DB), the optimizer's estimated cost is pretty high (9945.12) and the plan pretty extensive. For a query this "simple" I usually try to get query plans with an estimated cost of single or double digits.

So my question is, is there a better way to perform this query? Seems like a simple query, but I've been stumped coming up with anything better.

Barrel answered 7/2, 2012 at 2:8 Comment(6)

i think you should use AND instead of OR in WHERE tg.tag='fluffy' OR tg.tag='brown' OR tg.tag='cute' – Elledge 7/2, 2012 at 2:42

@johntotetwoo No single row in tag matches more than a single value, so using AND would produce no matching rows. – Robers 7/2, 2012 at 3:0

@BrankoDimitrijevic you're right! my bad. what am i thinking. – Elledge 7/2, 2012 at 3:4

Have a look at this article about relational division. It should give you a couple of more things to try out. – Disassociate 7/2, 2012 at 3:18

@MikaelEriksson thanks for that excellent reference! – Barrel 7/2, 2012 at 4:14

Relational division is indeed what you are asking about, known as "the supplier who supplies all parts". Here's another useful article: On Making Relational Division Comprehensible. – Sudatory 7/2, 2012 at 8:55

B

1

First of all, a huge thanks to everyone who jumped in on this. Ultimately the answer is, as referenced by several commenters, relational division.

While I did take a course in Codd's relational data model many moons ago, the course like many, did not really cover relational division. Unwittingly, my original query is actually an application of Relational Division.

Referring to a slide 26-27 in this presentation on relational division, my query applies the technique of comparing set cardinalities. I tried some of the other methods mentioned for applying relational division but, at least in my case, the counting method provides the fastest run-time. I encourage anyone interested in this problem to read the aforementioned slide stack, as well as the article referenced on this page by Mikael Eriksson. Again, thanks to everyone.

Barrel answered 8/2, 2012 at 0:16 Comment(0)

S

1

You could create a temp table using DECLARE GLOBAL TEMPORARY TABLE And then do an INNER JOIN to eliminate the "WHERE IN". Working with Joins which are set based is usually far more efficient than Where statements that have to be evaluated for each row.

Soph answered 7/2, 2012 at 2:21 Comment(1)

in practice the query inside of the WHERE IN is optimized by the database such that it is only run once, because it has no dependencies on the outer query. In addition because it only returns (in this case 3 rows, or a small number in practice), the overhead of creating and selecting into a temporary table is greater than the original query cost. – Barrel 7/2, 2012 at 4:16

E

1

try this:

SELECT DISTINCT f.Animal_ID, g.Name
FROM Animal f INNER JOIN 
    (SELECT a.Animal_ID, a.Name, COUNT(*) as iCount
     FROM   Animal a INNER JOIN Animal_Tag b
                  ON a.Animal_ID = b.animal_ID
                     INNER JOIN Tags c
                  On b.tag_ID = c.tag_ID
    WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here
    GROUP BY a.Animal_ID) g
WHERE g.iCount = 3 -- No. of tags

UPDATE

    SELECT DISTINCT a.Animal_ID, a.Name, COUNT(*) as iCount
    FROM    Animal a INNER JOIN Animal_Tag b
                  ON a.Animal_ID = b.animal_ID
                     INNER JOIN Tags c
                  On b.tag_ID = c.tag_ID
    WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here
    GROUP BY Animal_ID
    HAVING  iCount = 3 -- No. of tags

Elledge answered 7/2, 2012 at 2:37 Comment(4)

Thanks, I appreciate the effort. This query is correct in that it produces the same result as my query. Unfortunately, when plugging it into our code it has a slightly higher estimated cost and slightly longer run time (our query is 0.28s, yours is 0.32s). Basically equivalent in terms of performance (at least with our dataset). Thanks again. – Barrel 7/2, 2012 at 3:13

@Barrel i updated that query. does it decrease that estimated cost? – Elledge 7/2, 2012 at 3:21

@johntotewoo I don't know why, but Derby doesn't like that query. Error: Column reference 'A.NAME' is invalid, or is part of an invalid expression. For a SELECT list with a GROUP BY, the columns and expressions being selected may only contain valid grouping expressions and valid aggregate expressions. – Barrel 7/2, 2012 at 3:47

I modified it a bit to make it work in Derby, but the estimated cost and execution time is still slightly higher. I guess my original query is about as good as I can expect. Thanks for the effort. – Barrel 7/2, 2012 at 3:53

J

1

Give this a spin:

SELECT a.*
FROM animal a
INNER JOIN 
  ( 
    SELECT at.animal_id
    FROM tag t
    INNER JOIN animal_tag at ON at.tag_id = t.tag_id
    WHERE tag IN ('fluffy', 'brown', 'cute')
    GROUP BY at.animal_id
    HAVING count(*) = 3
  ) f ON  a.animal_id = f.animal_id

Here is another option, just for the fun of it:

SELECT a.*
FROM animal a
INNER JOIN animal_tag at1 on at1.animal_id = a.animal_id
INNER JOIN tag t1 on t1.tag_id = at1.tag_id
INNER JOIN animal_tag at2 on at2.animal_id = a.animal_id
INNER JOIN tag t2 on t2.tag_id = at2.tag_id
INNER JOIN animal_tag at3 on at3.animal_id = a.animal_id
INNER JOIN tag t3 on t3.tag_id = at3.tag_id
WHERE t1.tag = 'fluffy' AND t2.tag = 'brown' AND t3.tag = 'cute'

I don't really expect this last option to do well... the other options avoid needing to go back to the tag table multiple times to resolve a tag name from the id... but you never know what the query optimizer will do until you try it.

Jacqulynjactation answered 7/2, 2012 at 3:59 Comment(4)

Excellent. The first query is not an option with Apache Derby as it does not support the WITH statement. But the second option is interesting. It comes in with a lower optimizer cost (5966.82) than my original, but in practice the run-time is about 10% longer (averaged over 10 runs). – Barrel 7/2, 2012 at 4:13

@Barrel - rewrote the first query to skip the cte. – Jacqulynjactation 7/2, 2012 at 4:18

Interestingly your revised first query compiles to exactly the same access plan as my query, including an exact estimated cost (9945.12). – Barrel 7/2, 2012 at 4:25

@Barrel - The first version provided by Joel Coehoorn is what I would have suggested trying as well. If you could figure out the tag id's before hand you could remove the tag table from the query and do the where clause in the sub query like where tag_id in (1, 2, 3) – Disassociate 7/2, 2012 at 5:1

B

1

First of all, a huge thanks to everyone who jumped in on this. Ultimately the answer is, as referenced by several commenters, relational division.

While I did take a course in Codd's relational data model many moons ago, the course like many, did not really cover relational division. Unwittingly, my original query is actually an application of Relational Division.

Referring to a slide 26-27 in this presentation on relational division, my query applies the technique of comparing set cardinalities. I tried some of the other methods mentioned for applying relational division but, at least in my case, the counting method provides the fastest run-time. I encourage anyone interested in this problem to read the aforementioned slide stack, as well as the article referenced on this page by Mikael Eriksson. Again, thanks to everyone.

Barrel answered 8/2, 2012 at 0:16 Comment(0)

B

0

I was wondering how bad it would be to use a relational division there. Can you please give it a run? I know this will take more, but I'm intrigued by how much :) If you can provide both the estimated cost and the time, it would be great.

select a2.animal_id, a2.animal_name from animal2 a2
where not exists (
    select * from animal1 a1, tags t1
    where not exists (
        select * from animal_tag at1
        where at1.animal_id = a1.animal_id and at1.animal_tag = t1.tag_id
    ) and a2.animal_id = a1.animal_id and t1.tag in ('fluffy', 'brown', 'cute')
)

Now looking for a fast query, I can't think in any faster than john's or yours. Actually john's might be a little slower than yours because he's performing unnencesary operations (remove distinct and remove count(*) from select):

SELECT a.Animal_ID, a.Name FROM Animal a
INNER JOIN Animal_Tag b ON a.Animal_ID = b.animal_ID
INNER JOIN Tags c On b.tag_ID = c.tag_ID
WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here
GROUP BY Animal_ID, a.Name
HAVING count(*) = 3 -- No. of tags

This should be as fast as yours.

PS: Is there any way to remove that damned 3 without duplicating the where clause? My brain is boiling :)

Beardsley answered 7/2, 2012 at 5:29 Comment(1)

CTE's would allow you to remove the redundancy because you could reference the CTE twice in the main query (the 2nd time would be a count(*) query to get the number). But Derby doesn't support them. – Jacqulynjactation 7/2, 2012 at 15:32

Recommended topics

Hot tags