sql joins as venn diagram
Asked Answered
D

5

89

I've had trouble understanding joins in sql and came upon this image which I think might help me. The problem is that I don't fully understand it. For example, the join in the top right corner of the image, which colors the full B circle red and but only the overlap from A. The image makes it seem like circle B is the primary focus of the sql statement, but the sql statement itself, by starting with A (select from A, join B), conveys the opposite impression to me, namely that A would be the focus of the sql statement.

Similarly, the image below that only includes data from the B circle, so why is A included at all in the join statement?

Question: Working clockwise from the top right and finishing in the center, can someone provide more information about the representation of each sql image, explaining

a) why a join would be necessary in each case (for example, especially in situations where no data's taken from A or B i.e. where only A or B but not both is colored)

b) and any other detail that would clarify why the image is a good representation of the sql

sql join diagram

Divine answered 21/12, 2012 at 21:22 Comment(7)
This might be more helpful of a diagram Visual Explanation of Joins, it includes table data and there is also a description about each join type.Karlenekarlens
What's important in the top right diagram is that it's a RIGHT JOIN, so all of TableB's data (the table to the right of the JOIN statement) will be included and only data from TableA that matches a Key in TableB will be included.Selfexistent
The Venn diagram problem is that once you understand the joins it seems at first that they illustrate, but if you actually try to explain what such a diagram means then you find that it is complex, involves an ON condition that isn't mentioned, involves rows not in either operand and is interpreted differently for inner vs outer joins. (In particular, although they look like Venn diagrams they're not interpreted as Venn diagrams), Ie they don't illustrate let alone explain.Milliner
PS See my many comments at What is the difference between “INNER JOIN” and “OUTER JOIN”?. There is a certain Venn diagram relevant to illustrating how left, right, full & inner joins are related.Milliner
Say NO to Venn Diagrams When Explaining JOINsHines
@Karlenekarlens Re your comment on this question: That blog is more classic confused abuse of Venn diagrams & the author disavows that very blog post in its comments. (Although other claims in the disavowel show further confusions by them.) See my comments on the question & elsewhere at the link in my previous comment. You are only promoting more bad presentations & confusions by giving that link.Milliner
Venn Diagram for Natural JoinMilliner
L
15

I think your main underlying confusion is that when (for example) only A is highlighted in red, you're taking that to mean "the query only returns data from A", but in fact it means "the query only returns data for those cases where A has a record". The query might still contain data from B. (For cases where B does not have a record, the query will substitute NULL.)

Similarly, the image below that only includes data from the B circle, so why is A included at all in the join statement?

If you mean — the image where A is entirely in white, and there's a red crescent-shape for the part of B that doesn't overlap with A, then: the reason that A appears in the query is, A is how it finds the records in B that need to be excluded. (If A didn't appear in the query, then Venn diagram wouldn't have A, it would only show B, and there'd be no way to distinguish the desired records from the unwanted ones.)

The image makes it seem like circle B is the primary focus of the sql statement, but the sql statement itself, by starting with A (select from A, join B), conveys the opposite impression to me, namely that A would be the focus of the sql statement.

Quite right. For this reason, RIGHT JOINs are relatively uncommon; although a query that uses a LEFT JOIN can nearly always be re-ordered to use a RIGHT JOIN instead (and vice versa), usually people will write their queries with LEFT JOIN and not with RIGHT JOIN.

Liquidambar answered 21/12, 2012 at 21:30 Comment(0)
M
85

I agree with Cade about the limitations of Venn diagrams here. A more apposite visual representation might be this.

Tables

Tables

SELECT A.Colour, B.Colour FROM A CROSS JOIN B SQL Fiddle

The cross join (or cartesian product) produces a result with every combination of the rows from the two tables. Each table has 4 rows so this produces 16 rows in the result.

Cross Join

SELECT A.Colour, B.Colour FROM A INNER JOIN B ON A.Colour = B.Colour SQL Fiddle

The inner join logically returns all rows from the cross join that match the join condition. In this case five do.

Inner Join

SELECT A.Colour, B.Colour FROM A INNER JOIN B ON A.Colour NOT IN ('Green','Blue') SQL Fiddle

The inner join condition need not necessarily be an equality condition and it need not reference columns from both (or even either) of the tables. Evaluating A.Colour NOT IN ('Green','Blue') on each row of the cross join returns.

inner 2

An inner join condition of 1=1 would evaluate to true for every row in the cross join so the two are equivalent (SQL Fiddle).

SELECT A.Colour, B.Colour FROM A LEFT OUTER JOIN B ON A.Colour = B.Colour SQL Fiddle

Outer Joins are logically evaluated in the same way as inner joins except that if a row from the left table (for a left join) does not join with any rows from the right hand table at all it is preserved in the result with NULL values for the right hand columns.

LOJ

SELECT A.Colour, B.Colour FROM A LEFT OUTER JOIN B ON A.Colour = B.Colour WHERE B.Colour IS NULL SQL Fiddle

This simply restricts the previous result to only return the rows where B.Colour IS NULL. In this particular case these will be the rows that were preserved as they had no match in the right hand table and the query returns the single red row not matched in table B. This is known as an anti semi join.

It is important to select a column for the IS NULL test that is either not nullable or for which the join condition ensures that any NULL values will be excluded in order for this pattern to work correctly and avoid just bringing back rows which happen to have a NULL value for that column in addition to the un matched rows.

loj is null

SELECT A.Colour, B.Colour FROM A RIGHT OUTER JOIN B ON A.Colour = B.Colour SQL Fiddle

Right outer joins act similarly to left outer joins except they preserve non matching rows from the right table and null extend the left hand columns.

ROJ

SELECT A.Colour, B.Colour FROM A FULL OUTER JOIN B ON A.Colour = B.Colour SQL Fiddle

Full outer joins combine the behaviour of left and right joins and preserve the non matching rows from both the left and the right tables.

FOJ

Moribund answered 23/12, 2012 at 14:46 Comment(6)
+1 for apposite visual representation, and for increasing my vocabulary.Bloodshed
@MartinSmith This is a great explanation, but I am having trouble with the SQL fiddle. Nothing is listed in the schema. I have never used a SQL fiddle before.Labourer
@onebree Looks like SQL Fiddle doesn't preserve scripts indefinitely :-(Moribund
the best explanation, thank you! there should be some law against venn diagrams in explaining joins - I've spent so much time deceived by them!Farinose
Venn diagram was so confusing...absolutely useless for explaining SQL JOINS to rookies. They just treat the entire process as a set - like intersection, union, complement, etc but they absolutely do not elaborate or even hint implementation details (leaving the details to imagination of the student). - These diagrams with colors clear all the confusion and are very intuitive. I think everything that is taught should be this clear.Sponger
This is better, Say NO to Venn Diagrams When Explaining JOINsHines
L
15

I think your main underlying confusion is that when (for example) only A is highlighted in red, you're taking that to mean "the query only returns data from A", but in fact it means "the query only returns data for those cases where A has a record". The query might still contain data from B. (For cases where B does not have a record, the query will substitute NULL.)

Similarly, the image below that only includes data from the B circle, so why is A included at all in the join statement?

If you mean — the image where A is entirely in white, and there's a red crescent-shape for the part of B that doesn't overlap with A, then: the reason that A appears in the query is, A is how it finds the records in B that need to be excluded. (If A didn't appear in the query, then Venn diagram wouldn't have A, it would only show B, and there'd be no way to distinguish the desired records from the unwanted ones.)

The image makes it seem like circle B is the primary focus of the sql statement, but the sql statement itself, by starting with A (select from A, join B), conveys the opposite impression to me, namely that A would be the focus of the sql statement.

Quite right. For this reason, RIGHT JOINs are relatively uncommon; although a query that uses a LEFT JOIN can nearly always be re-ordered to use a RIGHT JOIN instead (and vice versa), usually people will write their queries with LEFT JOIN and not with RIGHT JOIN.

Liquidambar answered 21/12, 2012 at 21:30 Comment(0)
C
13

Venn diagrams are suitable for representing set operations such as UNION, INTERSECTS, EXCEPT etc.

To the extent only that those set operations like EXCEPT are simulated with things like LEFT JOIN WHERE rhs.KEY is NULL, this diagram is accurate.

Otherwise it is misleading. For instance, any join can cause rows to multiply if the join criteria are not 1:1. But sets are only allowed to contain distinct members, so those cannot be represented as set operations.

Then there is the CROSS JOIN or INNER JOIN ON 1 = 1 - this is neither analogous to the INNER JOIN as shown in this diagram, nor can the set which is produced be really described by a Venn diagram. Not to mention all the other possible triangular joins, self and anti-joins like:

lhs INNER JOIN rhs ON rhs.VALUE < lhs.VALUE (triangular)

or

SELF self1
INNER JOIN SELF self2
    ON self2.key <> self1.key
    AND self1.type = self2.type

(self cross and anti-join to find all similar family members except yourself - self1 and self2 are the same set and the result is a proper subset)

Sticking to joins on keys may be fine for the first few minutes of a tutorial, but this can lead to a poor path for learning what joins are about. I think this is what you have found.

This idea that Venn Diagrams can represent JOINs generally this way needs to go away.

Caldwell answered 22/12, 2012 at 14:39 Comment(2)
+1. I've always really hated this use of Venn diagrams; I feel like it intuitively represents the aspects of joins that are already straightforward, and does absolutely nothing to explain the aspects of joins that genuinely confuse people.Liquidambar
There is also an appropriate Venn diagram illustrating the difference between inner/left/right/full join results. See my comment on the question.Milliner
H
-1

When you do a join, it is likely that your two tables might not match up perfectly. Specifically, there could be some rows in A that don't match up to anything in B, or duplicate rows in A that match up with a single row in B, and vice-versa.

When this happens, you have a choice:

  1. for each A, take a single B that works, if there is one. (upper left)
  2. take each pair that fully matches (discard any that are missing either A or B--center)
  3. for each B, take a single A that works, if there is one (upper right)
  4. take EVERYTHING (lower left)

Center left and right are technically joins, but pointless ones; they could probably be more efficiently written SELECT <select_list> FROM TableA A WHERE A.Key NOT IN (SELECT B.Key FROM TableB B) (or the opposite).

In direct answer to your confusion, RIGHT JOIN says "the following expression is the focus of this query". Lower right is rather strange, and I see no reason why you would want that. It returns the results from the two outer middle queries, mixed together with NULL's in all of the columns for the opposite table.

Hightower answered 21/12, 2012 at 21:31 Comment(4)
Re: "Center left and right are technically joins, but pointless ones; they could probably be more efficiently written SELECT <select_list> FROM TableA A WHERE A.Key NOT IN (SELECT B.Key FROM TableB B) (or the opposite)": I don't think that's true. There are some DBMSes where they are equivalent, and there are some DMBSes where the LEFT JOIN approach is more efficient (e.g., MySQL until 5.5 or 5.6 or so); but I'm not aware of any DBMSes where the NOT IN approach is reliably more efficient, especially if B is large. Also, the NOT IN approach only works when joining on a single field.Liquidambar
Interesting; I would have thought that it would be more efficient to check each row, as opposed to joining it and then checking if the join was successful. I can totally see how multiple keys could be faster that way though.Hightower
@Liquidambar - In SQL Server for non nullable columns NOT IN and NOT EXISTS generally give the same plan and this plan can be more efficient than LOJ ... NULL (sometimes that does the join then filters out those matching the IS NULL in a separate filter). I would always use NOT EXISTS in preference of NOT IN though as more efficient if nullable columns and the semantics of NOT IN with NULL or never the ones I want anyway.Moribund
I forgot about NOT EXISTS--that was what I was actually looking for, but awkwardly imitated it with the NOT IN.Hightower
D
-1

For the right join, yes the syntax can be confusing, but yes it is what it seems to be. When you say "TableA RIGHT JOIN TableB", it is indeed saying that TableB is the main table that you are referring to and TableA is just hanging on where it has matching records. This does read weird in queries, because TableA is listed first so your brain automatically assigns more priority to it, even though TableB is really the more important table in the query. For this reason, you rarely actually see right joins in real code.

So instead of A and B, lets take two things that are easy to keep track of. Supposed we have two tables for people's info, ShoeSize and IQ. You have ShoeSize info for some people, some IQ info for some people. And have a PersonID on both tables that you can join on.

Clockwise from top right (even tho this starts with some of the more complicated and contrived cases):

  • ShoeSize RIGHT JOIN IQ -> give me all of the IQ information. Include any ShoeSize information for those people if we have it.
  • ShoeSize RIGHT JOIN IQ WHERE ShowSize.PersonID = NULL -> Give me all of the IQ info, but only for people who don't have any shoe size info
  • ShoeSize FULL OUTER JOIN IQ WHERE ShoeSize.PersonID = NULL AND IQ.PersonID = NULL -> Give me the shoe size info only for people who don't have IQ info, plus the IQ info for people who don't have shoe size info
  • ShoeSize FULL OUTER JOIN IQ -> Give me everything, all shoe sizes and all IQ data. If any ShoeSizes and IQ records have the same PersonID, include them in one row.
  • ShoeSize LEFT JOIN IQ WHERE IQ.PersonID = NULL -> Give me all of the shoe size info, but only for people that don't have IQ info
  • ShoeSize LEFT JOIN IQ -> Give me all of the shoe size info. Include any IQ information for those people if we have it.
Digress answered 21/12, 2012 at 21:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.