JOIN versus EXISTS performance
Asked Answered
A

1

8

Generally speaking, is there a performance difference between using a JOIN to select rows versus an EXISTS where clause? Searching various Q&A web sites suggests that a join is more efficient, but I recall learning a long time ago that EXISTS was better in Teradata.

I do see other SO answers, like this and this, but my question is specific to Teradata.

For example, consider these two queries, which return identical results:

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
join     MY_TARGET_TABLE x
on       x.srv_accs_id=svc.srv_accs_id
group by 1
order by 1

-and-

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
where exists(
    select 1
    from   MY_TARGET_TABLE x
    where  x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1

The primary index (unique) on both tables is 'srv_accs_id'. MY_BASE_TABLE is rather large (200 million rows) and MY_TARGET_TABLE relatively small (200,000 rows).

There is one significant difference in the EXPLAIN plans: The first says the two tables are joined "by way of a RowHash match scan" and the second says "by way of an all-rows scan". Both say it is "an all-AMPs JOIN step" and the total estimated time is identical (0.32 seconds).

Both queries perform the same (I'm using Teradata 13.10).

A similar experiment to find non-matches comparing a LEFT OUTER JOIN with a corresponding IS NULL where clause to a NOT EXISTS sub-query does show a performance difference:

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
left outer join MY_TARGET_TABLE x
on       x.srv_accs_id=svc.srv_accs_id
where    x.srv_accs_id is null
group by 1
order by 1

-and-

select   svc.ltv_scr, count(*) as freq
from     MY_BASE_TABLE svc
where not exists(
    select 1
    from   MY_TARGET_TABLE x
    where  x.srv_accs_id=svc.srv_accs_id)
group by 1
order by 1 

The second query plan is faster (2.21 versus 2.14 seconds as described by EXPLAIN).

My example may be too trivial to see a difference; I'm just looking for coding guidance.

Architectural answered 3/12, 2012 at 22:35 Comment(2)
Do realise that the first set of queries (EXISTS) fetches .1% of the rows, the second set (NOT EXISTS) will fetch 99.9% of the rows. In terms of the number of fetched pages, the first query can use an index to minimise the number of page fetches, in the second one effectively all pages will be needed. That's why the planner chooses to scan all the pages in the second case. The difference between left join ... NULL and NOT EXISTS does not look too significant to me. (order? cache priming?) Disclaimer: I don't know teradata.Introgression
Yes, I realize that; I was just tried to make up an example to illustrate my question. Also, I just sat through a Teradata presentation on EXPLAIN and have learned that those are relative "cost" estimates and have nothing to do with "time".Architectural
S
7

NOT EXISTS is more efficient than using a LEFT OUTER JOIN to exclude records that are missing from the participating table using an IS NULL condition because the optimizer will elect to use an EXCLUSION MERGE JOIN with the NOT EXISTS predicate.

While your second test did not yield impressive results for the data sets you were using the performance increase from NOT EXISTS over a LEFT JOIN is very noticeable as your data volumes increase. Keep in mind that the tables will need to be hash distributed by the columns that participate in the NOT EXISTS join just like they would in the LEFT JOIN. Therefore, data skew can impact the performance of the EXCLUSION MERGE JOIN.

EDIT:

Typically, I would defer to EXISTS as a replacement for IN instead of using it for re-writing a join solution. This is especially true when the column(s) participating in the logical comparison can be NULL. That's not to say you couldn't use EXISTS in place of an INNER JOIN. Instead of an EXCLUSION JOIN you will end up with an INCLUSION JOIN. The INNER JOIN is in essence an inclusion join to begin with. I'm sure there are some nuances that I am overlooking but you can find those in the manuals if you wish to take the time to read them.

Swimming answered 4/12, 2012 at 18:35 Comment(4)
Great answer with respect to the second part of my question on LEFT OUTER JOIN. Any comment on the first part (JOIN versus EXISTS)?Architectural
Personally , I think this is a lousy anwer. Any sane query analyser would detect the anti-join. (if there would be a difference in timing, the plan generator would be very wrong). btw: in the old time, "not exists" would always be faster, just because it came first, and the other variants (though equivalent) would perform worse. The "JOIN...NULL" form is relatively new. The IN variant has always been a loser, because duplicates have to be avoided by the subquery (apart from the NULL problem)Introgression
Thanks for the clarification to my response @wildplasser. (Hence the upvote on your comment.) I failed to mention the impact that duplicates could introduce when using the IN clause and the fact that NOT EXISTS predates the use of JOIN ... NULL in the SQL lexicon.Swimming
There is an additional "human" advantage to the EXISTS syntax: it does not pollute the outer query with its local rangetable (but, beeing a correlated subquery, it does have to refer to the outer query, of course) This makes SELECT * FROM a WHERE EXISTS (SELECT * FROM B where b.xx= a.yy) possible; The select * would otherwise contain the entire rangetable, including duplicated names for joined columns.Introgression

© 2022 - 2024 — McMap. All rights reserved.