Join vs. sub-query
Asked Answered
D

20

1113

I am an old-school MySQL user and have always preferred JOIN over sub-query. But nowadays everyone uses sub-query, and I hate it; I don't know why.

I lack the theoretical knowledge to judge for myself if there is any difference. Is a sub-query as good as a JOIN and therefore is there nothing to worry about?

Diminution answered 5/4, 2010 at 6:22 Comment(11)
Subqueries are great sometimes. They suck performance-wise in MySQL. Don't use them.Teth
I was always under the impression that sub-queries implicitly were executed as joins where available in certain DB technologies.Ebullient
Sub queries don't always suck, when joining with pretty large tables, the preferred way is to do a sub-select from that large table (limiting the number of rows) and then joining.Caustic
"nowadays everyone uses sub-query" [citation needed]Macrae
Potentially related (although much more specific): https://mcmap.net/q/54082/-subqueries-vs-joins/…Marelya
@Piskvor Considering that MySQL didn't support subqueries until 4.1, to an "old-school MySQL user", seeing just a few people using subqueries must seem like "everyone". :)Fluorene
Why do you "hate" sub-queries?Cantone
@runrig, Is your comment still true now?Japheth
@Japheth - I have no idea...haven't used mysql for years.Teth
@runrig, Oh.. Thanks anyways. What db technology do you use then?Japheth
@Starx, I can attest to the fact that, yes, it is still true. I have a sub-query on a couple of monster tables that took all night before I killed the processes. The Join completed in 0.93 seconds.Octavo
C
270

Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):

A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.

So subqueries can be slower than LEFT [OUTER] JOIN, but in my opinion their strength is slightly higher readability.

Chamade answered 20/9, 2016 at 8:9 Comment(7)
@user1735921 IMO it depends ... Generally, it's very important the readability of the code, because it is of great importance for later management of it... Let's remember the famous statement of Donald Knuth: "Premature optimization is the root of all evil (or at least most of it) in programming". However, naturally there are programming areas where performance is paramount... Ideally, when one succeeds in reconciling one with another :)Chamade
In more complex queries, I find joins much easier to read than sub-queries. sub-queries turn into a bowl of noodles in my head.Linwoodlinz
@user1735921 sure, especially when the query gets so complicated that it does the wrong thing and you spend a day fixing it... there's a balance in between, as usual.Hulsey
Joins have higher readability as well as performance but if you are using way too many joins (like 5-10) in the query it would better to think of some optimization and breaking into two queries or sub queries which will increase performance as well as readabilty, thats what I learnt from my experience.Frustum
@Frustum Only if the performance gains are worth the increase in maintenance time required in the futureDeboradeborah
My opinion Join and sub query has different syntax, so readability we can not compare, both have higher readability as long as you are good in SQL syntax. Performance is more important.Tribute
true story, join is faster than subqueryFlyn
S
1099

Sub-queries are the logically correct way to solve problems of the form, "Get facts from A, conditional on facts from B". In such instances, it makes more logical sense to stick B in a sub-query than to do a join. It is also safer, in a practical sense, since you don't have to be cautious about getting duplicated facts from A due to multiple matches against B.

Practically speaking, however, the answer usually comes down to performance. Some optimisers suck lemons when given a join vs a sub-query, and some suck lemons the other way, and this is optimiser-specific, DBMS-version-specific and query-specific.

Historically, explicit joins usually win, hence the established wisdom that joins are better, but optimisers are getting better all the time, and so I prefer to write queries first in a logically coherent way, and then restructure if performance constraints warrant this.

Surrounding answered 5/4, 2010 at 6:26 Comment(11)
Great answer. I'd also add that developers (esp. amateur ones) are not always proficient in SQL.Downstairs
+1 Looking for some logical explanation for this issue for a long time, this is only answer that seems logical to meOutfield
@Marcelo Cantos, Could you please give an example of your statement "It is also safer, in a practical sense, since you don't have to be cautious about getting duplicated facts from A due to multiple matches against B."? I found this very insightful yet a little too abstract. Thanks.Mores
@JinghuiNiu Customers who bought expensive items: select custid from cust join bought using (custid) where price > 500. If a customer bought multiple expensive items, you'll get double-ups. To fix this, select custid from cust where exists (select * from bought where custid = cust.custid and price > 500). You could use select distinct … instead, but it's often more work, either for the optimizer or the evaluator.Surrounding
Do you even need to do a join in your example though? Wouldn't SELECT custid FROM bought WHERE price > 500 suffice? Sorry if this seems pedantic, but your example is confusing me a littleSacksen
@Sacksen yes I used an oversimplified answer coz I was lazy. In a real scenario you would be pulling more columns than just custid from cust.Surrounding
Eternal dilemma between writing code for machine versus for humans. Some would argue don't worry about performance until it is an issue, I'd add : test test test the claim that joins are significantly faster than sub-queriesRandle
@Randle agreed but it's only a dilemma historically because SQL engines sucked at optimisations. It's not entirely the fault of the authors insomuch as SQL semantics make it much more difficult to rewrite expressions than it would be in a properly relational engine. Modern SQL engines are much better at this, but it has been a long hard road to get here, and there's still a long way to go, imo.Surrounding
I did this mistake in the past and some people i know keep making this mistake with Join: SELECT SUM(A.Something) AS SumSomething, ... FROM A LEFT JOIN B ON B.A_ID = A.ID So yes, subqueries are safer for non-superprofessionals.Actinometer
@MarceloCantos great answer, but I wish that you had provided a concrete example to illustrate what is meant by "Get facts from A, conditional on facts from B". YOu mean like when the WHERE conditions applies to both tables? Sorry I am not able to connect abstract to exampleMccrae
@Mccrae An example might be when A is customers, B is accounts, and you want all customers that have no accounts with a balance over $1000.Surrounding
S
406

In most cases JOINs are faster than sub-queries and it is very rare for a sub-query to be faster.

In JOINs RDBMS can create an execution plan that is better for your query and can predict what data should be loaded to be processed and save time, unlike the sub-query where it will run all the queries and load all their data to do the processing.

The good thing in sub-queries is that they are more readable than JOINs: that's why most new SQL people prefer them; it is the easy way; but when it comes to performance, JOINS are better in most cases even though they are not hard to read too.

Sextillion answered 5/4, 2010 at 6:39 Comment(15)
Yes, most databases therefore includes it as an optimization step to convert subqueries into joins when it is analyzing your query.Raw
This answer is a bit too simplified for the question that was asked. As you state: certain subqueries are ok and certain are not. The answer does not really help to distinguish the two. (also the 'very rare' really depends on your data/app).Update
@Sextillion you are totally right, although, i think JOINS are easier to read when they are well formated.Taishataisho
can you prove any of your points with documentation reference or test results?Accomplish
I made very good experiences with sub-queries that contain a back-reference to the upper query, especially when it comes to row-counts above 100,000. The thing seems to be memory usage and paging to the swap-file. A join would produce a very big amount of data, that may not fit into memory and must be paged into the swap-file. Whenever this is the case the query-times of small sub-selects like select * from a where a.x = (select b.x form b where b.id = a.id) is extremely small compared to a join. This is a very specific problem, but in some cases it brings you from hours to minutes.Overwrought
I'm experienced with Oracle and I can say, sub-queries are much better on large tables if you don't have any filtering or sorting on them.Copeck
I've similar experience with SQLServer. Sub-queries are generally much faster.Draconic
I have read that MySQL specifically doesn't handle nested subqueries well. Postgres is apparently better.Tellurion
I was told to prefer joins and leave to the optimization to SQL Server as SQL Server architects spend much of their time working on ways to optimize joins. HOWEVER, in practice I found that approach often yielded terrible results and, IMHO, it is an abdication of responsibility. Joins have their own special dangers and risks which are too often ignored (until reality bites). Each approach has strengths and weaknesses; experiment, measurement and experience will ultimately be your best guide for any give situation.Semiliterate
This answer was written 8 years ago. Is this still true today (as of March 1 2018)?Howler
@Raw Which DBMSs can do such an optimization?Sorority
@Sorority Check the query plan for your specific query to see if your database has done the right optimization. It depends on more factors than just the DBMS supporting it, as some of the other answers states.Raw
Last year I started working with Postgre. I started a query with a join and for some reason I reworked it as a sub-query and I showed the result to my colleague because I was very surprised. The sub query one was much faster.Highly
sub-query is really much faster than joins if table has big data (say millions). Some ppl treat sub-query as anti-pattern but please, speed is one of the most important factors in majority cases.Bushweller
A lot of introductions to SQL also include the blanket statement that "joins are slow / computationally expensive" without including that subqueries often are much slower. The important thing is that subqueries and joins are not equivalent and you should try both ways.Handyman
C
270

Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):

A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.

So subqueries can be slower than LEFT [OUTER] JOIN, but in my opinion their strength is slightly higher readability.

Chamade answered 20/9, 2016 at 8:9 Comment(7)
@user1735921 IMO it depends ... Generally, it's very important the readability of the code, because it is of great importance for later management of it... Let's remember the famous statement of Donald Knuth: "Premature optimization is the root of all evil (or at least most of it) in programming". However, naturally there are programming areas where performance is paramount... Ideally, when one succeeds in reconciling one with another :)Chamade
In more complex queries, I find joins much easier to read than sub-queries. sub-queries turn into a bowl of noodles in my head.Linwoodlinz
@user1735921 sure, especially when the query gets so complicated that it does the wrong thing and you spend a day fixing it... there's a balance in between, as usual.Hulsey
Joins have higher readability as well as performance but if you are using way too many joins (like 5-10) in the query it would better to think of some optimization and breaking into two queries or sub queries which will increase performance as well as readabilty, thats what I learnt from my experience.Frustum
@Frustum Only if the performance gains are worth the increase in maintenance time required in the futureDeboradeborah
My opinion Join and sub query has different syntax, so readability we can not compare, both have higher readability as long as you are good in SQL syntax. Performance is more important.Tribute
true story, join is faster than subqueryFlyn
P
201

In the year 2010 I would have joined the author of this questions and would have strongly voted for JOIN, but with much more experience (especially in MySQL) I can state: Yes subqueries can be better. I've read multiple answers here; some stated subqueries are faster, but it lacked a good explanation. I hope I can provide one with this (very) late answer:

First of all, let me say the most important: There are different forms of sub-queries

And the second important statement: Size matters

If you use sub-queries, you should be aware of how the DB-Server executes the sub-query. Especially if the sub-query is evaluated once or for every row! On the other side, a modern DB-Server is able to optimize a lot. In some cases a subquery helps optimizing a query, but a newer version of the DB-Server might make the optimization obsolete.

Sub-queries in Select-Fields

SELECT moo, (SELECT roger FROM wilco WHERE moo = me) AS bar FROM foo

Be aware that a sub-query is executed for every resulting row from foo.
Avoid this if possible; it may drastically slow down your query on huge datasets. However, if the sub-query has no reference to foo it can be optimized by the DB-server as static content and could be evaluated only once.

Sub-queries in the Where-statement

SELECT moo FROM foo WHERE bar = (SELECT roger FROM wilco WHERE moo = me)

If you are lucky, the DB optimizes this internally into a JOIN. If not, your query will become very, very slow on huge datasets because it will execute the sub-query for every row in foo, not just the results like in the select-type.

Sub-queries in the Join-statement

SELECT moo, bar 
  FROM foo 
    LEFT JOIN (
      SELECT MIN(bar), me FROM wilco GROUP BY me
    ) ON moo = me

This is interesting. We combine JOIN with a sub-query. And here we get the real strength of sub-queries. Imagine a dataset with millions of rows in wilco but only a few distinct me. Instead of joining against a huge table, we have now a smaller temporary table to join against. This can result in much faster queries depending on database size. You can have the same effect with CREATE TEMPORARY TABLE ... and INSERT INTO ... SELECT ..., which might provide better readability on very complex queries (but can lock datasets in a repeatable read isolation level).

Nested sub-queries

SELECT VARIANCE(moo)
  FROM (
    SELECT moo, CONCAT(roger, wilco) AS bar
      FROM foo
      HAVING bar LIKE 'SpaceQ%'
  ) AS temp_foo
  GROUP BY moo

You can nest sub-queries in multiple levels. This can help on huge datasets if you have to group or change the results. Usually the DB-Server creates a temporary table for this, but sometimes you do not need some operations on the whole table, only on the resultset. This might provide a much better performance depending on the size of the table.

Conclusion

Sub-queries are no replacement for a JOIN and you should not use them like this (although possible). In my humble opinion, the correct use of a sub-query is the use as a quick replacement of CREATE TEMPORARY TABLE .... A good sub-query reduces a dataset in a way you cannot accomplish in an ON statement of a JOIN. If a sub-query has one of the keywords GROUP BY or DISTINCT and is preferably not situated in the select fields or the where statement, then it might improve performance a lot.

Promontory answered 9/4, 2018 at 17:41 Comment(10)
For Sub-queries in the Join-statement: (1) generating derived table from the sub-query itself could take a very long time. (2) the resulting derived table is not indexed. these two alone could significantly slow down the SQL.Galliot
@Galliot I can only speak for MySQL (1) There it is a temporary table similar to a join. The time depends on the amount of data. If you cannot reduce the data with a subquery, use a join. (2) This is right, it depends on the factor you can reduce the data in the temporary table. I had real world cases, where I could reduce the join size form some millions to a few hundred and reducing the query time from multiple seconds (with full index usage) to a quarter of a second with a subquery.Promontory
IMO: (1) such temporary table (derived table) is not materialized, thus each time you run the SQL, the temporary table must be recreated, that could be very costly and a real bottle-neck (i.e. running a group by on millions of records) (2) even if you can reduce the size of temp table to 10 records, since there is no index, that still means potentially to query 9 times more data records than w/o the temp table when JOINing other tables. BTW I had this issue before with my db(MySQL), in my case, using sub-query in SELECT list could be much faster.Galliot
@Galliot I do not doubt that there are a lot of examples, where using a subquery is less optimal. As good practice you should use EXPLAIN on a query before optimizing. With the old set profiling=1 you could easily see, if a temporary table is a bottleneck. And even an index needs processing time, B-Trees optimize querying for records, but a 10 record table can be much faster than an index for millions of records. But it depends on multiple factors like field sizes and types.Promontory
Using EXPLAIN should be enough to see how the DERIVED table(created from sub-query in the FROM list) could impact the query. I use sub-queries a lot at work, just try to mention that sub-queries in the Join-statement is probably not as promising as you think. the reduced records from the temporary table could have a much bigger cost than its gain. Also remember, even if the final JOINs could take less time, the time to scan millions of records in the sub-query remain and need to count for each SQL run.Galliot
@Galliot Perhaps our different experience results in the different domains we do our queries. Sub-queries in the Join-statement are certainly promising, but perhaps I missed the real key constraint which is essential for the huge performance gain I get. Perhaps the aggregation functions like MIN() are the tipping point which result in a slightly different join strategy, though I already state that aggregation is a key factor for using sub queries.Promontory
Good explanation but I'm not agree with the last example. Since GROUP BY and HAVING commands are executed before the ORDER BY command by the optimizer, this sub-query should not improve the performance of the query.Beaubeauchamp
@SimoneG Thanks, you are right. I changed the example. Did not thought on execution ordering on creating the former shortened example. Hopefully the new example is correct. Better ideas are welcome.Promontory
@Promontory Big thanks for your answer. Would you consider to edit the tables and column names? I believe many ppl who are novice at databases (that includes me) would really find it much easier to understand, if it was easy to tell a difference between tables and columns.Panegyrize
As mentioned enough here: it depends heavily on your circumstances. So just try it out. I thought the SELECT ... INNER JOIN (subquery) looked like a really smart idea which then turned out to run twice as long as SELECT ... IN (subquery). Main table ~100,000 records subquery <10 records retrieved from the main table.Bertberta
D
151

Use EXPLAIN to see how your database executes the query on your data. There is a huge "it depends" in this answer...

PostgreSQL can rewrite a subquery to a join or a join to a subquery when it thinks one is faster than the other. It all depends on the data, indexes, correlation, amount of data, query, etc.

Doty answered 5/4, 2010 at 7:37 Comment(4)
this is exactly why postgresql is so good and useful it understands what the goal is and will fix a query based on what it think is better and postgresql is very good at knowing how to look at its dataDenisdenise
heww. I guess no need to re-write tons of queries for me! postgresql for the win.Planck
I've got a left join much slower than sub query, so I don't think it will really do the job.Intrauterine
@Intrauterine it depends on how you have set up your database (e.g. indexes), especially when using mysql. You can mess up a lot with the strategy of the database based on your configuration, that results in such things. You can't assume, that it doesn't do the job based on your 'experience', that had happened to you that one time.Lenwood
U
44

First of all, to compare the two first you should distinguish queries with subqueries to:

  1. a class of subqueries that always have corresponding equivalent query written with joins
  2. a class of subqueries that can not be rewritten using joins

For the first class of queries a good RDBMS will see joins and subqueries as equivalent and will produce same query plans.

These days even mysql does that.

Still, sometimes it does not, but this does not mean that joins will always win - I had cases when using subqueries in mysql improved performance. (For example if there is something preventing mysql planner to correctly estimate the cost and if the planner doesn't see the join-variant and subquery-variant as same then subqueries can outperform the joins by forcing a certain path).

Conclusion is that you should test your queries for both join and subquery variants if you want to be sure which one will perform better.

For the second class the comparison makes no sense as those queries can not be rewritten using joins and in these cases subqueries are natural way to do the required tasks and you should not discriminate against them.

Update answered 28/5, 2010 at 9:33 Comment(1)
can you provide an example of a query written using sub-queries that can not be converted to joins (second class, as you call it)?Linwoodlinz
F
27

I think what has been under-emphasized in the cited answers is the issue of duplicates and problematic results that may arise from specific (use) cases.

(although Marcelo Cantos does mention it)

I will cite the example from Stanford's Lagunita courses on SQL.

Student Table

+------+--------+------+--------+
| sID  | sName  | GPA  | sizeHS |
+------+--------+------+--------+
|  123 | Amy    |  3.9 |   1000 |
|  234 | Bob    |  3.6 |   1500 |
|  345 | Craig  |  3.5 |    500 |
|  456 | Doris  |  3.9 |   1000 |
|  567 | Edward |  2.9 |   2000 |
|  678 | Fay    |  3.8 |    200 |
|  789 | Gary   |  3.4 |    800 |
|  987 | Helen  |  3.7 |    800 |
|  876 | Irene  |  3.9 |    400 |
|  765 | Jay    |  2.9 |   1500 |
|  654 | Amy    |  3.9 |   1000 |
|  543 | Craig  |  3.4 |   2000 |
+------+--------+------+--------+

Apply Table

(applications made to specific universities and majors)

+------+----------+----------------+----------+
| sID  | cName    | major          | decision |
+------+----------+----------------+----------+
|  123 | Stanford | CS             | Y        |
|  123 | Stanford | EE             | N        |
|  123 | Berkeley | CS             | Y        |
|  123 | Cornell  | EE             | Y        |
|  234 | Berkeley | biology        | N        |
|  345 | MIT      | bioengineering | Y        |
|  345 | Cornell  | bioengineering | N        |
|  345 | Cornell  | CS             | Y        |
|  345 | Cornell  | EE             | N        |
|  678 | Stanford | history        | Y        |
|  987 | Stanford | CS             | Y        |
|  987 | Berkeley | CS             | Y        |
|  876 | Stanford | CS             | N        |
|  876 | MIT      | biology        | Y        |
|  876 | MIT      | marine biology | N        |
|  765 | Stanford | history        | Y        |
|  765 | Cornell  | history        | N        |
|  765 | Cornell  | psychology     | Y        |
|  543 | MIT      | CS             | N        |
+------+----------+----------------+----------+

Let's try to find the GPA scores for students that have applied to CS major (regardless of the university)

Using a subquery:

select GPA from Student where sID in (select sID from Apply where major = 'CS');

+------+
| GPA  |
+------+
|  3.9 |
|  3.5 |
|  3.7 |
|  3.9 |
|  3.4 |
+------+

The average value for this resultset is:

select avg(GPA) from Student where sID in (select sID from Apply where major = 'CS');

+--------------------+
| avg(GPA)           |
+--------------------+
| 3.6800000000000006 |
+--------------------+

Using a join:

select GPA from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';

+------+
| GPA  |
+------+
|  3.9 |
|  3.9 |
|  3.5 |
|  3.7 |
|  3.7 |
|  3.9 |
|  3.4 |
+------+

average value for this resultset:

select avg(GPA) from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';

+-------------------+
| avg(GPA)          |
+-------------------+
| 3.714285714285714 |
+-------------------+

It is obvious that the second attempt yields misleading results in our use case, given that it counts duplicates for the computation of the average value. It is also evident that usage of distinct with the join - based statement will not eliminate the problem, given that it will erroneously keep one out of three occurrences of the 3.9 score. The correct case is to account for TWO (2) occurrences of the 3.9 score given that we actually have TWO (2) students with that score that comply with our query criteria.

It seems that in some cases a sub-query is the safest way to go, besides any performance issues.

Frogmouth answered 19/10, 2016 at 10:5 Comment(1)
I think you can't use a sub-query here. This isn't a case where you logically can use either but one gives a wrong answer because of the technical implementation of it. This is a case where you CANNOT use a sub-query because a student not belonging to CS can score 3.9 which is in the IN list of scores. The context of CS is lost once sub-query is executed,which isn't what we want logically. So this isn't a good example where either can be used. Usage of sub-query is conceptually/logically wrong for this use-case even if luckily it gives right result for a different dataset.Castleman
C
22

MSDN Documentation for SQL Server says

Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.

so if you need something like

select * from t1 where exists select * from t2 where t2.parent=t1.id

try to use join instead. In other cases, it makes no difference.

I say: Creating functions for subqueries eliminate the problem of cluttter and allows you to implement additional logic to subqueries. So I recommend creating functions for subqueries whenever possible.

Clutter in code is a big problem and the industry has been working on avoiding it for decades.

Continuate answered 16/11, 2011 at 9:50 Comment(3)
Replacing subqueries with functions is a very bad idea performance-wise in some RDBMS (e.g. Oracle), so I'd recommend just the opposite - use subqueries/joins instead of functions wherever possible.Tyranny
@FrankSchmitt please support your argument with references.Accomplish
There are also cases where you should use a sub query instead of a join even if you check for existence: if you check for NOT EXISTS. A NOT EXISTS wins over a LEFT OUTER JOIN for various reasons: preformance, fail-safety (in case of nulable columns) and readability. sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-joinMeat
D
21

As per my observation like two cases, if a table has less then 100,000 records then the join will work fast.

But in the case that a table has more than 100,000 records then a subquery is best result.

I have one table that has 500,000 records on that I created below query and its result time is like

SELECT * 
FROM crv.workorder_details wd 
inner join  crv.workorder wr on wr.workorder_id = wd.workorder_id;

Result : 13.3 Seconds

select * 
from crv.workorder_details 
where workorder_id in (select workorder_id from crv.workorder)

Result : 1.65 Seconds

Discard answered 10/4, 2017 at 9:59 Comment(6)
I agree, sometimes breaking the query also works, when you have million records, you don't want to use joins because they take for ever. Rather handle it in the code and map in the code is better.Frustum
Tie your joins aren't working fast enough, you might be missing an index. Query Analyzer can be pretty helpful in comparing actual performance.Wallaby
I agree with Ajay Gajera, I have seen this for myself.Frustum
How does it make any sense to compare the performance of two queries which return different results?Prolix
Yes those are different queries but returning same resultCordillera
@anikislamshojib take a closer look: we have * and two tables in the first statement but only one table in the second statement, so I would expect a difference in the column count.Lanfranc
S
17
  • A general rule is that joins are faster in most cases (99%).
  • The more data tables have, the subqueries are slower.
  • The less data tables have, the subqueries have equivalent speed as joins.
  • The subqueries are simpler, easier to understand, and easier to read.
  • Most of the web and app frameworks and their "ORM"s and "Active record"s generate queries with subqueries, because with subqueries are easier to split responsibility, maintain code, etc.
  • For smaller web sites or apps subqueries are OK, but for larger web sites and apps you will often have to rewrite generated queries to join queries, especial if a query uses many subqueries in the query.

Some people say "some RDBMS can rewrite a subquery to a join or a join to a subquery when it thinks one is faster than the other.", but this statement applies to simple cases, surely not for complicated queries with subqueries which actually cause a problems in performance.

Security answered 23/9, 2018 at 10:36 Comment(1)
> but this statement applies to simple cases I understand that it's either a simple case that can be rewritten to "JOIN" by RDBMS, or it's such a complex case that subqueries are appropriate here. :-) Nice point on ORMs. I think this has the greatest impact.Lisalisabet
G
16

Subqueries are generally used to return a single row as an atomic value, though they may be used to compare values against multiple rows with the IN keyword. They are allowed at nearly any meaningful point in a SQL statement, including the target list, the WHERE clause, and so on. A simple sub-query could be used as a search condition. For example, between a pair of tables:

SELECT title 
FROM books 
WHERE author_id = (
    SELECT id 
    FROM authors 
    WHERE last_name = 'Bar' AND first_name = 'Foo'
);

Note that using a normal value operator on the results of a sub-query requires that only one field must be returned. If you're interested in checking for the existence of a single value within a set of other values, use IN:

SELECT title 
FROM books 
WHERE author_id IN (
    SELECT id FROM authors WHERE last_name ~ '^[A-E]'
);

This is obviously different from say a LEFT-JOIN where you just want to join stuff from table A and B even if the join-condition doesn't find any matching record in table B, etc.

If you're just worried about speed you'll have to check with your database and write a good query and see if there's any significant difference in performance.

Groom answered 5/4, 2010 at 6:34 Comment(0)
O
16

Run on a very large database from an old Mambo CMS:

SELECT id, alias
FROM
  mos_categories
WHERE
  id IN (
    SELECT
      DISTINCT catid
    FROM mos_content
  );

0 seconds

SELECT
  DISTINCT mos_content.catid,
  mos_categories.alias
FROM
  mos_content, mos_categories
WHERE
  mos_content.catid = mos_categories.id;

~3 seconds

An EXPLAIN shows that they examine the exact same number of rows, but one takes 3 seconds and one is near instant. Moral of the story? If performance is important (when isn't it?), try it multiple ways and see which one is fastest.

And...

SELECT
  DISTINCT mos_categories.id,
  mos_categories.alias
FROM
  mos_content, mos_categories
WHERE
  mos_content.catid = mos_categories.id;

0 seconds

Again, same results, same number of rows examined. My guess is that DISTINCT mos_content.catid takes far longer to figure out than DISTINCT mos_categories.id does.

Outstay answered 20/10, 2011 at 22:27 Comment(2)
i'd like to know more about what you are trying to point out in the last line "My guess is that DISTINCT mos_content.catid takes far longer to figure out than DISTINCT mos_categories.id does." . Are you saying that an id should be named only id and not named something like catid ? Trying to optimize my db accesses, and your learnings could help.Sidell
using SQL IN in that case is a bad practice and it doesn't prove anything.Accomplish
M
12

MySQL version: 5.5.28-0ubuntu0.12.04.2-log

I was also under the impression that JOIN is always better than a sub-query in MySQL, but EXPLAIN is a better way to make a judgment. Here is an example where sub queries work better than JOINs.

Here is my query with 3 sub-queries:

EXPLAIN SELECT vrl.list_id,vrl.ontology_id,vrl.position,l.name AS list_name, vrlih.position AS previous_position, vrl.moved_date 
FROM `vote-ranked-listory` vrl 
INNER JOIN lists l ON l.list_id = vrl.list_id 
INNER JOIN `vote-ranked-list-item-history` vrlih ON vrl.list_id = vrlih.list_id AND vrl.ontology_id=vrlih.ontology_id AND vrlih.type='PREVIOUS_POSITION' 
INNER JOIN list_burial_state lbs ON lbs.list_id = vrl.list_id AND lbs.burial_score < 0.5 
WHERE vrl.position <= 15 AND l.status='ACTIVE' AND l.is_public=1 AND vrl.ontology_id < 1000000000 
 AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=43) IS NULL 
 AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=55) IS NULL 
 AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=246403) IS NOT NULL 
ORDER BY vrl.moved_date DESC LIMIT 200;

EXPLAIN shows:

+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
| id | select_type        | table    | type   | possible_keys                                       | key          | key_len | ref                                             | rows | Extra                    |
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
|  1 | PRIMARY            | vrl      | index  | PRIMARY                                             | moved_date   | 8       | NULL                                            |  200 | Using where              |
|  1 | PRIMARY            | l        | eq_ref | PRIMARY,status,ispublic,idx_lookup,is_public_status | PRIMARY      | 4       | ranker.vrl.list_id                              |    1 | Using where              |
|  1 | PRIMARY            | vrlih    | eq_ref | PRIMARY                                             | PRIMARY      | 9       | ranker.vrl.list_id,ranker.vrl.ontology_id,const |    1 | Using where              |
|  1 | PRIMARY            | lbs      | eq_ref | PRIMARY,idx_list_burial_state,burial_score          | PRIMARY      | 4       | ranker.vrl.list_id                              |    1 | Using where              |
|  4 | DEPENDENT SUBQUERY | list_tag | ref    | list_tag_key,list_id,tag_id                         | list_tag_key | 9       | ranker.l.list_id,const                          |    1 | Using where; Using index |
|  3 | DEPENDENT SUBQUERY | list_tag | ref    | list_tag_key,list_id,tag_id                         | list_tag_key | 9       | ranker.l.list_id,const                          |    1 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | list_tag | ref    | list_tag_key,list_id,tag_id                         | list_tag_key | 9       | ranker.l.list_id,const                          |    1 | Using where; Using index |
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+

The same query with JOINs is:

EXPLAIN SELECT vrl.list_id,vrl.ontology_id,vrl.position,l.name AS list_name, vrlih.position AS previous_position, vrl.moved_date 
FROM `vote-ranked-listory` vrl 
INNER JOIN lists l ON l.list_id = vrl.list_id 
INNER JOIN `vote-ranked-list-item-history` vrlih ON vrl.list_id = vrlih.list_id AND vrl.ontology_id=vrlih.ontology_id AND vrlih.type='PREVIOUS_POSITION' 
INNER JOIN list_burial_state lbs ON lbs.list_id = vrl.list_id AND lbs.burial_score < 0.5 
LEFT JOIN list_tag lt1 ON lt1.list_id = vrl.list_id AND lt1.tag_id = 43 
LEFT JOIN list_tag lt2 ON lt2.list_id = vrl.list_id AND lt2.tag_id = 55 
INNER JOIN list_tag lt3 ON lt3.list_id = vrl.list_id AND lt3.tag_id = 246403 
WHERE vrl.position <= 15 AND l.status='ACTIVE' AND l.is_public=1 AND vrl.ontology_id < 1000000000 
AND lt1.list_id IS NULL AND lt2.tag_id IS NULL 
ORDER BY vrl.moved_date DESC LIMIT 200;

and the output is:

+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type   | possible_keys                                       | key          | key_len | ref                                         | rows | Extra                                        |
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | lt3   | ref    | list_tag_key,list_id,tag_id                         | tag_id       | 5       | const                                       | 2386 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | l     | eq_ref | PRIMARY,status,ispublic,idx_lookup,is_public_status | PRIMARY      | 4       | ranker.lt3.list_id                          |    1 | Using where                                  |
|  1 | SIMPLE      | vrlih | ref    | PRIMARY                                             | PRIMARY      | 4       | ranker.lt3.list_id                          |  103 | Using where                                  |
|  1 | SIMPLE      | vrl   | ref    | PRIMARY                                             | PRIMARY      | 8       | ranker.lt3.list_id,ranker.vrlih.ontology_id |   65 | Using where                                  |
|  1 | SIMPLE      | lt1   | ref    | list_tag_key,list_id,tag_id                         | list_tag_key | 9       | ranker.lt3.list_id,const                    |    1 | Using where; Using index; Not exists         |
|  1 | SIMPLE      | lbs   | eq_ref | PRIMARY,idx_list_burial_state,burial_score          | PRIMARY      | 4       | ranker.vrl.list_id                          |    1 | Using where                                  |
|  1 | SIMPLE      | lt2   | ref    | list_tag_key,list_id,tag_id                         | list_tag_key | 9       | ranker.lt3.list_id,const                    |    1 | Using where; Using index                     |
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+

A comparison of the rows column tells the difference and the query with JOINs is using Using temporary; Using filesort.

Of course when I run both the queries, the first one is done in 0.02 secs, the second one does not complete even after 1 min, so EXPLAIN explained these queries properly.

If I do not have the INNER JOIN on the list_tag table i.e. if I remove

AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=246403) IS NOT NULL  

from the first query and correspondingly:

INNER JOIN list_tag lt3 ON lt3.list_id = vrl.list_id AND lt3.tag_id = 246403

from the second query, then EXPLAIN returns the same number of rows for both queries and both these queries run equally fast.

Muddleheaded answered 6/6, 2013 at 18:25 Comment(2)
I have similar situation, but with more joins than yours, will try with explain onceNystatin
In Oracle or PostgreSQL I would have tried: AND NOT EXISTS (SELECT 1 FROM list_tag WHERE list_id=l.list_id AND tag_id in (43, 55, 246403))Scarlett
B
10

Subqueries have ability to calculate aggregation functions on a fly. E.g. Find minimal price of the book and get all books which are sold with this price. 1) Using Subqueries:

SELECT titles, price
FROM Books, Orders
WHERE price = 
(SELECT MIN(price)
 FROM Orders) AND (Books.ID=Orders.ID);

2) using JOINs

SELECT MIN(price)
     FROM Orders;
-----------------
2.99

SELECT titles, price
FROM Books b
INNER JOIN  Orders o
ON b.ID = o.ID
WHERE o.price = 2.99;
Bargeboard answered 17/6, 2015 at 7:24 Comment(3)
Another case: multiple GROUP BYs with different tables: #11415784 Subqueries seem to be strictly more general. See also the MySQL man: dev.mysql.com/doc/refman/5.7/en/optimizing-subqueries.html | dev.mysql.com/doc/refman/5.7/en/rewriting-subqueries.htmlHeroine
-1 This is misleading as you're using a subquery and join in both examples. That you've pulled the subquery out into a second query to determine the lowest order price has no effect since the database will do the exact same thing. Plus, you're not rewriting the join using a subquery; both queries use a join. You are correct that subqueries allow aggregate functions, but this example doesn't demonstrate that fact.Behest
I agree with David, and you can use group by to get the minimum price.Frustum
E
4

The difference is only seen when the second joining table has significantly more data than the primary table. I had an experience like below...

We had a users table of one hundred thousand entries and their membership data (friendship) about 3 hundred thousand entries. It was a join statement in order to take friends and their data, but with a great delay. But it was working fine where there was only a small amount of data in the membership table. Once we changed it to use a sub-query it worked fine.

But in the mean time the join queries are working with other tables that have fewer entries than the primary table.

So I think the join and sub query statements are working fine and it depends on the data and the situation.

Elizbeth answered 10/3, 2014 at 10:42 Comment(0)
R
3

These days, many dbs can optimize subqueries and joins. Thus, you just gotto examine your query using explain and see which one is faster. If there is not much difference in performance, I prefer to use subquery as they are simple and easier to understand.

Rutherfordium answered 2/2, 2012 at 18:25 Comment(0)
S
2

I am not a relational database expert, so take this with a grain of salt.

The general idea about sub queries vs joins is the path the evaluation of the larger query takes.

In order to perform the larger query, every individual subquery has to be executed first, and then the resultset is stored as a temporary table that the larger query interacts with.

This temporary table is unindexed, so, any comparison requires scanning the whole resultset.

In contrast, when you use a join, all indexes are in use and so, comparison require traversing index trees (or hash tables), which is way less expensive in terms of speed.

Now, what I don't know if newer versions of the most popular relational engines execute the evaluation on reverse, and just load the necessary elements in the temporary table, as an optimization method.

Standpipe answered 23/8, 2021 at 21:4 Comment(0)
E
1

I just thinking about the same problem, but I am using subquery in the FROM part. I need connect and query from large tables, the "slave" table have 28 million record but the result is only 128 so small result big data! I am using MAX() function on it.

First I am using LEFT JOIN because I think that is the correct way, the mysql can optimalize etc. Second time just for testing, I rewrite to sub-select against the JOIN.

LEFT JOIN runtime: 1.12s SUB-SELECT runtime: 0.06s

18 times faster the subselect than the join! Just in the chokito adv. The subselect looks terrible but the result ...

Emaemaciate answered 23/9, 2018 at 9:4 Comment(0)
H
1

It depends on several factors, including the specific query you're running, the amount of data in your database. Subquery runs the internal queries first and then from the result set again filter out the actual results. Whereas in join runs the and produces the result in one go.

The best strategy is that you should test both the join solution and the subquery solution to get the optimized solution.

Hacienda answered 12/9, 2020 at 19:15 Comment(0)
E
-2

If you want to speed up your query using join:

For "inner join/join", Don't use where condition instead use it in "ON" condition. Eg:

     select id,name from table1 a  
   join table2 b on a.name=b.name
   where id='123'

 Try,

    select id,name from table1 a  
   join table2 b on a.name=b.name and a.id='123'

For "Left/Right Join", Don't use in "ON" condition, Because if you use left/right join it will get all rows for any one table.So, No use of using it in "On". So, Try to use "Where" condition

Endoderm answered 24/4, 2019 at 12:25 Comment(1)
This depends on the SQL server and on the complexity of the query. A lot of SQL implementations would optimize simple queries like this for the best performance. Perhaps provide an example server name and version where this behavior happens to improve the answer?Promontory

© 2022 - 2024 — McMap. All rights reserved.