How costly are JOINs in SQL? And/or, what's the trade-off between performance and normalization?

I've found a similar thread but it doesn't really capture the essence of what I'm trying to ask - so I've created a new thread.

I know there is a trade-off between normalization and performance, and I'm wondering what's the best practice for drawing that line? In my particular situation, I have a messaging system that has three distinct tables: messages_threads (overarching message holder), messages_recipients (who is involved), and messages_messages (the actual messages + timestamps).

In order to return the "inbox" view, I have to left join the messages_threads, users, and pictures tables to the messages_recipients table to get the information that populates the view (profile picture, sender name, thread id)... and I've still got to add a join to messages_messages to retrieve the text of the last message, so I can display a "preview" of it to the user.
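
For concreteness, here's roughly what that inbox query looks like (a sketch - any column not named above, like sender_id, body, or created_at, is a stand-in, not my actual schema):

    -- Hypothetical inbox query; columns such as sender_id, body and
    -- created_at are assumed for illustration.
    SELECT t.id   AS thread_id,
           u.name AS sender_name,
           p.url  AS profile_picture,
           m.body AS last_message_preview
    FROM messages_recipients r
    LEFT JOIN messages_threads  t ON t.id = r.thread_id
    LEFT JOIN users             u ON u.id = t.sender_id
    LEFT JOIN pictures          p ON p.user_id = u.id
    LEFT JOIN messages_messages m ON m.id =
        (SELECT id FROM messages_messages
         WHERE thread_id = t.id
         ORDER BY created_at DESC
         LIMIT 1)
    WHERE r.user_id = ?;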

My question is: how costly are JOINs to performance in SQL? I could, for instance, store the sender's name (which I currently have to left join from users to retrieve) in a field on the messages_threads table called "sendername" - but in terms of normalization I've always been taught to avoid that kind of data redundancy.
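
(For what it's worth, the denormalised version would just be an extra column - a sketch, with the type chosen arbitrarily:)

    -- Hypothetical denormalised column holding a copy of users.name.
    ALTER TABLE messages_threads ADD COLUMN sendername VARCHAR(255);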

Where do you draw the line? Or am I overestimating how performance-hampering SQL joins are?

Saxony asked 24/4, 2011 at 22:19 Comment(0)

The best practice is to always start with 3NF, and then only consider denormalisation if you find a specific performance problem.

Performance is just one of the issues you have to deal with in databases. By duplicating data, you run the risk of allowing inconsistent data into your database, thus nullifying one of the core principles of relational databases: consistency (the C in ACID) (a).

Yes, joins have a cost, there's no getting around that. However, the cost is usually a lot less than you'd think, and can often be swamped by other factors like network transmission times. By making sure the relevant columns are indexed properly, you can avoid a lot of those costs.
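
For example, making sure the columns the joins match and filter on are indexed is usually the first step (a sketch - the index and column names are illustrative, not prescriptive):

    -- Illustrative only: index the columns the joins match and filter on.
    CREATE INDEX idx_recipients_user   ON messages_recipients (user_id);
    CREATE INDEX idx_recipients_thread ON messages_recipients (thread_id);
    CREATE INDEX idx_messages_thread   ON messages_messages (thread_id, created_at);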

And remember the optimisation mantra: measure, don't guess! And measure in a production-like environment. And keep measuring (and tuning) periodically - optimisation is only a set-and-forget operation if your schema and data never change (very unlikely).


a) Reverting normalisation for performance can usually be made safe by using triggers to maintain consistency. This will, of course, slow down your updates, but may still let your selects run faster.
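
For instance, if you duplicated the sender's name into messages_threads as suggested in the question, a trigger along these lines could keep the copy consistent (a MySQL-flavoured sketch; it assumes a sendername column and a sender_id foreign key exist):

    -- Sketch: keep the denormalised messages_threads.sendername column
    -- in sync when a user changes their name. Names are assumptions.
    DELIMITER //
    CREATE TRIGGER trg_users_name_sync
    AFTER UPDATE ON users
    FOR EACH ROW
    BEGIN
        IF NEW.name <> OLD.name THEN
            UPDATE messages_threads
            SET sendername = NEW.name
            WHERE sender_id = NEW.id;
        END IF;
    END//
    DELIMITER ;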

Renfred answered 24/4, 2011 at 22:33 Comment(2)
Thanks pax, you're right - I should stick to ACID. Thanks for clearing that up for me; I read an article about bigger sites denormalizing and started to question my structure. - Saxony
@Walker, denormalisation is sometimes a viable option. You just have to ensure it's going to help more than hinder :-) As with most of life, there are trade-offs. - Renfred

I wouldn't worry that much about an extra join. In my experience, the big performance loss from joins comes when you're joining large data sets. Presumably, your messages view will display 20-100 rows tops.

One thing, though - if you don't need a left join, just use a regular join. A left join can take a surprisingly significant amount of extra time compared to a regular join.
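
To illustrate the difference (whether it actually matters for you is worth benchmarking - see below; the table names follow the question, the subject column is made up):

    -- LEFT JOIN keeps recipient rows even when no matching thread exists:
    SELECT r.user_id, t.subject
    FROM messages_recipients r
    LEFT JOIN messages_threads t ON t.id = r.thread_id;

    -- If a matching thread is guaranteed (e.g. by a foreign key), a plain
    -- JOIN returns the same rows and gives the optimiser more freedom:
    SELECT r.user_id, t.subject
    FROM messages_recipients r
    JOIN messages_threads t ON t.id = r.thread_id;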

If you're really curious, you can set up a benchmark. phpMyAdmin tells you how long a query took to run; you can check the time with and without the final join. (Just bear in mind that phpMyAdmin applies a LIMIT to SELECT queries by default, so the real query can take longer than the displayed time if it returns more rows than that limit.)

Flong answered 24/4, 2011 at 22:30 Comment(3)
phpMyAdmin is not a tool suited for this kind of work. You should use a desktop tool like EMS MySQL Manager, Navicat for MySQL or some other tool. - Cubiform
If you need data from two large datasets, JOINs are typically the most efficient way to get it. - Triciatrick
I have Sequel Pro, which works quite well for testing performance - I had no idea LEFT JOIN was more costly than JOIN, though - I'll definitely have to go back and replace a lot of LEFT JOIN statements throughout my code. - Saxony

There is no simple answer to that question. Join costs vary greatly depending on available indexes, the number of records, and many other factors. AFAIR, MySQL has at least a couple of join strategies, ranked from best-case to worst-case scenario.

In practice, you need to design the schema according to the general rules of data safety - so do normalize your database where it's needed.

Denormalization should happen only if you have a real performance problem and there is no other way to solve it (e.g. adding an index, changing server parameters, rewriting the query, ...), and it should be based on a deep analysis of the problem.

Cubiform answered 24/4, 2011 at 22:32 Comment(0)

This is one of the ideal use cases for denormalization. None of the data is going to change after the initial message is sent: the sender, the receiver, and the message will stay the same; only new messages will be appended, and potentially there could be 10k of them. Simple benchmarks can show something like a 4x performance improvement while sacrificing a little storage. Data integrity will not be an issue, and no triggers are necessary.

Uniform answered 21/7, 2023 at 12:54 Comment(0)

From my experience, the impact of an extra JOIN clause in a query is generally not going to make or break the application. Indexing, avoiding subqueries, and sometimes avoiding LEFT JOIN statements will make the biggest impact.

As Sam Dufel mentions, set up a benchmark to see if the LEFT JOIN you're using should be worked around. It might also be useful to generate a bunch of dummy data to see how it scales as the number of records in the JOIN increases.
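
A quick way to generate that dummy data in MySQL is to re-insert a table's rows back into itself, roughly doubling the row count on each run (a rough sketch; the column names are assumed from the question):

    -- Rough sketch: inflate the messages table for scaling tests by
    -- re-inserting its own rows; run repeatedly to double the count.
    INSERT INTO messages_messages (thread_id, sender_id, body, created_at)
    SELECT thread_id, sender_id, body, created_at
    FROM messages_messages;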

Ghost answered 24/4, 2011 at 22:36 Comment(0)

Joins are a strategy for improving the efficiency of a query. And contrary to another response, outer joins are just as efficient as inner joins in every product I've had a chance to test, which includes MySQL (both major engines), SQL Server, Sybase, and Oracle.

What you should avoid are subqueries (primarily correlated subqueries), which are commonly used as an alternative.
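
As an illustration, a correlated subquery like the first form below can usually be rewritten as a join (hypothetical tables; some optimisers do this rewrite for you, many don't):

    -- Correlated subquery: may be re-evaluated once per outer row.
    SELECT u.name,
           (SELECT COUNT(*) FROM messages_recipients r
            WHERE r.user_id = u.id) AS thread_count
    FROM users u;

    -- Equivalent join + GROUP BY, usually easier for the engine to plan.
    SELECT u.name, COUNT(r.user_id) AS thread_count
    FROM users u
    LEFT JOIN messages_recipients r ON r.user_id = u.id
    GROUP BY u.id, u.name;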

Triciatrick answered 24/4, 2011 at 22:37 Comment(0)

ALWAYS ALWAYS prefer normalization. It is appalling to me that denormalization STILL gets this kind of attention.

NORMALIZE - that's what database engines are tuned for.

Nancinancie answered 24/4, 2011 at 22:42 Comment(2)
Thanks Randy, the only reason I questioned it is reading about the denormalization of Twitter. - Saxony
Yes - and I probably overreacted. But you should really not even consider it until you prove that you are having some issue with your properly normalized system. - Nancinancie

It's not possible, or useful, to answer a question about how costly joins are.

A join is just a clause in the SQL query; what the database does with that join is something completely different. What's expensive in a query are things like table scans, where the database has to read an entire table to locate some data. A query with ten joins on tables with useful indexes can be much faster than a query on a single table without any useful indexes.
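
In MySQL you can see which of the two you're getting with EXPLAIN (the query below is illustrative):

    -- type=ALL in the output means a full table scan;
    -- type=ref/eq_ref means an index is being used for the join.
    EXPLAIN
    SELECT r.thread_id
    FROM messages_recipients r
    JOIN messages_threads t ON t.id = r.thread_id
    WHERE r.user_id = 42;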

Three or four joins in a query are certainly no reason to de-normalise the tables to try to improve performance. As a comparison: for our web site we use a de-normalised table to read from, because we would need about 40 joins to gather the data that we need.

Khichabia answered 24/4, 2011 at 22:45 Comment(0)
